Programmatically logging in with a Python web crawler
I am going to make a web crawler that will log in to a school website using my credentials and then crawl certain parts of the site. I am using the Beautiful Soup Python library.
Packet capture & analysis of the HTTP GET & POST requests
When I first started analyzing my own packet captures, I ran into some problems. My analysis of the login flow was not rigorous, so the simulated login kept failing, and a senior classmate helped me re-analyze it. The real login flow requires two POST requests. Before this I had used the urllib library to simulate the login; following their advice, I switched to requests, a higher-level library than urllib.
The first POST request
POST /renzheng.jsp HTTP/1.1
Host: target.com
Content-Length: 142
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
displayName=&displayPasswd=&select=2&submit.x=36&submit.y=14&operType=911&random_form=-1048366953725273893&userName=admin&passwd=ddos
The form body contains the following fields:
- displayName, displayPasswd
- select, submit.x, submit.y
- operType, random_form
- userName, passwd
The second POST request
POST /servlet/adminservlet HTTP/1.1
Host: target.com
Content-Length: 65
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/admin.jsp
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
isValidate=false&userName=admin&passwd=ddos&operType=911
We can see that:
- userName, passwd: these two parameters are the account name and password, transmitted in clear text
- isValidate: defaults to false
- operType: defaults to 911
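To reproduce this two-step login, both POSTs must go through the same session so that the JSESSIONID cookie set by the first request is also sent with the second one. Below is a minimal sketch using the requests library; the host name and field values are placeholders taken from the capture above.

import requests

session = requests.Session()  # the JSESSIONID cookie is stored here and reused

# first POST: /renzheng.jsp with the full form body from the capture
first_form = {
    'displayName': '', 'displayPasswd': '', 'select': '2',
    'submit.x': '36', 'submit.y': '14', 'operType': '911',
    'random_form': '-1048366953725273893',
    'userName': 'admin', 'passwd': 'ddos',
}
session.post('http://target.com/renzheng.jsp', data=first_form)

# second POST: /servlet/adminservlet, sent with the same session cookie
second_form = {'isValidate': 'false', 'userName': 'admin',
               'passwd': 'ddos', 'operType': '911'}
r = session.post('http://target.com/servlet/adminservlet', data=second_form)
print(r.status_code)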
The HTTP GET request
GET /student/studentInfo.jsp?userName=xxxx&passwd=xxxxx HTTP/1.1
Host: target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/servlet/adminservlet
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
We can see that:
- userName, passwd: again the account name and password, transmitted in clear text as query parameters
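Once the login POSTs have succeeded, the student page can be fetched with a plain GET; requests builds the query string from the params dictionary. A short sketch, with placeholder credential values as in the capture:

import requests

session = requests.Session()  # in practice, reuse the session that performed the login POSTs
params = {'userName': 'xxxx', 'passwd': 'xxxxx'}  # placeholders, as in the capture
r = session.get('http://target.com/student/studentInfo.jsp', params=params)
html = r.content  # raw page bytes, ready to be handed to BeautifulSoup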
Now we can start writing our Python script.
- Constructing the request packets
While writing the Python code, I tried to follow an object-oriented approach: the relevant variables are defined as private attributes, and the login simulation uses the requests module.
Requests library
Requests is a Python HTTP library that provides a number of HTTP-related methods; we can run dir(requests) to list them. In this script we use Session, get, post, and the content attribute of the response.
Session
The Session object allows you to persist certain parameters across requests. It also keeps cookies across all requests made by the same Session instance:
>>> s = requests.Session()
>>> r = s.get("http://target.com/")
GET submission method
Request a URL with the GET method:
>>> r = requests.get("http://target.com/", proxies=proxies, timeout=0.001, params=payload)
params - pass query-string parameters with a GET request:
r = requests.get("http://httpbin.org/get", params=payload)
proxies - if you need to use a proxy, you can configure an individual request by passing proxies to any request method.
timeout - requests will stop waiting for a response after the number of seconds set with the timeout parameter:
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "", line 1, in
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
POST submission method
Request a URL with the POST method:
requests.post("http://target.com", headers=headers, data=data)
headers - here we only need to include the common parts:
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://target.com/',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US;q=0.8'
data - the information submitted with the POST:
'displayName': '',
'displayPasswd': '',
'select': '2',
'submit.x': '43',
'submit.y': '12',
'operType': '911',
'random_form': '5129319019753764987',
'userName': '',
'passwd': ''
The request part of the code:
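The block below is a minimal sketch of this part, assuming the header and data dictionaries listed above and the two endpoints from the capture; the class and method names are illustrative, not the original ones.

import requests

class LoginRequest:
    def __init__(self, headers, login_data):
        # private attributes, following the object-oriented style described earlier
        self.__headers = headers        # the common header dictionary above
        self.__login_data = login_data  # the form-data dictionary above, with real credentials filled in
        self.__session = requests.Session()

    def login(self):
        # first POST: renzheng.jsp with the full form body
        self.__session.post('http://target.com/renzheng.jsp',
                            headers=self.__headers, data=self.__login_data)
        # second POST: adminservlet with the short body, reusing the session cookie
        short_data = {'isValidate': 'false',
                      'userName': self.__login_data['userName'],
                      'passwd': self.__login_data['passwd'],
                      'operType': '911'}
        return self.__session.post('http://target.com/servlet/adminservlet',
                                   headers=self.__headers, data=short_data)

Calling login() returns the response of the second POST, and the session keeps the JSESSIONID cookie for the later GET.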
BeautifulSoup library
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working through your favourite parser, it lets you navigate, search, and modify the parse tree. You can also use dir() to inspect what the BeautifulSoup object provides.
Parsing a document as XML
By default, Beautiful Soup parses a document as HTML. If you want to parse it as XML, pass "xml" as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
find_all()
The find_all() method searches the document and returns all tags that match the given condition as a list. In practice, the information we need sits inside a table tag; since find_all() returns a list, we use an index to locate the relevant tr row and then loop over its td cells to print their contents. The code for looping over the table is shown below.
The BeautifulSoup part of the code:
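A minimal sketch of this loop, using a tiny stand-in table instead of the real page; the row index is an assumption and has to match where the student information actually sits in the real page.

from bs4 import BeautifulSoup

markup = '<table><tr><th>name</th><th>year</th></tr><tr><td>Alice</td><td>2017</td></tr></table>'
soup = BeautifulSoup(markup, 'html.parser')

rows = soup.find_all('tr')        # find_all() returns a list of matching tags
row = rows[1]                     # index into the list to reach the row we need
for cell in row.find_all('td'):   # loop over the td cells of that row
    print(cell.get_text(strip=True))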
Exception handling
The try/except statement is used to detect errors in the try block; the except clause catches the exception information and handles it.
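For example, the network calls can be wrapped like this (a generic sketch, not the original code):

import requests

try:
    r = requests.get('http://target.com/', timeout=5)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
except requests.exceptions.RequestException as exc:
    # Timeout, ConnectionError and HTTPError all inherit from RequestException
    print('request failed:', exc)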
- The complete Python code
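A compact end-to-end sketch tying the pieces above together; the host name, form values, and class/method names are assumptions based on the captures, not the original script.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://target.com/',
}

class SchoolCrawler:
    def __init__(self, username, password):
        self.__username = username
        self.__password = password
        self.__session = requests.Session()  # keeps the JSESSIONID cookie across requests

    def login(self):
        # first POST: renzheng.jsp with the full form body
        first_form = {
            'displayName': '', 'displayPasswd': '', 'select': '2',
            'submit.x': '43', 'submit.y': '12', 'operType': '911',
            'random_form': '5129319019753764987',
            'userName': self.__username, 'passwd': self.__password,
        }
        self.__session.post('http://target.com/renzheng.jsp',
                            headers=HEADERS, data=first_form)
        # second POST: adminservlet with the short body
        second_form = {'isValidate': 'false', 'userName': self.__username,
                       'passwd': self.__password, 'operType': '911'}
        self.__session.post('http://target.com/servlet/adminservlet',
                            headers=HEADERS, data=second_form)

    def print_student_info(self):
        # GET the student page and loop over the table cells
        params = {'userName': self.__username, 'passwd': self.__password}
        r = self.__session.get('http://target.com/student/studentInfo.jsp',
                               params=params, headers=HEADERS)
        soup = BeautifulSoup(r.content, 'html.parser')
        for row in soup.find_all('tr'):
            for cell in row.find_all('td'):
                print(cell.get_text(strip=True))

if __name__ == '__main__':
    try:
        crawler = SchoolCrawler('admin', 'ddos')
        crawler.login()
        crawler.print_student_info()
    except requests.exceptions.RequestException as exc:
        print('request failed:', exc)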