[Python for PenTester] How to create automatic Web crawling with Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Installing Scrapy
pip install Scrapy
Spider crawling process
- Crawling starts by generating Request objects for the initial URLs and setting a callback function. When a request finishes downloading, a Response is generated and passed as a parameter to that callback.
- The initial requests are obtained by calling start_requests(), which by default reads the URLs in start_urls and generates a Request for each of them, with the parse() method as the callback.
- In the callback function, you parse the returned (web) content and return Item objects, Request objects, or an iterable containing both. Any returned Request objects are then processed by Scrapy, which downloads the corresponding content and calls the callback set on them (possibly the same function).
- In the callback function, you can use a selector (Selector, BeautifulSoup, lxml, etc.) to parse the web content and generate items from the extracted data.
- Finally, the items returned by the spider are typically saved to a database (for example through an Item Pipeline) or written to a file.
Example: Spider
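Below is a minimal sketch of a basic Spider that follows the cycle described above. The domain example.com, the spider name, and the CSS selectors are illustrative assumptions, not taken from a real target.

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider name and target domain, for illustration only.
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Extract data from the downloaded page with Scrapy selectors.
        for link in response.css("a"):
            yield {
                "text": link.css("::text").get(),
                "url": link.attrib.get("href"),
            }

        # Follow links to keep crawling; Scrapy downloads each new page
        # and calls parse() again with its response.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Assuming the spider is placed inside a Scrapy project, it can be run with something like: scrapy crawl example -o items.json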
Example: CrawlSpider
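A CrawlSpider adds Rule/LinkExtractor pairs that tell Scrapy which links to follow automatically, so you only write callbacks for the pages you want to scrape. The sketch below again assumes the hypothetical domain example.com and illustrative URL patterns (/category/ and /item/).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleCrawlSpider(CrawlSpider):
    # Hypothetical name, domain, and URL patterns, for illustration only.
    name = "example_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        # Follow category pages and keep crawling them (no callback).
        Rule(LinkExtractor(allow=r"/category/")),
        # Pass item pages to parse_item() for extraction.
        Rule(LinkExtractor(allow=r"/item/"), callback="parse_item"),
    )

    def parse_item(self, response):
        # Extract fields from each matched page.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

Note that a CrawlSpider must not override parse(), since CrawlSpider uses it internally to apply the rules.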