[Python for PenTester] How to create automatic Web crawling with Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Installing Scrapy
pip install Scrapy
Spider crawling process
- Crawling starts by generating Request objects for the initial URLs and setting a callback function. When a request finishes downloading, a Response is generated and passed as a parameter to that callback.
- The initial requests are obtained by calling start_requests(), which by default reads the URLs in start_urls and generates a Request for each of them, with the parse() method as the callback.
- In the callback function, you parse the returned (web) content and return Item objects, Request objects, or an iterable containing both. Any returned Request objects are then processed by Scrapy, which downloads the corresponding content and calls the callback set on them (possibly the same function).
- In the callback function, you can use a selector (Selector, BeautifulSoup, lxml, etc.) to parse the web content and generate items from the extracted data.
- Finally, the items returned by the spider are typically saved to a database (for example through an Item Pipeline) or written to a file.
Example: Spider
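Below is a minimal sketch of a basic Spider that follows the cycle described above. The domain example.com, the spider name, and the CSS selectors are illustrative assumptions, not taken from a real target.

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider name and target domain, for illustration only.
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Extract data from the downloaded page with Scrapy selectors.
        for link in response.css("a"):
            yield {
                "text": link.css("::text").get(),
                "url": link.attrib.get("href"),
            }

        # Follow links to keep crawling; Scrapy downloads each new page
        # and calls parse() again with its response.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Assuming the spider is placed inside a Scrapy project, it can be run with something like: scrapy crawl example -o items.json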
Example: CrawlSpider
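A CrawlSpider adds Rule/LinkExtractor pairs that tell Scrapy which links to follow automatically, so you only write callbacks for the pages you want to scrape. The sketch below again assumes the hypothetical domain example.com and illustrative URL patterns (/category/ and /item/).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleCrawlSpider(CrawlSpider):
    # Hypothetical name, domain, and URL patterns, for illustration only.
    name = "example_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        # Follow category pages and keep crawling them (no callback).
        Rule(LinkExtractor(allow=r"/category/")),
        # Pass item pages to parse_item() for extraction.
        Rule(LinkExtractor(allow=r"/item/"), callback="parse_item"),
    )

    def parse_item(self, response):
        # Extract fields from each matched page.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

Note that a CrawlSpider must not override parse(), since CrawlSpider uses it internally to apply the rules.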