GoogleScraper – Scraping search engines professionally
GoogleScraper parses Google search engine results (and those of many other search engines) easily and quickly. It lets you programmatically extract all found links together with their titles and descriptions, so you can process the scraped data further.
There are unlimited usage scenarios:
- Quickly harvest masses of Google dorks.
- Use it as an SEO tool.
- Discover trends.
- Compile lists of sites to feed your own database.
- Many more use cases…
- Quite easily extendable, since the code is well documented.
First of all, you need to understand that GoogleScraper uses two completely different scraping approaches:
- Scraping with low-level HTTP libraries such as urllib.request or requests. This approach simulates the HTTP packets sent by real browsers.
- Scraping by controlling a real browser with the Selenium framework.
The former approach was implemented first, but the latter looks much more promising in comparison, because search engines have no easy way of detecting it.
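To illustrate the first approach, here is a minimal sketch of how a low-level request can be made to look like it came from a real browser. The function name `build_search_request` and the specific header values are illustrative assumptions, not part of GoogleScraper's actual API:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(query: str, page: int = 0) -> Request:
    """Build an HTTP request that mimics what a real browser would send.

    This is a sketch: a real scraper would also manage cookies,
    referrers, and other headers that browsers set.
    """
    params = urlencode({"q": query, "start": page * 10})
    # A browser-like User-Agent is essential; the default library
    # User-Agent is detected and blocked almost immediately.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(f"https://www.google.com/search?{params}", headers=headers)


# Build (but do not send) a request for the second result page.
request = build_search_request("site:example.com", page=1)
```

Sending the request and parsing the returned HTML for links, titles, and descriptions would be the next step; the Selenium approach skips this manual header work entirely because a real browser generates the traffic.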
GoogleScraper is implemented with the following techniques/software:
- Written in Python 3.7
- Uses multithreading/asynchronous IO.
- Supports parallel scraping with multiple IP addresses.
- Provides proxy support using socksipy and built-in browser proxies.
- Supports alternative search modes like news/image/video search.
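The multithreaded scraping mentioned above can be sketched with Python's standard thread pool. The `scrape_keyword` stub below is a hypothetical placeholder for the real fetch-and-parse step, not GoogleScraper's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_keyword(keyword: str) -> dict:
    """Placeholder for a real fetch-and-parse step.

    In a real scraper this would issue the HTTP request (or drive a
    browser) and parse links, titles, and descriptions from the result
    page. Here it just returns an empty result structure.
    """
    return {"keyword": keyword, "links": []}


keywords = ["python scraping", "seo tools", "google dorks"]

# Scrape several keywords concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(scrape_keyword, keywords))
```

When each worker is additionally bound to a different proxy, this is how scraping with multiple IP addresses in parallel can be achieved.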
Which search engines are supported?
Currently, the following search engines are supported:
How does GoogleScraper maximize the amount of extracted information per IP address?
Scraping is a critical and highly complex subject. Google and other search engine giants have a strong incentive to make a scraper's life as hard as possible. There are several ways for search engine providers to detect that a robot is using their search engine:

- The User-Agent header is not one that a real browser sends.
- The search parameters are not identical to the ones a browser operated by a human would set.
- Robots have a strict request pattern (very fast requests, with no random pauses between the sent packets).
- Dorks are used heavily.
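The detection vectors above suggest the obvious countermeasures: rotate User-Agents and randomize the time between requests. A minimal sketch of both, assuming a hand-maintained pool of User-Agent strings (the function names and default values are illustrative, not GoogleScraper's API):

```python
import random

# Hypothetical pool of real browser User-Agent strings; in practice you
# would maintain a larger, regularly updated list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
]


def random_user_agent() -> str:
    """Pick a User-Agent at random so requests do not all look identical."""
    return random.choice(USER_AGENTS)


def human_like_delay(base: float = 5.0, jitter: float = 10.0) -> float:
    """Compute a randomized pause (in seconds) to break the strict,
    machine-like request pattern that search engines look for."""
    return base + random.uniform(0, jitter)


# Between two requests, a scraper would sleep for a human-like interval,
# e.g. time.sleep(human_like_delay()).
```

Combined with multiple IP addresses via proxies, these measures are how a scraper maximizes the amount of information it can extract per IP address before being blocked.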
Copyright (C) 2018 Nikolai Tschacher