GoogleScraper – Scraping search engines professionally
GoogleScraper parses Google search engine results (and those of many other search engines) quickly and easily. It lets you extract all found links, together with their titles and descriptions, programmatically, so that you can process the scraped data further.
There are many usage scenarios:
- Quickly harvest masses of Google dorks.
- Use it as an SEO tool.
- Discover trends.
- Compile lists of sites to feed your own database.
- Many more use cases…
- Easy to extend, since the code is well documented.
First of all, you need to understand that GoogleScraper uses two completely different scraping approaches:
- Scraping with low-level HTTP libraries such as the urllib.request or requests modules. This simulates the HTTP packets sent by real browsers.
- Scraping by controlling a real browser with the Selenium framework.
Whereas the former approach was implemented first, the latter approach looks much more promising in comparison, because search engines have no easy way of detecting it.
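The first approach can be illustrated with a minimal sketch that uses only the standard library. The header values, the build_search_request helper, the parameter names q and start, and the example.com URL are illustrative assumptions, not GoogleScraper's actual code. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical browser-like headers. Real browsers send more headers,
# and the exact values change with every browser release.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_search_request(base_url, query, page=1, results_per_page=10):
    """Build (but do not send) a GET request that resembles a browser search."""
    # The parameter names "q" and "start" are assumptions for this sketch.
    params = {"q": query, "start": (page - 1) * results_per_page}
    url = base_url + "?" + urlencode(params)
    return Request(url, headers=BROWSER_HEADERS)


request = build_search_request("https://www.example.com/search", "python scraping", page=2)
```

Passing the resulting object to urllib.request.urlopen would send it. A search engine can still tell such a client from a browser, because nothing here executes JavaScript or loads images and stylesheets.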
GoogleScraper is implemented with the following techniques/software:
- Written in Python 3.7
- Uses multithreading/asynchronous IO.
- Supports parallel scraping with multiple IP addresses.
- Provides proxy support using socksipy and built-in browser proxies:
  - Socks5
  - Socks4
  - HttpProxy
- Support for alternative search modes like news/image/video search.
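As a rough sketch of how scraping with multiple IP addresses might be organised, the following splits a keyword list evenly across a set of proxies. The URL-style proxy notation, the Proxy tuple, and the parse_proxy and assign_keywords helpers are hypothetical and are not taken from GoogleScraper itself.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical proxy description used only in this sketch.
Proxy = namedtuple("Proxy", "scheme host port")

# Lower-case scheme names standing in for Socks5, Socks4 and HttpProxy.
SUPPORTED_SCHEMES = {"socks5", "socks4", "http"}


def parse_proxy(spec):
    """Parse a string such as 'socks5://127.0.0.1:9050' into a Proxy tuple."""
    parts = urlsplit(spec)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported proxy type: %r" % parts.scheme)
    if parts.hostname is None or parts.port is None:
        raise ValueError("proxy needs a host and a port: %r" % spec)
    return Proxy(parts.scheme, parts.hostname, parts.port)


def assign_keywords(keywords, proxies):
    """Distribute keywords round-robin so every proxy gets a similar load."""
    jobs = {proxy: [] for proxy in proxies}
    for index, keyword in enumerate(keywords):
        jobs[proxies[index % len(proxies)]].append(keyword)
    return jobs


proxies = [parse_proxy("socks5://127.0.0.1:9050"), parse_proxy("http://10.0.0.2:3128")]
jobs = assign_keywords(["apple", "banana", "cherry"], proxies)
```

Each proxy then handles only its own share of the keywords, which keeps the number of requests per IP address low.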
Which search engines are supported?
Currently, the following search engines are supported:
- Bing
- Yahoo
- Yandex
- Baidu
- Duckduckgo
How does GoogleScraper maximize the amount of extracted information per IP address?
Scraping is a critical and highly complex subject. Google and other search engine giants have a strong inclination to make the scraper's life as hard as possible. There are several ways for search engine providers to detect that a robot is using their search engine:
- The User-Agent is not that of a browser.
- The search parameters differ from the ones set by a browser that a human uses.
- JavaScript generates challenges dynamically on the client side. These might include heuristics that try to detect human behaviour. Example: only humans move the mouse and hover over the interesting search results.
- Robots follow a rigid request pattern (very fast requests, with no random delay between them).
- Dorks are heavily used.
- No pictures, ads, CSS, or JavaScript are loaded (as a browser normally would), so certain JavaScript events are never triggered.
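Two of the signals above, the non-browser User-Agent and the rigid request pattern, are the easiest to soften. The sketch below shows one possible way to do so. The User-Agent strings, the delay bounds, and the pick_user_agent and paced helpers are assumptions made for illustration, not GoogleScraper's real settings.

```python
import random
import time

# Hypothetical pool of browser User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
]


def pick_user_agent(rng=random):
    """Choose a browser-like User-Agent at random for the next request."""
    return rng.choice(USER_AGENTS)


def paced(items, min_seconds=2.0, max_seconds=8.0, sleep=time.sleep, rng=random):
    """Yield items one by one, pausing a random interval between them."""
    for index, item in enumerate(items):
        if index > 0:
            # A random pause breaks up the regular timing typical of robots.
            sleep(rng.uniform(min_seconds, max_seconds))
        yield item


# Usage: the sleep function is replaced by a recorder so the example runs instantly.
recorded_delays = []
visited = list(paced(["query one", "query two", "query three"],
                     min_seconds=1.0, max_seconds=2.0,
                     sleep=recorded_delays.append))
```

This does nothing about the JavaScript challenges or the missing page resources, which only a real browser handles.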
So the biggest hurdle to tackle is the JavaScript detection algorithms. I don't know what Google does in its JavaScript, but I will investigate it further soon and then decide whether it would be better to change strategies and switch to an approach that scrapes by simulating browsers in a browser-like environment that can execute JavaScript. The networking of each of these virtual browsers is proxied and manipulated so that it behaves like a real, physical user agent. I am fairly sure that it must be possible to run 20 such browser sessions in parallel without straining resources too much. The real problem is, as always, the lack of good proxies…
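Running a fixed number of sessions in parallel could be sketched with a thread pool, as below. The scrape_in_session function is a placeholder that returns an empty result; in a real setup it would drive its own browser behind its own proxy. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_in_session(keyword):
    """Placeholder for one browser session; it performs no network access."""
    # A real session would open a browser, run the search and parse the links.
    return {"keyword": keyword, "links": []}


def scrape_parallel(keywords, max_sessions=20):
    """Run at most max_sessions sessions at once, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        return list(pool.map(scrape_in_session, keywords))


results = scrape_parallel(["apple", "banana", "cherry"], max_sessions=2)
```

Threads are sufficient in this sketch because each session would spend most of its time waiting for the browser and the network rather than computing.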
Install & Use
Copyright (C) 2018 Nikolai Tschacher