pagodo v2.6 releases: Automate Google Hacking Database scraping

pagodo

pagodo (Passive Google Dork) – Automate Google Hacking Database scraping

The goal of this project was to develop a passive Google dork script to collect potentially vulnerable web pages and applications on the Internet. It has two parts: ghdb_scraper.py, which retrieves the Google dorks, and pagodo.py, which leverages the information gathered by ghdb_scraper.py.

What are Google Dorks?

The awesome folks at Offensive Security maintain the Google Hacking Database (GHDB) found here: https://www.exploit-db.com/google-hacking-database. It is a collection of Google searches, called dorks, that can be used to find potentially vulnerable boxes or other juicy info that is picked up by Google’s search bots.

Changelog v2.6

  • Bumped yagooglesearch to version 1.9.0

Installation

git clone https://github.com/opsdisk/pagodo.git
pip install -r requirements.txt

Usage

ghdb_scraper.py

To start off, pagodo.py needs a list of all the current Google dorks. Unfortunately, the entire database cannot be easily downloaded. A couple of older projects did this, but their code was slightly stale and not multi-threaded, so collecting the ~3800 Google dorks would take a long time. ghdb_scraper.py is the resulting Python script.

ghdb_scraper.py Execution Flow

The flow of execution is pretty simple:

  • Fill a queue with Google dork numbers to retrieve based on a range
  • Worker threads retrieve the dork number from the queue, retrieve the page using urllib2, then process the page to extract the Google dork using the BeautifulSoup HTML parsing library
  • Print the results to the screen and optionally save them to a file (to be used by pagodo.py for example)
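
The producer/consumer flow above can be sketched as follows. This is a minimal illustration, not pagodo's actual code: the `fetch_dork` stub stands in for the real urllib2 retrieval and BeautifulSoup parsing step.

```python
import queue
import threading

def fetch_dork(dork_number):
    # Hypothetical stand-in for fetching the GHDB page for this dork
    # number and extracting the dork text with BeautifulSoup.
    return f"dork #{dork_number}"

def scrape_dorks(min_num, max_num, num_threads=3):
    """Fill a queue with dork numbers, then drain it with worker threads."""
    work = queue.Queue()
    for n in range(min_num, max_num + 1):
        work.put(n)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                n = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            dork = fetch_dork(n)
            with lock:
                results.append(dork)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The queue-based design is what makes the multi-threading safe: each worker pulls a distinct dork number, so no page is fetched twice.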

ghdb_scraper.py Switches

The script’s switches are self-explanatory:

-n MINDORKNUM     Minimum Google dork number to start at (Default: 5).
-x MAXDORKNUM     Maximum Google dork number, not the total, to retrieve
                  (Default: 5000). It is currently around 3800. There is no
                  logic in this script to determine when it has reached the
                  end.
-d SAVEDIRECTORY  Directory to save downloaded files (Default: cwd, ".")
-s                Save the Google dorks to a google_dorks_<TIMESTAMP>.txt file
-t NUMTHREADS     Number of search threads (Default: 3)

To run it

python ghdb_scraper.py -n 5 -x 3785 -s -t 3

pagodo.py

Now that a file with the most recent Google dorks exists, it can be fed into pagodo.py using the -g switch to start collecting potentially vulnerable public applications. pagodo.py leverages the yagooglesearch Python library to search Google for sites matching a Google dork, such as:

intitle:"ListMail Login" admin -demo

The -d switch can be used to specify a domain and functions as the Google search operator:

site:example.com
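
Combining a dork with the site: operator amounts to prefixing the query. A hypothetical helper illustrates the idea; pagodo's actual query construction may differ:

```python
def build_query(dork, domain=None):
    """Combine a Google dork with an optional site: restriction.

    Illustrative helper, not pagodo's real code: when a domain is given,
    the query is scoped to that site; otherwise the dork is used as-is.
    """
    return f"site:{domain} {dork}" if domain else dork
```

For example, `build_query('intitle:"ListMail Login" admin -demo', "example.com")` scopes the dork to example.com.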

pagodo.py Switches

The script’s switches are self-explanatory:

-d DOMAIN       Domain to search for Google dork hits.
-g GOOGLEDORKS  File containing Google dorks, 1 per line.
-j JITTER       Jitter factor (multiplied by the delay value) added to
                randomize lookup times. Default: 1.50
-l SEARCHMAX    Maximum results to search (default 100).
-s              Save the HTML links to a pagodo_results__<TIMESTAMP>.txt file.
-e DELAY        Minimum delay (in seconds) between searches...jitter (up to
                [jitter x delay]) is added to this value to randomize
                lookups. If it's too small, Google may block your IP; too
                big, and your search may take a while. Default: 30.0
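
The interplay of -e and -j can be sketched as below. This is an illustration of the described behavior (a random amount up to jitter x delay added on top of the minimum delay), not pagodo's exact formula:

```python
import random

def randomized_delay(delay=30.0, jitter=1.50):
    """Return a sleep time of at least `delay` seconds, plus up to
    (jitter * delay) extra seconds to randomize lookup timing.

    Illustrative only; the precise jitter calculation in pagodo may differ.
    """
    return delay + random.uniform(0, jitter * delay)
```

With the defaults (-e 30.0, -j 1.50), each pause falls somewhere between 30 and 75 seconds, which makes the search pattern harder for Google to fingerprint.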


To run it

python pagodo.py -d example.com -g dorks.txt -l 50 -s -e 35.0 -j 1.1

Copyright (C) opsdisk 

Source: https://github.com/opsdisk/