freshonions-torscraper: open source TOR spider / hidden service onion crawler
Fresh Onions TOR Hidden Service Crawler
This is a copy of the source for the http://zlal32teyptf4tvi.onion hidden service, which implements a tor hidden service crawler/spider and web site.
- Crawls the darknet looking for new hidden services
- Finds hidden services from a number of clearnet sources
- Optional fulltext elasticsearch support
- Marks clone sites of the /r/darknet superlist
- Finds SSH fingerprints across hidden services
- Finds email addresses across hidden services
- Finds bitcoin addresses across hidden services
- Shows incoming/outgoing links to onion domains
- Up-to-date alive/dead hidden service status
- Searches for “interesting” URL paths, useful 404 detection
- Automatic language detection
- Fuzzy clone detection (requires elasticsearch, more advanced than superlist clone detection)
Fresh Onions runs on two servers: a frontend host running the database and the hidden service web site, and a backend host running the crawler. The backend setup is probably the most interesting part for the reader. TOR as a client is COMPLETELY SINGLE-THREADED. I know! It’s 2017, and along with a complete lack of flying cars, TOR runs in a single thread. This means that if you try to run a crawler against a single TOR instance, you will quickly find yourself maxing out one CPU core at 100%.
The solution is to run multiple TOR instances and connect to them through some kind of frontend that round-robins your requests across them. The Fresh Onions crawler runs eight TOR instances.
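The round-robin idea can also be sketched directly in Python, bypassing a proxy frontend entirely by rotating over the SOCKS ports of the local TOR instances. This is a minimal illustration, not the project’s actual code: the port numbers and the `next_proxies` helper are assumptions.

```python
# Sketch: rotate crawler requests across several local TOR instances.
# Assumes eight instances listening on SOCKS ports 9050, 9052, ... 9064
# (these port numbers are illustrative, not from the project).
import itertools

TOR_SOCKS_PORTS = [9050 + 2 * i for i in range(8)]
_proxy_cycle = itertools.cycle(TOR_SOCKS_PORTS)

def next_proxies() -> dict:
    """Return a requests-style proxies dict for the next TOR instance.

    'socks5h' makes DNS resolution happen inside TOR, which is required
    for .onion addresses to resolve at all.
    """
    port = next(_proxy_cycle)
    return {
        "http": f"socks5h://127.0.0.1:{port}",
        "https": f"socks5h://127.0.0.1:{port}",
    }

# Usage (requires the requests[socks] extra):
# import requests
# r = requests.get("http://zlal32teyptf4tvi.onion/", proxies=next_proxies())
```

Each successive call hands back the next instance, so concurrent workers naturally spread their load over all eight TOR processes.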
Debian (and Ubuntu) ships a useful program, “tor-instance-create”, for quickly creating multiple instances of TOR. I used Squid as my frontend proxy, but unfortunately Squid cannot talk to SOCKS upstreams directly, so I used “privoxy” as an intermediate HTTP-to-SOCKS proxy. You will need one privoxy instance for every TOR instance. There is a script in “scripts/create_privoxy.sh” to help with creating privoxy instances on Debian systems. It also helps to replace /etc/privoxy/default.filter with an empty file, which reduces CPU load by removing unnecessary regexes.
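On a systemd-based Debian system, the multi-instance setup described above might look roughly like the following sketch. The instance names, port numbers, and the privoxy forwarding line are illustrative assumptions, not the project’s exact scripts.

```shell
# Sketch only: provision eight TOR instances on Debian/Ubuntu (run as root).
# Instance names ("crawler0".."crawler7") and ports are arbitrary choices.
for i in $(seq 0 7); do
    tor-instance-create "crawler$i"
    # Give each instance its own SOCKS port so the proxies can tell them apart
    echo "SocksPort 127.0.0.1:$((9060 + i))" >> "/etc/tor/instances/crawler$i/torrc"
    systemctl enable --now "tor@crawler$i"
done

# Each matching privoxy instance then forwards HTTP traffic to one TOR
# SOCKS port via a line like this in its config (the trailing dot means
# no further HTTP proxy after the SOCKS hop):
#   forward-socks5t / 127.0.0.1:9060 .
```

Squid then sits in front of the eight privoxy instances as `cache_peer` parents, round-robining requests across them.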
Copyright (C) 2017 dirtyfilthy