freshonions-torscraper: open source TOR spider / hidden service onion crawler

Fresh Onions TOR Hidden Service Crawler

This is a copy of the source for the http://zlal32teyptf4tvi.onion hidden service, which implements a Tor hidden service crawler/spider and web site.

Features

  • Crawls the darknet looking for new hidden services
  • Finds hidden services from a number of clearnet sources
  • Optional fulltext elasticsearch support
  • Marks clone sites of the /r/darknet superlist
  • Finds SSH fingerprints across hidden services
  • Finds email addresses across hidden services
  • Finds bitcoin addresses across hidden services
  • Shows incoming/outgoing links to onion domains
  • Up-to-date alive/dead hidden service status
  • Portscanner
  • Searches for “interesting” URL paths, with useful 404 detection
  • Automatic language detection
  • Fuzzy clone detection (requires elasticsearch, more advanced than superlist clone detection)

Infrastructure

Fresh Onions runs on two servers: a frontend host running the database and hidden service web site, and a backend host running the crawler. Probably most interesting to the reader is the setup for the backend. TOR as a client is COMPLETELY SINGLETHREADED. I know! It’s 2017, and along with a complete lack of flying cars, TOR runs in a single thread. What this means is that if you try to run a crawler on a single TOR instance, you will quickly find that instance maxing out a CPU core at 100%.

The solution to this problem is running multiple TOR instances and connecting to them through some kind of frontend that will round-robin your requests. The Fresh Onions crawler runs eight Tor instances.
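To make the round-robin idea concrete, here is a minimal Python sketch that rotates requests across several local proxy endpoints on the client side. This is an illustration only: Fresh Onions does the rotation in a frontend proxy rather than in the crawler, and the port numbers, the `proxy_url` helper and the `RoundRobinProxies` class are all hypothetical names made up for this example.

```python
from itertools import cycle

# Hypothetical local HTTP proxy ports, one per Tor instance (eight in total).
# These numbers are assumptions for illustration, not the project's real ports.
PROXY_PORTS = [8118 + i for i in range(8)]


def proxy_url(port, host="127.0.0.1"):
    """Build the URL of one local HTTP proxy endpoint."""
    return f"http://{host}:{port}"


class RoundRobinProxies:
    """Hand out proxy settings in strict rotation, one endpoint per call."""

    def __init__(self, ports):
        if not ports:
            raise ValueError("at least one proxy port is required")
        self._cycle = cycle([proxy_url(p) for p in ports])

    def next_proxies(self):
        # The returned dict has the shape that HTTP client libraries such as
        # `requests` accept for their `proxies` argument.
        url = next(self._cycle)
        return {"http": url, "https": url}


if __name__ == "__main__":
    rotation = RoundRobinProxies(PROXY_PORTS)
    for _ in range(3):
        print(rotation.next_proxies()["http"])
```

Each call to `next_proxies()` returns the next endpoint in the rotation, so consecutive requests land on different Tor instances and the load spreads across several processes instead of one.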

Debian (and Ubuntu) comes with a useful program “tor-instance-create” for quickly creating multiple instances of TOR. I used Squid as my frontend proxy, but unfortunately it can’t connect to a SOCKS proxy directly, so I used “privoxy” as an intermediate proxy between Squid and TOR. You will need one privoxy instance for every TOR instance. There is a script in “scripts/create_privoxy.sh” to help with creating privoxy instances on Debian systems. It also helps to replace /etc/privoxy/default.filter with an empty file, to reduce CPU load by removing unnecessary regexes.
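The Squid -> privoxy -> TOR chain can be sketched as a small Python generator that prints the relevant config fragments. This is not the contents of “scripts/create_privoxy.sh”, just a rough illustration under stated assumptions: the port bases (9050 for TOR SOCKS, 8118 for privoxy), the `privoxy{i}` peer names and both function names are hypothetical, and the output is only a fragment of what a working setup needs.

```python
def privoxy_config(index, tor_socks_base=9050, privoxy_base=8118):
    """Return a minimal privoxy config fragment for instance number `index`.

    Assumes (hypothetically) that Tor instance N listens for SOCKS on
    tor_socks_base + N and that its privoxy listens on privoxy_base + N.
    """
    socks_port = tor_socks_base + index
    listen_port = privoxy_base + index
    lines = [
        f"listen-address 127.0.0.1:{listen_port}",
        # Forward every request to this instance's Tor SOCKS port.
        f"forward-socks5 / 127.0.0.1:{socks_port} .",
    ]
    return "\n".join(lines) + "\n"


def squid_peers(count, privoxy_base=8118):
    """Return squid.conf lines that round-robin over `count` privoxy parents."""
    lines = [
        # Peers sharing one hostname need a unique name= so Squid can
        # tell them apart.
        f"cache_peer 127.0.0.1 parent {privoxy_base + i} 0 "
        f"no-query round-robin name=privoxy{i}"
        for i in range(count)
    ]
    # Never fetch directly; always go through one of the parents.
    lines.append("never_direct allow all")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(privoxy_config(0))
    print(squid_peers(8))
```

With eight instances this yields eight privoxy fragments, each pointing at its own TOR SOCKS port, and eight `cache_peer` lines that Squid rotates through.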

Install && Use

Copyright (C) 2017 dirtyfilthy