troll-a: extracting secrets such as passwords, API keys, and tokens from Web ARChive files

finding secrets web archives

Troll-A

Troll-A is a command line tool for extracting secrets such as passwords, API keys, and tokens from WARC (Web ARChive) files. Troll-A is an easy-to-usecomprehensive, and fast solution for finding secrets in web archives.

Features

  • Protocols: Supports retrieving web archives directly from a network server via HTTP/HTTPS, from the Amazon S3 object storage service, or the local file system.
  • Compression: Supports web archives compressed with GZipBZip2, or ZStd. For ZStd, it also supports custom dictionaries prepended to the compressed data stream (as used by *.megawarc.warc.zst files).
  • Comprehensive: Uses the battle-tested ruleset from the Gitleaks project to detect up to 166 different types of secrets, tokens, keys, or other sensitive information.
  • Performance: Works concurrently and optionally uses optimized regular expressions (via go-re2) to process a typical Common Crawl web archive (~34.000 pages) in less than 30 seconds on AWS c7g.12xlarge.

Use

Examples

Common Crawl

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. The Common Crawl corpus contains petabytes of data collected regularly since 2008.

For example, to extract secrets from all of the 3.35 billion pages of the November/December 2023 crawl (called CC-MAIN-2023-50), you can do this:

# Download the list of all 90.000 WARC paths
curl -sSL -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz

# Iterate through all paths using 64 scanning jobs, output matches as JSON
gzcat warc.paths.gz | \
xargs -I{} -- troll-a -e -s -j64 https://data.commoncrawl.org/{} > secrets.json

Warning

This will take a long time! Depending on your hardware and Internet connection, this can take anywhere from a week to several months. You may want to run this example only for the first few lines of warc.paths.gz.

Internet Archive

The Archive Team is a group dedicated to digital preservation and web archiving founded in 2009. Web archives are stored as WARC files (more specifically, in MegaWARC format) and made available through the Internet Archive.

For example, to extract secrets from the 113.372 pages the Archive Team crawled from pastebin.com in April of 2023 (here’s the corresponding publication on the Internet Archive), you can do this:

# Call troll-a directly with the MegaWARC URL
troll-a -e https://archive.org/download/archiveteam_pastebin_20230421003309_a3b951b4/pastebin_20230421003309_a3b951b4.1603050931.megawarc.warc.zst

…which results in…

Detected: secret="acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b" rule="rapidapi-access-token" uri="https://pastebin.com/raw/bKMJXkQE" line=36 column=15
Detected: secret="acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b" rule="rapidapi-access-token" uri="https://pastebin.com/raw/bKMJXkQE" line=36 column=15
Detected: secret="acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b" rule="rapidapi-access-token" uri="https://pastebin.com/raw/nferefe2" line=37 column=6
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/print/cQEA2GCS" line=39 column=123
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/embed_js/cQEA2GCS" line=11 column=2688
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/embed_iframe/cQEA2GCS?theme=dark" line=49 column=123
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/cQEA2GCS" line=222 column=123
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/raw/cQEA2GCS" line=22 column=22
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/embed_iframe/cQEA2GCS" line=48 column=123
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/embed_js/cQEA2GCS?theme=dark" line=11 column=2796
Detected: secret="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" rule="github-pat" uri="https://pastebin.com/clone/cQEA2GCS" line=152 column=27
Success: Processed https://archive.org/download/archiveteam_pastebin_20230421003309_a3b951b4/pastebin_20230421003309_a3b951b4.1603050931.megawarc.warc.zst (113372 records)

Install

Copyright (C) 2023 crissyfield