WARCannon – Catastrophically powerful parallel WARC processing
WARCannon was built to simplify and cheapify the process of ‘grepping the internet’.
With WARCannon, you can:
- Build and test regex patterns against real Common Crawl data (see the sketch after this list)
- Easily load Common Crawl datasets for parallel processing
- Scale compute capabilities to asynchronously crunch through WARCs at frankly unreasonable capacity
- Store and easily retrieve the results
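To get a feel for the pattern-development step before scaling anything out, here is a minimal local sketch (not WARCannon code) that streams a single Common Crawl WARC over HTTPS and tests one regex against each HTTP response record. It assumes Python with the `warcio` and `requests` packages, and the jQuery fingerprint is purely illustrative.

```python
# Minimal local sketch: test one regex against one Common Crawl WARC.
# Assumes: pip install warcio requests. Illustrative only, not WARCannon code.
import gzip
import re

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2020-50"  # one of Common Crawl's monthly crawl IDs

# Each crawl publishes a gzipped manifest of its WARC paths; take the first.
paths_url = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/warc.paths.gz"
first_path = gzip.decompress(requests.get(paths_url).content).decode().splitlines()[0]

# The pattern under test -- an illustrative jQuery version fingerprint.
pattern = re.compile(rb"jquery-([0-9][0-9.]*)\.min\.js")

# Stream the (roughly 1 GB) WARC and scan each HTTP response as it arrives.
resp = requests.get(f"https://data.commoncrawl.org/{first_path}", stream=True)
resp.raise_for_status()
hits = 0
for record in ArchiveIterator(resp.raw):
    if record.rec_type != "response":
        continue
    match = pattern.search(record.content_stream().read())
    if match:
        print(record.rec_headers.get_header("WARC-Target-URI"), match.group(1).decode())
        hits += 1
        if hits >= 10:  # stop early; this is only a pattern test
            break
```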
How it Works
WARCannon makes clever use of AWS technologies to horizontally scale to any capacity: it minimizes cost through spot fleets and same-region data transfer, draws from S3 at incredible speeds (up to 100 Gbps per node), parallelizes across hundreds of CPU cores, reports status via DynamoDB and CloudFront, and stores results in S3.
In all, WARCannon can process multiple regular expression patterns across 400TB in a few hours for around $100.
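The fan-out model behind that throughput is easy to picture. The sketch below is a simplification, not WARCannon's actual Node.js implementation: it gives each CPU core its own WARC to scan, and on a real node the same idea is multiplied across hundreds of cores and many spot instances. `scan_warc` and `PATTERNS` are illustrative names.

```python
# Simplified fan-out sketch: one worker process per CPU core, one WARC per task.
# Assumes: pip install warcio requests. Names are illustrative, not WARCannon's API.
import gzip
import re
from multiprocessing import Pool

import requests
from warcio.archiveiterator import ArchiveIterator

PATTERNS = {
    "jquery": re.compile(rb"jquery-([0-9][0-9.]*)\.min\.js"),
}

def scan_warc(path):
    """Stream one WARC from Common Crawl and return (pattern, URI) hits."""
    hits = []
    resp = requests.get(f"https://data.commoncrawl.org/{path}", stream=True)
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        body = record.content_stream().read()
        for name, rx in PATTERNS.items():
            if rx.search(body):
                hits.append((name, record.rec_headers.get_header("WARC-Target-URI")))
    return hits

if __name__ == "__main__":
    # Grab a handful of WARC paths from one monthly crawl's manifest.
    manifest = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/warc.paths.gz"
    paths = gzip.decompress(requests.get(manifest).content).decode().splitlines()[:8]

    with Pool() as pool:  # defaults to one worker per CPU core
        for hits in pool.imap_unordered(scan_warc, paths):
            for name, uri in hits:
                print(name, uri)
```

The real tool pulls from S3 in-region rather than over public HTTPS, which is where the same-region transfer savings and per-node throughput described above come from.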
WARCannon is fed by Common Crawl via the AWS Open Data program. Common Crawl is unique in that the data retrieved by its spiders captures not only website text, but also other text-based content such as JavaScript, TypeScript, full HTML, and CSS. By constructing suitable regular expressions capable of identifying unique components, researchers can identify websites by the technologies they use, without ever touching those websites themselves. The catch is that this requires parsing hundreds of terabytes of data, a tall order no matter what resources you have at your disposal.
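To make the "identify by technology" idea concrete, here are a few fingerprint patterns of the kind a researcher might write. These are common illustrative examples, not WARCannon's shipped rule set.

```python
# Illustrative technology fingerprints; examples only, not WARCannon's rules.
import re

FINGERPRINTS = {
    # WordPress sites almost always reference their wp-content asset paths.
    "wordpress": re.compile(rb"/wp-content/(themes|plugins)/"),
    # jQuery bundles embed their version in the minified filename.
    "jquery":    re.compile(rb"jquery-([0-9][0-9.]*)\.min\.js"),
    # AWS access key IDs follow a fixed, well-known shape.
    "aws_key":   re.compile(rb"AKIA[0-9A-Z]{16}"),
}

def identify(body):
    """Return the names of all fingerprints that match a response body."""
    return [name for name, rx in FINGERPRINTS.items() if rx.search(body)]

print(identify(b'<script src="/wp-content/themes/x/jquery-3.5.1.min.js">'))
# -> ['wordpress', 'jquery']
```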
Install & Use
Copyright (c) 2020 Brad Woodward (brad@bradwoodward.io)