The popular online forum Reddit recently disclosed that it had discovered artificial intelligence companies harvesting Reddit data via the Internet Archive’s Wayback Machine, a practice the company says violates its terms of service.
Reddit has already blocked most search engine crawlers and AI scrapers from accessing its content. Under current policy, any party wishing to scrape Reddit data for AI model training must first obtain a commercial license and pay a fee. For example, Google reportedly pays Reddit up to $60 million annually for data access, allowing it to harvest vast numbers of posts and other content for training its models, an arrangement Google still considers worthwhile.
Historically, Reddit has collaborated with the Internet Archive to index posts and preserve snapshots in the Wayback Machine for future reference. However, AI companies seeking to avoid licensing fees have begun redirecting their crawlers to the Internet Archive, using it as a proxy to obtain Reddit data.
Upon discovering this, Reddit announced it would immediately begin blocking the Internet Archive from crawling and indexing most of its pages. The Wayback Machine will no longer be able to capture post detail pages, comments, or user profiles. Instead, it will be limited to indexing only certain public-facing elements such as the Reddit homepage and popular post listings, which effectively restricts it to titles and similar metadata.
Reddit’s CEO stated that the company would begin enforcing these restrictions as of today, having already notified the Internet Archive in advance. The Internet Archive has confirmed it is in active discussions with Reddit regarding the matter.
This move follows Reddit’s recent lawsuit against Anthropic, the developer of Claude, alleging that Anthropic scraped Reddit content without authorization. Reddit claims that Anthropic continued to harvest data even after Reddit explicitly blocked Anthropic’s crawlers, in direct violation of Reddit’s terms of service.