
The Wikimedia Foundation, which operates Wikipedia and its sister sites, recently disclosed that its infrastructure has been overwhelmed by waves of AI-powered web crawlers. These automated bots have consumed vast amounts of costly server resources, placing a heavy burden on Wikimedia engineers who struggle to mitigate their impact through technical means, while inflicting substantial operational costs on the organization.
Wikimedia Commons serves as a free repository for images, videos, and other media, currently hosting over 144 million files. This immense archive has become a prime target for AI scrapers, which relentlessly harvest its content to compile datasets for training machine learning models.
In addition to Wikimedia Commons, Wikipedia itself has been subject to aggressive and indiscriminate scraping. Acknowledging that technical defenses alone can no longer hold back this surge, the organization has taken a proactive step: curating and releasing an AI-optimized dataset specifically designed for training purposes. The dataset is hosted on Google's Kaggle platform, a community hub for data scientists, in the hope that AI developers will download structured data directly rather than continuing to bombard Wikipedia's servers.
The newly released dataset has been meticulously crafted with machine learning workflows in mind, enabling AI practitioners to easily access machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis. All included content is publicly licensed and freely distributable.
The dataset is current as of April 15, 2025, and includes research abstracts, concise descriptions, image links, infobox data, and article sections. It intentionally omits references, source documents, and audio files, focusing solely on textual and structural elements. The initial release features both English and French versions.
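For illustration, the sketch below shows how a practitioner might read records from such a structured snapshot, assuming it ships as newline-delimited JSON; the file name and field names here are assumptions for the sake of example, not confirmed details of the release.

```python
import json

# Hypothetical file name for the English snapshot; the actual dataset
# on Kaggle may use different file names and field names.
PATH = "enwiki_structured_contents.jsonl"

with open(PATH, encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        # Assumed fields, based on the description above: abstract,
        # short description, image links, infobox data, and sections.
        title = article.get("name")
        abstract = article.get("abstract") or ""
        sections = article.get("sections", [])
        print(f"{title}: {abstract[:80]} ({len(sections)} sections)")
        break  # inspect only the first record
```

Because each record arrives as already-parsed structure rather than raw wiki markup or HTML, this kind of workflow avoids the parsing and cleanup steps that scraping normally requires.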
Wikimedia believes that a well-structured dataset in JSON format will be far more appealing than the laborious task of scraping and parsing raw Wikipedia content. Whether this approach will succeed in curbing the onslaught of AI web crawlers, however, remains to be seen.