Fastly, a major U.S. CDN provider, has released its Q2 2025 Threat Defense Report. The report finds that AI-driven bots are reshaping web traffic patterns, and that the most significant risks stem not from data-gathering crawlers but from real-time queries issued at inference time, while models are answering users.
According to the report, nearly 80% of AI-related traffic originates from crawlers harvesting training data. While this volume is substantial, the true threat lies in inference-phase scraping, where AI platforms, in the course of responding to user prompts, issue live queries across the internet to retrieve information.
At peak load, these real-time queries can hit a single website with up to 39,000 requests per minute (about 650 per second), far exceeding the roughly 1,000 requests per minute typically generated by training-data crawlers. The finished response may embed only a handful of links for user verification, even though the bot may have queried hundreds of sites to construct the answer.
If websites lack proper concurrency controls or defensive measures, such bursts of requests can mimic the effects of a Distributed Denial of Service (DDoS) attack, overwhelming servers and leading to congestion or outright outages.
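One common form of such a defensive measure is a per-client rate limit. The following is a minimal sketch of a sliding-window limiter, not something taken from the Fastly report; the class name `SlidingWindowLimiter`, its parameters, and the idea of keying on a client identifier (for example an IP address or user-agent string) are all illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most `limit` requests per client
    within any `window`-second span."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # client_id -> recent request times

    def allow(self, client_id):
        now = self.clock()
        recent = self.hits[client_id]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over budget; caller might answer HTTP 429
        recent.append(now)
        return True


if __name__ == "__main__":
    # Assumed budget of 1,000 requests per minute per client.
    limiter = SlidingWindowLimiter(limit=1000, window=60.0)
    accepted = sum(limiter.allow("example-bot") for _ in range(39000))
    print(accepted)
```

In this sketch, a burst of 39,000 requests arriving within one minute from a single client would have only its first 1,000 accepted. A production deployment would more likely enforce the limit at the CDN or reverse proxy rather than in application code.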
As for traffic sources, the report notes that the overwhelming majority of AI crawler activity originates from Meta, Google, and OpenAI, which together account for 95% of observed crawler traffic: Meta at 52%, Google at 23%, and OpenAI at 20%.
In real-time inference scraping, however, OpenAI dominates: its ChatGPT-User and OAI-SearchBot bots are responsible for 98% of this traffic. Unlike training crawlers, these bots act as live agents, retrieving web content in response to user queries.
Regionally, 90% of AI traffic to North American websites comes from training crawlers, while in Europe, the Middle East, and Africa the balance tilts the other way, with 59% stemming from real-time inference queries. The Asia-Pacific and Latin American regions remain dominated by training-data crawlers.
In terms of content sourcing, OpenAI’s GPTBot (dedicated to training data collection) has the widest reach, with coverage extending to 95% of unique websites in the dataset. OpenAI’s strategy appears to favor maximum breadth, crawling as many sites as possible, while Meta pursues depth, indexing fewer domains but attempting to exhaustively capture their content.
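For site operators who want to see this split in their own logs, the bot names cited in the report can be matched against the User-Agent header. The sketch below is only an illustration: the function name `classify_ai_bot` and the mapping are assumptions, the token list covers just the three OpenAI bots named in this article, and User-Agent strings can be spoofed, so a match is a hint rather than proof.

```python
# Tokens taken from the bot names mentioned in this article; the
# "training" / "inference" labels follow the article's description.
AI_BOT_TOKENS = {
    "gptbot": "training",
    "chatgpt-user": "inference",
    "oai-searchbot": "inference",
}


def classify_ai_bot(user_agent):
    """Hypothetical helper: return "training", "inference", or None
    based on a case-insensitive substring match on the User-Agent."""
    ua = (user_agent or "").lower()
    for token, kind in AI_BOT_TOKENS.items():
        if token in ua:
            return kind
    return None


if __name__ == "__main__":
    # Example header values are made up for illustration.
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.0)",
               "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
               "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"):
        print(ua, "->", classify_ai_bot(ua))
```

Counting the returned labels per minute would give a rough view of how much of a site's AI traffic is training crawl versus live inference fetches.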