Cloudflare, the global internet services provider, has recently introduced an AI Crawler Leaderboard, a dynamic red-and-black list designed to validate, identify, and assess web crawlers operated by artificial intelligence companies across four key dimensions. The initial evaluation includes crawlers from OpenAI, Google, Meta, Anthropic, xAI, and ByteDance.
As of now, only OpenAI's ChatGPT crawler series has received commendable ratings, while xAI's Grok crawler and ByteDance's crawler occupy the bottom of the list, with ByteDance ranking last because it fails all measured criteria.
The leaderboard will soon expand to track and rate RAG (retrieval-augmented generation) and search engine crawlers as well, with more entities to be added over time. Based on this evaluation, website administrators can decide whether to take more aggressive measures to block specific crawlers, especially as robots.txt has become largely ineffective.
The four evaluation dimensions are as follows:
- Verified crawler via IP:
Has the AI company publicly disclosed the IP ranges used by its crawlers? Publishing this information allows accurate identification and prevents malicious impersonation by rogue bots.
- Verified crawler via WebBotAuth:
WebBotAuth is a protocol that authenticates crawler identities through cryptographic signatures, offering greater reliability than IP-based recognition alone.
- Separate crawlers:
Crawler segmentation is essential. By distinguishing between different types of crawlers, websites can selectively allow or block them, for instance by disabling data-mining crawlers while allowing those used for search indexing that may drive valuable traffic.
- Obeys robots.txt:
This standard industry convention informs crawlers about which parts of a site they may or may not access. Some crawlers, however, disregard this protocol entirely.
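To make the IP-based dimension concrete, here is a minimal Python sketch of how an administrator might check a request against a crawler operator's published ranges. The bot name `ExampleBot`, the `PUBLISHED_RANGES` table, and the function name are hypothetical, and the addresses are reserved documentation ranges rather than any company's real crawler IPs; in practice the ranges would come from the operator's own published list.

```python
import ipaddress

# Hypothetical table of published crawler ranges. These CIDRs are reserved
# documentation ranges (RFC 5737 / RFC 3849), not real crawler addresses.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified_crawler(claimed_bot: str, remote_ip: str) -> bool:
    """Return True only if remote_ip lies inside a range published for claimed_bot."""
    ranges = PUBLISHED_RANGES.get(claimed_bot)
    if not ranges:
        # No published ranges: the claim cannot be verified at all.
        return False
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed address in the log or request metadata.
        return False
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    print(is_verified_crawler("ExampleBot", "192.0.2.10"))
    print(is_verified_crawler("ExampleBot", "198.51.100.7"))
    print(is_verified_crawler("UnpublishedBot", "192.0.2.10"))
```

The second and third calls illustrate the two failure cases: a request that claims a known bot name from an address outside its ranges, and a bot whose operator has published no ranges, where no verification is possible.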
ByteDance's crawlers reportedly scour the entire internet daily while ignoring robots.txt guidelines. Moreover, ByteDance has not published the IP ranges associated with its bots, making it impossible for administrators to verify whether traffic claiming to originate from "Bytespider" is genuine.
That said, other AI crawlers have also fallen short. For example, Anthropic's crawlers and xAI's Grok crawler may likewise fail to honor robots.txt. Since none of these companies has published verifiable IP ranges, Cloudflare cannot currently determine with certainty whether they comply with crawler best practices.
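As a rough illustration of what "failing to honor robots.txt" looks like from the site owner's side, the following Python sketch compares logged requests against a site's rules using the standard library's `urllib.robotparser`. The robots.txt contents, the bot names, and the `find_violations` helper are all hypothetical, and the log is reduced to (user agent, path) pairs for brevity.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""


def find_violations(robots_txt, requests):
    """Return the (user_agent, path) pairs that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in requests if not parser.can_fetch(ua, path)]


# Hypothetical access-log entries, already reduced to user agent and path.
LOG = [
    ("ExampleBot", "/private/report.html"),
    ("ExampleBot", "/blog/post-1"),
    ("OtherBot", "/private/report.html"),
]

if __name__ == "__main__":
    for ua, path in find_violations(ROBOTS_TXT, LOG):
        print(f"{ua} requested disallowed path {path}")
```

A check like this relies on the user-agent string, which any client can forge, so a flagged request shows only that something claiming to be the crawler ignored the rules. Without published IP ranges or signed requests, the violation cannot be attributed to the company with certainty.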