The artificial intelligence search company Perplexity AI has previously been accused of engaging in unauthorized web scraping. Even when websites explicitly disallow such activity via their robots.txt files, Perplexity AI reportedly disregards these directives and continues to extract content.
Now, a new research report from Cloudflare says that Perplexity AI not only ignores these standard declarations but also uses a range of techniques to evade firewalls and conceal its scraping, behavior that could harm websites and publishers.
The robots.txt file is a widely recognized industry standard that allows websites to communicate with crawlers and bots, specifying which areas may be accessed and which should remain off-limits. Site owners can also use this file to completely disallow certain bots from crawling any part of their domain.
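To illustrate how a well-behaved crawler is expected to consult these rules, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are hypothetical and chosen only for illustration; `PerplexityBot` is used as the crawler token on the assumption that it matches the name the company publishes for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is banned from the whole site,
# every other bot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/x"))
```

Note that the check runs on the crawler's side. robots.txt is purely advisory, so nothing in the protocol stops a bot that simply skips this step.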
Despite websites having clearly stated in their robots.txt files that Perplexity AI’s crawler is not permitted to scrape content, Cloudflare has found that the company refuses to adhere to this protocol. Attempts to block its crawler prove ineffective.
Notably, some websites have opted for more aggressive defensive measures, such as issuing HTTP 403 responses upon detecting Perplexity AI’s crawler or its associated Autonomous System Numbers (ASNs), thereby denying access entirely.
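A server-side block of this kind can be sketched roughly as follows. This is an illustrative outline, not how Cloudflare or any particular site implements it: the `response_status` function is hypothetical, the User-Agent tokens are assumed to match the crawler names Perplexity AI publishes, and the ASNs are taken from the range reserved for documentation rather than from any real network.

```python
# Assumed crawler tokens, compared case-insensitively against the User-Agent header.
BLOCKED_UA_TOKENS = ("perplexitybot", "perplexity-user")

# Placeholder ASNs from the documentation-reserved range (64496-64511);
# a real deployment would list the networks it actually wants to deny.
BLOCKED_ASNS = {64496, 64497}


def response_status(user_agent: str, asn: int) -> int:
    """Return 403 for a blocked crawler token or source ASN, otherwise 200."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return 403
    if asn in BLOCKED_ASNS:
        return 403
    return 200


if __name__ == "__main__":
    print(response_status("Mozilla/5.0 (compatible; PerplexityBot/1.0)", 64500))
    print(response_status("ExampleBrowser/1.0", 64496))
    print(response_status("ExampleBrowser/1.0", 64500))
```

Both checks depend on the client presenting a recognizable identity: a request that carries neither a listed token nor a listed ASN is served normally.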
In response, Perplexity AI has attempted to bypass these restrictions by rotating User-Agent strings and ASNs. Specifically, instead of using its official crawler identifier, the company impersonates regular users by mimicking standard User-Agent headers, while also switching to different ASNs to elude Cloudflare’s detection mechanisms.
The company’s disclosed crawler User-Agent is:
One of the spoofed User-Agent strings used to circumvent detection is:
According to Cloudflare, it was first alerted to this behavior by complaints from customers who noticed that Perplexity AI was still scraping their websites, even after they had explicitly prohibited such activity in their robots.txt files and actively blocked the crawler.
Upon receiving these reports, Cloudflare conducted internal testing and confirmed that Perplexity AI was indeed evading these defenses. Perplexity AI had swapped out its official crawler identifiers for User-Agent strings that present as Chrome on macOS, in order to avoid being blocked by Cloudflare’s protection mechanisms or by the websites themselves.
Due to these unethical scraping practices and deliberate attempts to circumvent firewall protections, Cloudflare has announced that it has removed Perplexity AI’s crawler from its list of verified bots. This decision will likely make it significantly more difficult for Perplexity AI to access content on sites protected by Cloudflare in the future.
In response, Perplexity AI issued a statement claiming that Cloudflare’s blog post is a marketing ploy to promote its own services. The company asserted that the screenshots used in the article show no actual content was accessed, and that the bot referenced by Cloudflare does not belong to them.
This response mirrors how Perplexity AI has handled earlier criticism over ignoring robots.txt directives. The company appears consistently unwilling to acknowledge any wrongdoing, habitually insisting either that it has not violated any rules or that the scraping activity was not its own.