Some crawlers gather data for both search and AI training, so publishers who block them to protect content risk disappearing from search results
Cloudflare on Wednesday said it will soon prevent mixed-use crawlers from accessing ad-supported customer websites by default, part of its ongoing efforts to give site publishers more control over how they engage with AI services.
Apple, Google, and Microsoft's Bing operate crawlers that could fall afoul of Cloudflare's decision, although each of the tech giants offers an AI opt-out that may allow them to escape sanctions.
Web crawlers make automated network requests to websites for various purposes. Google has used them for decades to visit websites for inclusion in its search index.
Over the past few years, many crawlers have started visiting sites to harvest content for training AI models. This has prompted various countermeasures – publishers feel they're not being fairly compensated for the content AI companies scrape to feed into their models.
But since Google's crawler, Googlebot, combines crawling for search indexing and content harvesting for AI training, site publishers have tended to accept the bot's presence because they fear that blocking it could mean disappearing from Google Search results.
The situation is similar for Microsoft's Bingbot. Apple has also enlisted its Applebot crawler to handle AI data gathering in addition to its indexing duties. The iBiz in June said: "The data crawled by Applebot may also be used to help train Apple foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools."
Apple and Google support robots.txt directives that allow publishers to opt out of AI data harvesting (via Applebot-Extended and Google-Extended). Bing supports a content="noarchive" attribute for the robots meta tag that also blocks data harvesting. Other crawler operators, however, often ignore robots.txt, which is a voluntary standard. Cloudflare therefore aims to provide site owners with a declarative content gate.
"Now that the majority of traffic on the Internet is non-human, we must go further and act faster so that a sustainable ecosystem can emerge," said Matthew Prince, co-founder and CEO of Cloudflare, in a statement.
"Cloudflare's new tools and partnerships give website owners increased visibility and commercial opportunities and reward AI companies that have bots with clear and transparent intent. We hope that our proposed default changes encourage mixed use crawlers to separate out search from agent use and training."
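For publishers wondering what the Applebot-Extended and Google-Extended opt-outs mentioned earlier look like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the example.com URL, and the `allowed` helper are illustrative assumptions, not rules published by Apple, Google, or Cloudflare, and the check only models how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of AI training while staying indexable.
# The two "-Extended" tokens are the ones named in the article.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

# The Bing opt-out described in the article lives in page markup instead.
BING_META_TAG = '<meta name="robots" content="noarchive">'


def allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the sample rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Google-Extended", "Applebot", "Applebot-Extended"):
        print(f"{agent}: {'allowed' if allowed(agent) else 'blocked'}")
```

Under these sample rules the search crawlers stay welcome while the AI-training tokens are refused. None of this is enforced by the file itself, which is why crawlers that ignore robots.txt are unaffected by it.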












