AI companies might find it harder to access the entire web to train their large language models after the internet infrastructure provider Cloudflare said this week it would block AI data crawlers by default.
It’s the latest front to open in an ongoing fight between the creators of content and the AI developers who use that content to train generative AI models. In court, authors and content creators are suing major AI companies for compensation, saying copyrighted content was used without permission. (Disclosure: Ziff Davis, CNET’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
While content providers are seeking compensation for information that was used to train models in the past, Cloudflare’s move marks a new defensive measure against future efforts to train models.
But it isn’t just about blocking crawlers: Cloudflare says it wants to create a marketplace where AI companies can pay to crawl and scrape a site, meaning the provider of that information gets paid, and the AI developer gets permission.
“That content is the fuel that powers AI engines, and so it’s only fair that content creators are compensated directly for it,” Cloudflare CEO Matthew Prince said in a blog post.
Why websites want to block AI crawlers
Crawlers, the bots that visit websites and copy their information, are a vital component of the connected internet. They are how search engines like Google know what’s on different websites, and how they can serve you the latest information from places like CNET.
AI crawlers pose distinct challenges for websites. For one, they can be aggressive, generating unsustainable levels of traffic for smaller sites. They also offer little reward for their scraping: If Google crawls a site for search engine results, it will likely send traffic back to that site by including it in search results. Being crawled for training data might bring no additional traffic, or even reduce traffic if people stop visiting the site and rely on the AI model instead.