Reddit Tries to Block Bots, Web Crawlers to Stop Unlicensed AI Data Scraping


Reddit is updating its robots.txt file, which implements the Robots Exclusion Protocol, to try to block bots and web crawlers from swiping data and content from its site.

Reddit says “good faith actors” like the Internet Archive will continue to have access to its platform, however, and adds that most Reddit users won’t be affected by or notice the change. Reddit will also continue its practice of rate-limiting, which may help prevent third-party scraping.

This isn’t an ironclad solution; as Google notes, robots.txt rules are voluntary, and a crawler can simply ignore them.

“The instructions in robots.txt files cannot enforce crawler behavior to your site; it’s up to the crawler to obey them,” Google states. “While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not.”
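To illustrate what "obeying" a robots.txt file involves, here is a minimal Python sketch of the check a well-behaved crawler might run before fetching a page. The robots.txt content, the user-agent names, and the example.com URL are all hypothetical and are not taken from Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; NOT Reddit's actual file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: friendly-archiver
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)
page = "https://example.com/r/some-page"  # placeholder URL

# A compliant crawler asks before fetching and skips the page if refused.
allowed_archiver = is_allowed(parser, "friendly-archiver", page)
allowed_other = is_allowed(parser, "some-ai-scraper", page)

print("friendly-archiver allowed:", allowed_archiver)
print("some-ai-scraper allowed:", allowed_other)
```

Nothing in this check is enforced by the server. A crawler that never calls it, or ignores the result, can still request the page, which is why the file alone cannot stop a determined scraper.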

This means that AI startups could still swipe Reddit data and train their models on the sly, even though Reddit’s policies explicitly forbid it. This month, Business Insider reported that both OpenAI and Anthropic have been circumventing robots.txt files to scrape websites anyway. It’s unclear whether Reddit’s Tuesday update directly addresses these firms’ methods.

“You may not use content on Reddit as…input for any model training without explicit consent from Reddit. Commercial use of any model trained with Reddit data is prohibited without explicit approval,” the company’s policies state.

Last month, Reddit hinted that further restrictions and changes were coming in a post on its public content policy. “We see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content,” the company said. “Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests.”

Reddit has made some data deals of its own. In February, Google and Reddit entered into a $60 million content licensing deal that allows Google to use Reddit’s API and lets Reddit use Google’s Vertex AI. Reddit responses later began appearing in Google Search AI Overviews, with mixed results.

ChatGPT may also start citing Reddit posts soon, thanks to another official partnership announced last month. It’s unclear whether Reddit content will help train OpenAI’s next models, but it’s possible considering AI firms’ seemingly endless hunger for new data. Reddit may have to get more specific soon as the FTC in March launched an investigation into its licensing of user data.

All this comes after Reddit limited access to its API last year, in part to prevent AI companies from scraping its data for free. That prompted a developer revolt, a brief subreddit blackout, and the demise of some popular Reddit clients.
