AI Training Data Crawlers: Your Forum Content Is Probably in Someone's Dataset

Started by StevenArroyo, Today at 01:13 AM



StevenArroyo

If you run or post on public forums, your content has almost certainly been collected by AI training data crawlers. Common Crawl has been archiving the public web since 2008, and its datasets feed into the training of multiple large language models. Several AI companies also run their own crawlers: GPTBot from OpenAI, ClaudeBot from Anthropic, and Bytespider from ByteDance. Google-Extended is a robots.txt token rather than a separate crawler; it controls whether content fetched by Google's existing crawlers may be used for Gemini training. Content people wrote years ago on forums like this one is now part of the training data behind AI models in use today. Some forum operators are now blocking these crawlers via robots.txt. Others accept it as the price of having public content. The legal status of this collection is being litigated in multiple jurisdictions.
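
For anyone who wants to try the robots.txt route, here is a rough sketch in Python that builds a blocking file and checks it with the standard library's urllib.robotparser. The token list is an assumption on my part: it uses the names from this post plus CCBot, which is the user agent Common Crawl's crawler identifies as. The helper names (build_robots_txt, is_allowed) and the forum.example URL are made up for illustration. Keep in mind robots.txt is a voluntary convention, so this only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of tokens to block. CCBot is not named in the post above;
# it is Common Crawl's crawler user agent. Adjust to taste.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "Bytespider", "Google-Extended", "CCBot"]


def build_robots_txt(tokens):
    """Return robots.txt text that disallows each token and allows everyone else."""
    lines = []
    for token in tokens:
        lines.append(f"User-agent: {token}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates groups
    lines.append("User-agent: *")
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(AI_CRAWLER_TOKENS)
    print(robots)
    # Hypothetical URL, only used to exercise the rules locally.
    url = "https://forum.example/index.php?topic=1"
    for agent in AI_CRAWLER_TOKENS + ["Mozilla/5.0"]:
        print(agent, "allowed:", is_allowed(robots, agent, url))
```

Save the generated text as robots.txt at the root of your site. The check at the bottom only confirms the rules parse the way you intended; it says nothing about whether a given crawler actually obeys them.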
First post best post