AI Training Data Crawlers: Your Forum Content Is Probably in Someone's Dataset

StevenArroyo · **Today** at 01:13 AM

If you run or post on a public forum, your content has almost certainly been collected by AI training data crawlers. Common Crawl has been archiving the public web since 2008, and its datasets feed into the training of multiple large language models. OpenAI's GPTBot crawls actively, as do Anthropic's ClaudeBot and ByteDance's Bytespider. Google-Extended is a little different: it is a robots.txt token rather than a separate crawler, and it controls whether pages Google has already crawled can be used for Gemini training. Content people wrote years ago on forums like this one is now part of the training data behind AI models in use today. Some forum operators are blocking these crawlers via robots.txt; others accept it as the price of having public content. The legality of this collection is being litigated in multiple jurisdictions.
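For anyone who wants to try the robots.txt route, here is a rough sketch in Python that builds a blocking robots.txt and checks it with the standard library's parser. The user-agent tokens are the ones I believe these operators publish (CCBot is Common Crawl's), but verify them against each operator's current documentation before relying on this. The helper names and the forum URL are made up for illustration. Keep in mind that robots.txt is voluntary: it only stops crawlers that choose to honor it, and it does nothing about content already collected.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; check each operator's docs, these can change.
AI_USER_AGENTS = ["CCBot", "GPTBot", "ClaudeBot", "Bytespider", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the given agents site-wide."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # Everyone else (regular search engines, browsers) stays allowed.
    blocks.append("User-agent: *\nAllow: /")
    return "\n\n".join(blocks) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Check whether a robots.txt-compliant client may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(AI_USER_AGENTS)

if __name__ == "__main__":
    thread_url = "https://forum.example/threads/1"  # hypothetical URL
    print(robots_txt)
    for agent in AI_USER_AGENTS + ["Mozilla/5.0"]:
        print(agent, "allowed:", is_allowed(robots_txt, agent, thread_url))
```

Save the generated text as robots.txt at the root of your site. The check at the bottom is just a sanity test that the rules parse the way you expect.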
