Robots.txt and Crawlers: Are You Actually Blocking What You Think You're Blocking?

Started by JayJ, Jun 26, 2026, 08:32 AM

JayJ

Robots.txt is the standard way to tell web crawlers which pages not to visit. Disallow GPTBot and ClaudeBot and, in theory, those AI training crawlers won't collect your content. The problem is that robots.txt is voluntary: it's a convention, not a technical barrier. Legitimate crawlers from reputable companies honour it. Less scrupulous scrapers ignore it entirely.

Your forum's robots.txt file is worth reviewing if you haven't looked at it recently. What are you currently blocking? What are you allowing? And, more importantly, are the crawlers you think you're blocking actually respecting it? Your server access logs will tell you: look for requests from those user agents to paths you've disallowed.
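If you want to check your logs rather than eyeball them, here is a rough Python sketch. It assumes your server writes the common "combined" access log format, and that you've disallowed the whole site ("/") for GPTBot and ClaudeBot. The sample log lines, the IP addresses and the find_violations name are all made up for illustration, so swap in your own log file and rules.

```python
import re
from collections import Counter

# Combined Log Format:
# ip - user [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# User-agent tokens to watch for (extend to match your own robots.txt).
WATCHED_BOTS = ("GPTBot", "ClaudeBot")

# Path prefixes assumed to be disallowed for those bots; "/" means everything.
DISALLOWED_PREFIXES = ("/",)


def find_violations(lines, bots=WATCHED_BOTS, prefixes=DISALLOWED_PREFIXES):
    """Count requests from watched bots to disallowed paths, keyed by (bot, path)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined log format
        path = match.group("path")
        if path == "/robots.txt":
            continue  # fetching robots.txt itself is expected behaviour
        agent = match.group("agent").lower()
        for bot in bots:
            if bot.lower() in agent and path.startswith(prefixes):
                hits[(bot, path)] += 1
    return hits


# Made-up sample lines for illustration only, not real traffic.
SAMPLE_LOG = [
    '203.0.113.5 - - [26/Jun/2026:08:40:01 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '203.0.113.5 - - [26/Jun/2026:08:40:09 +0000] "GET /index.php?topic=12 HTTP/1.1" 200 20480 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '198.51.100.7 - - [26/Jun/2026:08:41:00 +0000] "GET /index.php?topic=12 HTTP/1.1" 200 20480 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/127.0"',
]

if __name__ == "__main__":
    for (bot, path), count in find_violations(SAMPLE_LOG).items():
        print(f"{bot} requested disallowed path {path} ({count}x)")
```

To run it on a real log, pass an open file instead of SAMPLE_LOG, e.g. find_violations(open("access.log")), where the filename is whatever your server uses. Bear in mind this only catches crawlers that identify themselves honestly. A scraper that fakes a browser user agent won't show up here.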

What does your robots.txt setup look like? Are you blocking AI training crawlers? Have you found any that claimed to respect robots.txt but didn't?

QuantumDay

Thank you for asking.
Ours is, obviously, at:

https://qday.forum/robots.txt

We've relied heavily on prior work by folks over at SMF.

Sitemap: https://qday.forum/sitemap.xml

User-agent: *
Crawl-delay: 25

# Easiest version for pretty URLs, for Googlebot/Bingbot (last edit removed the _ rule because underscores appear in some links)
Disallow: /*.
Disallow: /*=
Disallow: /*?
Disallow: /*;

# But because other crawlers don't support *, we have to list paths explicitly

# Directories
Disallow: /attachments/
Disallow: /avatars/
Disallow: /custom_avatar/
Disallow: /cache/
Disallow: /Favicons/
Disallow: /Packages/
Disallow: /pwa-icons/
Disallow: /Smileys/
Disallow: /Sources/
Disallow: /Test/
Disallow: /Themes/
Disallow: /tools/
Disallow: /videos/

# Script pages
Disallow: /cron.php
Disallow: /proxy.php
Disallow: /proxy_thumb.php

# Session and parameter traps
Disallow: /*PHPSESSID=
Disallow: /*prev_next=
Disallow: /*;sort=
Disallow: /*;wap2
Disallow: /*;topicseen
Disallow: /*.msg
Disallow: /*/msg
Disallow: /index.php?msg=
Disallow: /index.php?

# Pretty URL actions
Disallow: /pwa-manifest
Disallow: /agreement
Disallow: /acceptagreement
Disallow: /activate
Disallow: /admin
Disallow: /announce
Disallow: /attachapprove
Disallow: /buddy
Disallow: /calendar
Disallow: /clock
Disallow: /coppa
Disallow: /credits
Disallow: /deletemsg
Disallow: /dlattach
Disallow: /editpoll
Disallow: /editpoll2
Disallow: /findmember
Disallow: /groups
Disallow: /help
Disallow: /helpadmin
Disallow: /jsmodify
Disallow: /jsoption
Disallow: /likes
Disallow: /lock
Disallow: /lockvoting
Disallow: /login
Disallow: /login2
Disallow: /logintfa
Disallow: /logout
Disallow: /markasread
Disallow: /mergetopics
Disallow: /mlist
Disallow: /moderate
Disallow: /modifycat
Disallow: /movetopic
Disallow: /movetopic2
Disallow: /notifyannouncements
Disallow: /notifyboard
Disallow: /notifytopic
Disallow: /pm
Disallow: /post
Disallow: /post2
Disallow: /printpage
Disallow: /profile
Disallow: /quickmod
Disallow: /quickmod2
Disallow: /quotefast
Disallow: /recent
Disallow: /reminder
Disallow: /removepoll
Disallow: /removetopic2
Disallow: /reporttm
Disallow: /requestmembers
Disallow: /restoretopic
Disallow: /search
Disallow: /search2
Disallow: /sendactivation
Disallow: /signup
Disallow: /signup2
Disallow: /smstats
Disallow: /splittopics
Disallow: /stats
Disallow: /sticky
Disallow: /suggest
Disallow: /theme
Disallow: /trackip
Disallow: /unread
Disallow: /unreadreplies
Disallow: /uploadAttach
Disallow: /verificationcode
Disallow: /viewprofile
Disallow: /vote
Disallow: /viewquery
Disallow: /viewsmfile
Disallow: /who
Disallow: /xmlhttp

# index.php action equivalents. Non-pretty versions
Disallow: /index.php?action=acceptagreement
Disallow: /index.php?action=activate
Disallow: /index.php?action=admin
Disallow: /index.php?action=agreement
Disallow: /index.php?action=announce
Disallow: /index.php?action=attachapprove
Disallow: /index.php?action=attbr
Disallow: /index.php?action=buddy
Disallow: /index.php?action=calendar
Disallow: /index.php?action=clock
Disallow: /index.php?action=coppa
Disallow: /index.php?action=deletemsg
Disallow: /index.php?action=dlattach
Disallow: /index.php?action=downloads
Disallow: /index.php?action=editpoll
Disallow: /index.php?action=editpoll2
Disallow: /index.php?action=findmember
Disallow: /index.php?action=groups
Disallow: /index.php?action=help
Disallow: /index.php?action=helpadmin
Disallow: /index.php?action=jsmodify
Disallow: /index.php?action=jsoption
Disallow: /index.php?action=likes
Disallow: /index.php?action=lock
Disallow: /index.php?action=lockvoting
Disallow: /index.php?action=login
Disallow: /index.php?action=login2
Disallow: /index.php?action=logintfa
Disallow: /index.php?action=logout
Disallow: /index.php?action=markasread
Disallow: /index.php?action=mergetopics
Disallow: /index.php?action=mlist
Disallow: /index.php?action=moderate
Disallow: /index.php?action=modifycat
Disallow: /index.php?action=movetopic
Disallow: /index.php?action=movetopic2
Disallow: /index.php?action=notifyannouncements
Disallow: /index.php?action=notifyboard
Disallow: /index.php?action=notifytopic
Disallow: /index.php?action=pm
Disallow: /index.php?action=post
Disallow: /index.php?action=post2
Disallow: /index.php?action=printpage
Disallow: /index.php?action=profile
Disallow: /index.php?action=quickmod
Disallow: /index.php?action=quickmod2
Disallow: /index.php?action=quotefast
Disallow: /index.php?action=recent
Disallow: /index.php?action=reminder
Disallow: /index.php?action=removepoll
Disallow: /index.php?action=removetopic2
Disallow: /index.php?action=reporttm
Disallow: /index.php?action=requestmembers
Disallow: /index.php?action=restoretopic
Disallow: /index.php?action=search
Disallow: /index.php?action=search2
Disallow: /index.php?action=sendactivation
Disallow: /index.php?action=signup
Disallow: /index.php?action=signup2
Disallow: /index.php?action=smstats
Disallow: /index.php?action=splittopics
Disallow: /index.php?action=stats
Disallow: /index.php?action=sticky
Disallow: /index.php?action=suggest
Disallow: /index.php?action=theme
Disallow: /index.php?action=trackip
Disallow: /index.php?action=unread
Disallow: /index.php?action=unreadreplies
Disallow: /index.php?action=uploadAttach
Disallow: /index.php?action=verificationcode
Disallow: /index.php?action=viewprofile
Disallow: /index.php?action=vote
Disallow: /index.php?action=viewquery
Disallow: /index.php?action=viewsmfile
Disallow: /index.php?action=who
Disallow: /index.php?action=xmlhttp
Disallow: /index.php?action=.xml

## We disallowed any path containing a dot, so we need to override that for images, audio/video, and .js, .css and .xml files
Allow: /*.xml$
Allow: /*.ico$
Allow: /*.css$
Allow: /*.js$
Allow: /*.png$
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.webp$
Allow: /*.svg$
Allow: /*.heic$
Allow: /*.mp3$
Allow: /*.mp4$
Allow: /$

# My forum-specific quirks
Allow: /post-games/
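
For anyone wanting to sanity-check rules like these before deploying them, here is a small Python sketch using the standard library's urllib.robotparser. It takes a few non-wildcard lines from the file above and adds one hypothetical GPTBot group that is not in that file. "ExampleBot" is a made-up crawler name. As far as I know this parser does plain prefix matching and does not expand * or $, so it behaves like the simpler crawlers that the explicit path lists are written for, not like Googlebot.

```python
from urllib.robotparser import RobotFileParser

# A few non-wildcard rules copied from the file above, plus one
# hypothetical GPTBot group that is NOT part of that file.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 25
Disallow: /attachments/
Disallow: /cron.php
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

BASE = "https://qday.forum"


def allowed(agent, path):
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    # "ExampleBot" is a made-up name that falls under "User-agent: *".
    for agent, path in [
        ("ExampleBot", "/attachments/file.png"),
        ("ExampleBot", "/post-games/"),
        ("ExampleBot", "/login2"),
        ("GPTBot", "/post-games/"),
    ]:
        verdict = "allowed" if allowed(agent, path) else "blocked"
        print(f"{agent:10} {path:24} {verdict}")
    print("Crawl-delay for ExampleBot:", parser.crawl_delay("ExampleBot"))
```

With these rules, ExampleBot should be blocked from /attachments/file.png and /login2 (the /login prefix matches both) and allowed into /post-games/, while GPTBot is blocked everywhere. Of course this only tells you what a well-behaved crawler would do with the file, not what any given crawler actually does.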

SGHolly

Robots.txt is security theatre against bad actors. It only works against good actors, who by definition weren't the problem.

Mason0

The major AI companies do generally honour robots.txt, because the legal and reputational risk of not doing so is significant for them.