Cloudflare crawler controls

# Cloudflare managed robots.txt can change what crawlers see
If your site uses Cloudflare, the public /robots.txt response may include managed AI crawler rules in addition to, or instead of, the file your origin server serves.
## The short version
Do not audit only the file in your repo or CMS. Audit the live /robots.txt response. Cloudflare can prepend managed rules for known AI crawlers, create a robots file when your origin has none, and expose Content Signals that newer tools may understand differently from older SEO validators.
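One way to catch that divergence is to diff the live response against the file you ship. Here is a minimal sketch using only Python's standard library; `LIVE_URL` and `ORIGIN_FILE` are hypothetical placeholders, not values any tool mandates:

```python
import difflib
import urllib.request

# Hypothetical values: substitute your own domain and repo path.
LIVE_URL = "https://www.example.com/robots.txt"
ORIGIN_FILE = "public/robots.txt"  # the file your repo or CMS serves

with urllib.request.urlopen(LIVE_URL) as resp:
    live = resp.read().decode("utf-8", errors="replace").splitlines()

with open(ORIGIN_FILE, encoding="utf-8") as f:
    origin = f.read().splitlines()

# Lines marked '+' exist only in the live response: rules the edge
# added in front of (or instead of) your origin file.
diff = list(difflib.unified_diff(origin, live, "origin", "live", lineterm=""))
print("\n".join(diff) if diff else "Live response matches the origin file.")
```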
## What Cloudflare may add
| Situation | What can happen | Why it matters |
|---|---|---|
| Origin has a robots.txt file | Cloudflare may prepend managed AI crawler rules before your existing file. | Your repo file and the public crawler policy can differ. |
| Origin has no robots.txt file | Cloudflare may generate one for known AI crawler preferences. | A site can appear to have crawler policy even if the origin has none. |
| Content Signals are enabled | The live file may include usage signals for search, AI input, or AI training. | Some validators may warn even when Search crawling is not harmed. |
## The audit order I would use
- Open the live /robots.txt URL in a browser, not just your source file.
- Look for a Cloudflare managed content marker or AI-specific user-agent blocks (see the sketch after this list).
- Confirm whether the site owner intended to block training crawlers, search crawlers, or both.
- Check sitemap discovery separately; managed robots rules do not replace a clean sitemap.
- Use Cloudflare AI Crawl Control or WAF rules when you need enforcement, not just a preference file.
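A sketch that scripts those checks with Python's standard library: fetch the live file (step 1), scan for a managed marker (step 2), report the per-agent policy a parser actually derives (step 3), and list sitemap declarations (step 4). The domain, the agent list, and the loose `"cloudflare"` substring test are all illustrative assumptions; check your live file for the exact marker wording:

```python
import urllib.request
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # hypothetical; use the site under audit
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "CCBot", "Google-Extended"]

# Step 1: fetch the live response, not the repo file.
with urllib.request.urlopen(f"{SITE}/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Step 2: a loose marker check (assumption: the managed block mentions
# Cloudflare somewhere; verify the exact wording on your own live file).
if "cloudflare" in body.lower():
    print("Live file mentions Cloudflare; managed rules are likely present.")

rp = RobotFileParser()
rp.parse(body.splitlines())

# Step 3: the policy a spec-following crawler would derive, per agent.
for agent in AI_AGENTS:
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/") else "disallowed"
    print(f"{agent}: {verdict} at /")

# Step 4: sitemap discovery is separate from crawl rules.
print("Sitemaps:", rp.site_maps() or "none declared in robots.txt")
```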
## A policy that is clear but not overbroad
For many SaaS, docs, and publisher sites, the goal is not "block every AI-looking thing." It is usually more specific: keep normal search discovery open, make training preferences explicit, and avoid accidentally blocking user-triggered fetchers that help people find or summarize your public pages. A minimal policy along those lines:
```
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
```
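Grouping rules in robots.txt are easy to get subtly wrong, so it is worth sanity-checking that a policy like this does what you intend before deploying it. A quick check with Python's stdlib parser, using the example file above verbatim:

```python
from urllib.robotparser import RobotFileParser

POLICY = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

# Search discovery stays open; the training crawler is refused.
assert rp.can_fetch("Googlebot", "https://www.example.com/docs")
assert rp.can_fetch("OAI-SearchBot", "https://www.example.com/docs")
assert not rp.can_fetch("GPTBot", "https://www.example.com/docs")

# Agents with no matching group (and no "*" group) fall through to allowed.
assert rp.can_fetch("Bingbot", "https://www.example.com/docs")
print("Policy matches intent: search open, training crawler disallowed.")
```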
## How CrawlerSignal uses this
CrawlerSignal fetches the live robots response because that is what crawlers see. If Cloudflare is changing the public file, the audit should reflect the public policy, not just the origin file. The score should still be interpreted as a checklist, not a verdict.