AI crawler policy for GEO — search=yes, ai-train=no, and full Disallow
GEO requires AI systems to access your site. LuminaForge explains Content-Signal, robots.txt blocks, and why crawler policy belongs next to schema and llms.txt — with our public allow-list.
AI crawler policy determines whether ChatGPT, Perplexity, Claude, and Google can fetch your pages in the first place. GEO is not only schema and blog posts: it is access, plus extractable answers, plus public measurement.
LuminaForge publishes its crawler allow-list at robots.txt and citation snapshots on the transparency dashboard. This guide explains the signals many GEO agencies confuse.
Three layers — not one switch
| Layer | What it controls | GEO impact |
|---|---|---|
| robots.txt Disallow | Whether a bot can crawl URLs at all | Full block usually prevents live retrieval and citation |
| Content-Signal: search=yes | Permission to index / retrieve for answers | Intended to allow AI search and RAG-style use |
| Content-Signal: ai-train=no | Opt-out of model training | Does not replace an explicit allow for retrieval |
Short answer: ai-train=no is not the same as welcoming AI citations. A full Disallow: / for GPTBot blocks access entirely — stronger than any Content-Signal line.
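To make the layering concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt text, the example.com URL, and the helper names parse_content_signals and can_crawl are illustrative assumptions, not LuminaForge's live configuration. Writing both signals on one comma-separated Content-Signal line is also an assumption about syntax. The sketch shows that a crawl-access check looks only at Allow and Disallow rules, so a Disallow for GPTBot applies no matter what the Content-Signal line says.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the wildcard group opts out of training but
# welcomes search use, while GPTBot is fully disallowed.
ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /
"""


def parse_content_signals(robots_txt: str) -> dict[str, str]:
    """Collect key=value pairs from every Content-Signal line.

    Simplification: user-agent grouping is ignored, so this only reports
    which signals appear anywhere in the file.
    """
    signals: dict[str, str] = {}
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, _, setting = pair.partition("=")
            if key.strip() and setting.strip():
                signals[key.strip()] = setting.strip()
    return signals


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the Allow/Disallow rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # unknown directives are skipped
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/services"  # placeholder public URL
    print("signals:", parse_content_signals(ROBOTS_TXT))
    for bot in ("GPTBot", "PerplexityBot"):
        print(bot, "can crawl:", can_crawl(ROBOTS_TXT, bot, page))
```

Run as a script, this reports GPTBot as blocked and PerplexityBot as allowed, even though the parsed signals include search=yes. That is the sense in which a full Disallow outranks any Content-Signal line.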
What LuminaForge allows
On luminaforge.ai, public marketing routes explicitly allow:
- GPTBot, OAI-SearchBot, ChatGPT-User
- PerplexityBot, Perplexity-User
- ClaudeBot, Claude-Web, anthropic-ai
- Google-Extended, Applebot-Extended
- CCBot, Bytespider, Amazonbot
Admin paths and gated client previews stay disallowed. See the live file: robots.txt.
Common mistakes
- Cloudflare one-click “block AI bots”: marketing sells GEO while IT enables a managed block list. Audit robots.txt after any CDN toggle.
- Yoast llms.txt without crawler access: auto-generated llms.txt is not a substitute for allowing bots to read your pages.
- Training opt-out only: ai-train=no does not tell Perplexity or ChatGPT browse to cite you; you still need fetchable HTML and open rules.
- Schema without performance: crawlable but slow pages lose trust signals. We verify Core Web Vitals on the transparency dashboard.
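The "audit after any CDN toggle" advice can be sketched as a before/after comparison. The crawler names below come from the allow-list in this guide. The BEFORE and AFTER robots.txt texts, the /admin/ path, the example.com URL, and the function names are hypothetical. The sketch assumes the toggle shows up as robots.txt rules; a block enforced at the network or firewall level would not appear in this check.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the allow-list in this guide.
AI_BOTS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Amazonbot",
)

# Hypothetical robots.txt before the toggle. The narrower Disallow is listed
# first so it also wins under parsers that apply the first matching rule.
BEFORE = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

# Hypothetical robots.txt after a managed block list was switched on.
AFTER = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""


def blocked_bots(robots_txt: str, url: str, bots=AI_BOTS) -> set[str]:
    """Return the subset of `bots` that robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot for bot in bots if not parser.can_fetch(bot, url)}


def newly_blocked(before: str, after: str, url: str) -> set[str]:
    """Bots that could fetch `url` under `before` but not under `after`."""
    return blocked_bots(after, url) - blocked_bots(before, url)


if __name__ == "__main__":
    page = "https://example.com/services"  # placeholder public URL
    print(sorted(newly_blocked(BEFORE, AFTER, page)))
```

With these sample files, the comparison flags CCBot, ClaudeBot, and GPTBot as newly blocked. Running the same comparison against saved copies of your real robots.txt is one way to catch a toggle that IT enabled without telling marketing.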
How AI platforms fetch answers (why policy differs)
| Platform | Typical live web use | Citation behavior |
|---|---|---|
| Perplexity | Real-time search active | Source links common in answers |
| Google AI Overviews | Grounded in Google index | Sources from indexed pages |
| ChatGPT | Browse / search optional | Citations vary by mode and query |
| Claude | Browse optional | Citations vary by query |
Even when a platform leans on an existing search index, blocking AI-specific crawlers can still hurt freshness, llms.txt discovery, and brand-controlled pages you ship after indexation.
Generative share of voice — measure what access enables
After crawlers can reach your site, measure generative share of voice (SoV):
SoV (%) = (engine checks citing your domain ÷ total engine checks) × 100
LuminaForge runs a conversational query bank weekly and publishes results — including early 0% baselines — on /transparency. Client engagements receive the same reporting model.
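As a worked example of the SoV formula, the sketch below computes the percentage from a list of check results. The record shape (engine name plus a cited flag) and the sample data are assumptions made for illustration; they are not LuminaForge's reporting format or real results. The per-engine breakdown is an optional extra that applies the same formula to each engine's checks.

```python
from collections import defaultdict
from typing import Iterable


def share_of_voice(cited_flags: Iterable[bool]) -> float:
    """SoV (%) = (checks citing your domain / total checks) * 100."""
    flags = list(cited_flags)
    if not flags:
        raise ValueError("at least one engine check is required")
    return 100 * sum(flags) / len(flags)


def sov_by_engine(checks: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Apply the same formula separately to each engine's checks."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for engine, cited in checks:
        grouped[engine].append(cited)
    return {engine: share_of_voice(flags) for engine, flags in grouped.items()}


if __name__ == "__main__":
    # Made-up sample: four checks, one of which cited the domain.
    sample = [
        ("Perplexity", True),
        ("Perplexity", False),
        ("ChatGPT", False),
        ("ChatGPT", False),
    ]
    print(share_of_voice(cited for _, cited in sample))  # 25.0
    print(sov_by_engine(sample))  # {'Perplexity': 50.0, 'ChatGPT': 0.0}
```

A run with no citations returns 0.0 rather than raising an error, which matches the idea of publishing an early 0% baseline.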
SMB checklist (5 minutes)
- Open /robots.txt: are GPTBot and PerplexityBot allowed on public pages?
- Confirm llms.txt and llms-full.txt exist and list services + locations.
- Spot-check one service page for FAQPage JSON-LD and a plain-language FAQ block.
- Run three conversational queries in Perplexity — is your brand named?
- Request a free AI Visibility Snapshot if you want LuminaForge to score the full stack.
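The FAQPage spot-check in the list above can be partly automated. This is a minimal sketch using only the Python standard library; the class and function names and the sample HTML are hypothetical. It only confirms that a JSON-LD block declares the FAQPage type. It does not validate the schema or check for the visible plain-language FAQ block.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _types(node):
    """Yield every @type string found anywhere in decoded JSON-LD."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            yield declared
        elif isinstance(declared, list):
            yield from (t for t in declared if isinstance(t, str))
        for value in node.values():
            yield from _types(value)
    elif isinstance(node, list):
        for item in node:
            yield from _types(item)


def has_faq_page(html: str) -> bool:
    """Return True if any JSON-LD block in `html` declares @type FAQPage."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is skipped, not counted as a match
        if "FAQPage" in _types(data):
            return True
    return False


# Hypothetical page used only to exercise the check.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": []}
</script>
</head><body></body></html>"""

if __name__ == "__main__":
    print(has_faq_page(SAMPLE_HTML))
```

To check a real page, pass in the HTML that a crawler would receive, for example the body of a plain HTTP fetch. Markup injected later by client-side JavaScript would not be seen by this parser.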
Next steps
- How to get cited by ChatGPT and Perplexity — full implementation playbook
- GEO for home services — local SMB vertical guide
- Generative Engine Optimization service — how LuminaForge delivers GEO with web development
