
AI crawler policy for GEO — search=yes, ai-train=no, and full Disallow

GEO requires AI systems to access your site. LuminaForge explains Content-Signal, robots.txt blocks, and why crawler policy belongs next to schema and llms.txt — with our public allow-list.


AI crawler policy determines whether ChatGPT, Perplexity, Claude, and Google can fetch your pages in the first place. Generative Engine Optimization (GEO) is not only schema and blog posts: it is access plus extractable answers plus public measurement.

LuminaForge publishes its crawler allow-list at robots.txt and citation snapshots on the transparency dashboard. This guide explains the signals many GEO agencies confuse.

Three layers — not one switch

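| Layer | What it controls | GEO impact |
|---|---|---|
| robots.txt Disallow | Whether a bot can crawl URLs at all | Full block usually prevents live retrieval and citation |
| Content-Signal: search=yes | Permission to build a search index and return links and excerpts | Keeps pages eligible for search results; under Cloudflare's Content Signals Policy, AI-generated answers and RAG-style use fall under the separate ai-input signal |
| Content-Signal: ai-train=no | Opt-out of model training and fine-tuning | Does not replace an explicit allow for retrieval |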

Short answer: ai-train=no is not the same as welcoming AI citations. A full Disallow: / for GPTBot blocks access entirely — stronger than any Content-Signal line.
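
To make the precedence concrete, here is a minimal Python sketch that runs a made-up robots.txt through the standard library's urllib.robotparser. The file contents and the example.com URL are assumptions for illustration, not LuminaForge's live rules. The standard-library parser skips the Content-Signal line because it only understands crawl directives, so the Disallow group is what decides access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a permissive default group plus a full block for GPTBot.
ROBOTS = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /
"""


def can_fetch(robots_txt, agent, url):
    """Return True if the given user agent may crawl the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(can_fetch(ROBOTS, "GPTBot", "https://example.com/pricing"))         # False
print(can_fetch(ROBOTS, "PerplexityBot", "https://example.com/pricing"))  # True
```

The Content-Signal line applies to every bot in the default group, yet GPTBot is still shut out because its own group disallows everything.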

What LuminaForge allows

On luminaforge.ai, public marketing routes explicitly allow:

  • GPTBot, OAI-SearchBot, ChatGPT-User
  • PerplexityBot, Perplexity-User
  • ClaudeBot, Claude-Web, anthropic-ai
  • Google-Extended, Applebot-Extended
  • CCBot, Bytespider, Amazonbot

Admin paths and gated client previews stay disallowed. See the live file: robots.txt.
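
As a rough illustration of how an allow-list like this could be expressed, the Python sketch below renders one robots.txt group for the listed bots and checks it with urllib.robotparser. The /admin/ and /preview/ prefixes are hypothetical placeholders; the real disallowed paths are whatever the live robots.txt says.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the allow-list above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Amazonbot",
]

# Hypothetical private path prefixes (assumption, not the live file).
PRIVATE_PATHS = ["/admin/", "/preview/"]


def render_robots(bots, private_paths):
    """Build one robots.txt group shared by every listed bot."""
    lines = [f"User-agent: {bot}" for bot in bots]
    # Disallow rules go first: Python's parser applies the first matching rule.
    lines += [f"Disallow: {path}" for path in private_paths]
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Return True if the agent may crawl the URL under the rendered rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = render_robots(AI_BOTS, PRIVATE_PATHS)
print(is_allowed(robots_txt, "ClaudeBot", "https://example.com/services"))     # True
print(is_allowed(robots_txt, "ClaudeBot", "https://example.com/admin/login"))  # False
```

Generating the file from a single list keeps the published allow-list and the deployed rules from drifting apart.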

Common mistakes

  1. Cloudflare one-click “block AI bots”: marketing sells GEO while IT enables a managed block list. Audit robots.txt after any CDN toggle.
  2. Yoast llms.txt without crawler access: auto-generated llms.txt is not a substitute for allowing bots to read your pages.
  3. Training opt-out only: ai-train=no does not tell Perplexity or ChatGPT's browse mode to cite you; you still need fetchable HTML and open rules.
  4. Schema without performance: crawlable but slow pages lose trust signals. We verify Core Web Vitals on the transparency dashboard.
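
The audit in the first mistake can be partly scripted. The Python sketch below takes the text of a robots.txt and reports which bots it blocks for a given page; the sample file is a hypothetical result of a CDN toggle, not any vendor's actual output. A CDN can also refuse bot requests at the HTTP layer, which this check would not detect.

```python
from urllib.robotparser import RobotFileParser


def blocked_bots(robots_txt, bots, page_url):
    """Return the bots that these robots.txt rules stop from crawling page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, page_url)]


# Hypothetical robots.txt after a managed block group was added.
AFTER_TOGGLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

bots = ["GPTBot", "PerplexityBot", "ClaudeBot"]
print(blocked_bots(AFTER_TOGGLE, bots, "https://example.com/services"))
# ['GPTBot', 'ClaudeBot']
```

Running a check like this after each CDN or plugin change is one way to catch a block before it shows up as missing citations.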

How AI platforms fetch answers (why policy differs)

| Platform | Typical live web use | Citation behavior |
|---|---|---|
| Perplexity | Real-time search active | Source links common in answers |
| Google AI Overviews | Grounded in Google index | Sources from indexed pages |
| ChatGPT | Browse / search optional | Citations vary by mode and query |
| Claude | Browse optional | Citations vary by query |

Even when a platform leans on an existing search index, blocking AI-specific crawlers can still hurt freshness, llms.txt discovery, and brand-controlled pages you ship after indexation.

Generative share of voice — measure what access enables

After crawlers can reach your site, measure generative share of voice (SoV):

SoV (%) = (engine checks citing your domain ÷ total engine checks) × 100

LuminaForge runs a conversational query bank weekly and publishes results — including early 0% baselines — on /transparency. Client engagements receive the same reporting model.
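
The SoV formula above can be sketched in a few lines of Python. The data shape (one list of cited domains per engine check) and the sample domains are assumptions for illustration, not LuminaForge's reporting format.

```python
def share_of_voice(checks, domain):
    """Percentage of engine checks whose cited domains include `domain`.

    `checks` is a list with one entry per engine check; each entry is the
    list of domains cited in that answer (hypothetical data shape).
    """
    if not checks:
        return 0.0  # no checks run yet, so report a 0% baseline
    citing = sum(1 for cited in checks if domain in cited)
    return 100.0 * citing / len(checks)


# Four made-up engine checks; only the second one cites the target domain.
sample_checks = [
    ["competitor.example"],
    ["yourbrand.example", "competitor.example"],
    [],
    ["directory.example"],
]

print(share_of_voice(sample_checks, "yourbrand.example"))  # 25.0
```

One citing check out of four gives 25%, and an empty query bank reports 0% rather than raising a division error.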

SMB checklist (5 minutes)

  1. Open /robots.txt — are GPTBot and PerplexityBot allowed on public pages?
  2. Confirm llms.txt and llms-full.txt exist and list services + locations.
  3. Spot-check one service page for FAQPage JSON-LD and a plain-language FAQ block.
  4. Run three conversational queries in Perplexity — is your brand named?
  5. Request a free AI Visibility Snapshot if you want LuminaForge to score the full stack.
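
Step 3 of the checklist can be spot-checked with a short script. This Python sketch, using only the standard library, looks for a JSON-LD block whose top-level @type is FAQPage in a page's HTML. The sample markup is made up, and schema nested under @graph is deliberately not handled here.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def has_faq_schema(html):
    """Return True if any top-level JSON-LD object declares @type FAQPage."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the check
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if types == "FAQPage" or (isinstance(types, list) and "FAQPage" in types):
                return True
    return False


# Made-up page fragment for illustration.
SAMPLE_HTML = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": []}'
    "</script></head><body></body></html>"
)

print(has_faq_schema(SAMPLE_HTML))                            # True
print(has_faq_schema("<html><body><p>No schema</p></body></html>"))  # False
```

A True result only confirms the markup is present; the plain-language FAQ block still needs a human read.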
