10 May 2026 · 8 min read

What is the difference between llms.txt and robots.txt?

robots.txt dates from 1994 and is now a formal internet standard (RFC 9309) that tells web crawlers which pages they should not fetch. llms.txt is a 2024 proposal from Answer.AI that tells AI tools which pages on your site matter most. One is exclusion. The other is curation. Small business sites need robots.txt working correctly. llms.txt is a low-cost addition for doc-heavy sites and a low priority for most others.

If you have heard the term llms.txt and assumed it was the same thing as robots.txt with a new name, you are not alone. The two files sit side by side at the root of a website. They look similar at a glance. The names rhyme. Most small business owners who ask me this question saw llms.txt mentioned in a newsletter and were quietly hoping it was just a re-spelling of something they already had.

It is not.

robots.txt and llms.txt do different jobs, sit at different levels of authority on the internet, and reach a different mix of bots. This post explains what each file is, who reads it, and what your small business website should actually have set up first.

The headline difference in one sentence

robots.txt tells crawlers what they are not allowed to fetch. llms.txt tells AI tools which pages on your site matter most. One is a block list. The other is a shortlist.

That is the whole difference in a sentence. The rest of this post unpacks it.

What robots.txt is

robots.txt is a small text file that lives at the root of your website. The file is the public, machine-readable place where you list which crawlers are allowed in and which directories they should leave alone.

It is old. The format was first proposed by Martijn Koster in 1994. It became the rulebook for every search and crawl bot for the next thirty years. In September 2022, the IETF (the group that publishes internet standards) made the format a formal standard called RFC 9309, the Robots Exclusion Protocol.

That matters. It took robots.txt from "what every serious crawler does" to "what every serious crawler is required to honour." Search engines, AI training crawlers, archive bots, and SEO tools all read robots.txt before they fetch a single page on your site.

The file looks like this in its simplest form:

User-agent: *
Disallow: /admin/
Disallow: /private/

Three lines. Three jobs. The first line names the bot the rule applies to (the asterisk means "all bots"). The next two lines list directories the bot should not fetch.

The same file controls AI bots too. OpenAI, Google, and Anthropic have each published user-agent names you can use inside robots.txt to manage how their AI products see your content. The three you most often see in 2026 are:

  • GPTBot (OpenAI). Collects training data for OpenAI's foundation models. Documented in the OpenAI bots overview.
  • Google-Extended (Google). Controls whether Google uses your content for Gemini and Vertex AI training. It is referenced inside robots.txt the same way Googlebot is, but it is a separate user-agent token with its own rules.
  • ClaudeBot, Claude-User, and Claude-SearchBot (Anthropic). Three separate bots for training data, in-product fetches, and search indexing. Documented in Anthropic's privacy centre.

The key fact: every one of those AI crawlers reads robots.txt. The file is not just for Googlebot. It is the file the AI engines also check before they decide whether to fetch your pages.
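
To see how the split works in practice, here is a sketch of a robots.txt that keeps normal Google search crawling on while opting out of Gemini and Vertex AI training. The stance itself is only an example, not a recommendation:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

Googlebot keeps indexing the site for search. Google-Extended, the training token, is told to stay out. The same pattern works for GPTBot or ClaudeBot if you choose to opt out of those.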

What llms.txt is

llms.txt is a different file with a different purpose. It is a markdown file that lists the top pages on your site, with one short note per page, written for AI tools that want a fast, high-signal summary.

Jeremy Howard at Answer.AI proposed it in September 2024. The full spec lives at llmstxt.org. It is not an internet standard. It is a community proposal that gained traction during 2025 as doc-heavy sites started adopting it.

The shape is simple. A markdown H1 with the site name, an optional one-line summary in a blockquote, and lists of links under H2 headings. A short example:

# Getrecommended

> Free and paid AI visibility checks for small business websites.

## Core docs
- [What is an AI visibility audit](/blog/what-is-an-ai-visibility-audit-for-a-small-business-website): Plain-English explainer of the four pillars.
- [How to read your scan report](/blog/what-should-a-small-business-do-first-after-an-ai-visibility-report): Reading guide for the free scan output.

## Foundations
- [Schema and content patterns](/blog/schema-and-content-patterns-ai-engines-reward): Which schemas AI engines reward, and why.

That is the entire pattern. A shortlist with descriptions, organised by topic.

The point of the file is simple. It saves an AI tool from parsing your full HTML, ads, nav, cookie banners, and JavaScript to find out what your site is about. llms.txt hands the AI a clean map. The bigger the site, the bigger the saving.

The catch: llms.txt is only useful if AI tools actually read it. Use is mixed. Anthropic, Stripe, Cursor, Cloudflare, Vercel, Mintlify, Supabase and other doc-heavy companies all publish llms.txt files. According to BuiltWith data referenced in recent industry coverage, more than 800,000 sites had implemented it by late 2025. The IDE coding assistants (Cursor, Continue, Cline) actively use it. The major consumer AI search engines have not said they read llms.txt at scale.

That gap is worth knowing before you spend a Friday writing one. The file is cheap to produce. It is not a sure visibility lever yet.

Side by side

The two files are easier to keep straight when you see them on the same axis.

| | robots.txt | llms.txt |
|---|---|---|
| What it does | Tells crawlers which pages they may not fetch | Tells AI tools which pages matter most |
| Type of action | Exclusion (block list) | Curation (shortlist) |
| Format | Plain text, line-by-line directives | Markdown, structured as a linked outline |
| Status | IETF standard (RFC 9309, 2022) | Community proposal, no standards body |
| First published | Original draft 1994. Standard 2022. | September 2024 |
| Where it lives | yourdomain.com/robots.txt | yourdomain.com/llms.txt |
| Who reads it | Every mainstream crawler, including all AI training and search bots | IDE coding agents and documentation tools today. Major AI search engines unconfirmed. |
| Risk of getting it wrong | High. A bad robots.txt can hide your whole site from search. | Low. A bad llms.txt is mostly ignored. |

Notice the last row. That is why robots.txt earns the close attention. Get it wrong and you can vanish from search overnight. Get llms.txt wrong and very little happens.

Who actually reads each file

This is the part that matters most for a small business site. Vendor blogs often skip it.

robots.txt is universal. Every mainstream crawler reads it. That includes the AI crawlers. GPTBot, Google-Extended, ClaudeBot, OAI-SearchBot, Perplexity's bot, every search engine, every archive crawler, every SEO tool. If you want to control AI access to your content today, robots.txt is the file you edit.

llms.txt is selective. As of early 2026, the confirmed readers are:

  • IDE coding assistants like Cursor, Continue, and Cline.
  • Doc tools like Mintlify that make one for every site they host.
  • Some MCP-aware tools that read it for agent context.

The major consumer AI search engines have not said they read llms.txt in any docs I can find. They might be reading it quietly. They might start backing it later. They have not committed yet.

I know how this lands. After three months of "AI is changing everything," hearing the new file is not yet a visibility lever can feel like one more thing to write off. It is not. It just sits in a different bucket. llms.txt is a low-cost prep move, not a top-of-funnel fix.

What a small business website actually needs

Here is the practical order, written as a checklist for an owner who has thirty quiet minutes on a Friday afternoon and wants to do this once, properly.

1. Check robots.txt exists. Type yourdomain.com/robots.txt into a browser. If you see plain text, you are on track. If you see a 404, your site has no robots.txt and the default is "everything is open." That is fine for most small business sites, but it means you have no way to block specific bots later without first creating the file.
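
If the file is missing and you want to create one, a minimal, permissive starting point might look like the sketch below. Everything stays open to crawlers, the Sitemap line is optional, and yourdomain.com is a placeholder for your own domain:

User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml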

2. Decide your stance on AI training. If you are happy for your content to be used to train OpenAI, Google, and Anthropic models, change nothing. If you would rather opt out of training but stay visible in search, add the AI bots below to robots.txt with a Disallow rule. Both choices are defensible. Most small businesses I work with stay opted in for visibility reasons.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

3. Leave the live-fetch bots in. OAI-SearchBot fetches pages for ChatGPT's search-style answers, and Claude-User and Claude-SearchBot do the equivalent fetching and indexing for Claude. Blocking those means you do not appear in those answers at all. Do not block them by default.
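
Bots you never mention in robots.txt keep their default access, so no extra rule is strictly required. If you have added the training-bot blocks above and want the welcome to be explicit, an optional sketch is:

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /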

4. Decide if llms.txt is worth your thirty minutes. It is, if your site has deep documentation, a knowledge base, a help centre, or a long-form blog where the priority pages are not obvious from the homepage. It is less impactful for a five-page service site where the priority pages are already linked from the homepage navigation.

5. If you do write llms.txt, keep it short. A shortlist with three to five priority pages, each with a one-sentence description, beats a wall of links. The whole point of the file is curation. A messy llms.txt defeats the purpose.
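
For a typical five-page service site, the whole file can be that short. The business name, links, and descriptions below are invented purely to show the shape:

# Acme Plumbing

> Emergency and scheduled plumbing for homes and small landlords.

## Key pages
- [Services](/services): What we fix, from burst pipes to full bathroom installs.
- [Pricing](/pricing): Call-out fees and typical costs for common jobs.
- [Service area](/areas): The areas we cover and typical response times.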

That is the full job. Most owners can do the robots.txt review in fifteen minutes with help from a dev or a CMS plugin. The llms.txt is a secondary decision. It can wait a week or a quarter.

Common mistakes I see

A few patterns come up often enough to call out.

Treating llms.txt as a replacement for robots.txt. It is not. Adding an llms.txt to a site with no robots.txt does nothing for access control. The two files solve different problems and need to coexist.

Blocking all AI bots reflexively. A blanket ban on every AI user agent is rarely the right move. You lose visibility in answers without preventing your content from being talked about elsewhere on the web. Choose deliberately.

Forgetting subdomains. robots.txt and llms.txt only cover the domain they sit on. If your blog lives at blog.yourdomain.com and your main site at yourdomain.com, each subdomain needs its own file. A common miss.

Listing every page in llms.txt. The file's job is to be a shortlist. Listing every page defeats the point. Pick the three to five pages a curious AI tool should read first, and stop there.

Writing llms.txt for marketing tone. llms.txt is a reference document, not a brochure. A one-line factual description of each page beats a sales line. The file is being read by a machine that cares about signal density, not warmth.

A short closing thought

If you take one thing from this post, take this: robots.txt is the file that does the heavy lifting for AI access control today, and it is the same file that has been doing it since 1994. llms.txt is a useful, low-risk addition for sites with depth, but it is not yet a guaranteed visibility lever for the major AI search engines.

Run the robots.txt check first. Decide your stance on training. Then, if it makes sense for your site, write a short llms.txt as a low-cost prep step. Both files together cost less than one afternoon for most small business sites, and the work compounds across every future AI tool that comes along.

If you would like to see how AI tools currently see your business across the major engines, the free AI visibility scan at GetRecommended.io is one option in this category. Tools in the broader category include HubSpot's AEO Grader, Otterly, SE Ranking, and similar checkers. Pick by the engines and signals you most want covered.

Frequently asked questions

Is llms.txt a replacement for robots.txt?

No. They do different things. robots.txt tells crawlers which pages or directories on your site they should not fetch, and it is a recognised internet standard (RFC 9309). llms.txt tells AI tools which pages on your site matter most for summarisation, and it is a community proposal, not a standard. A site that only has llms.txt has no access control. A site that only has robots.txt has no curated shortlist for AI tools. They sit side by side, not in place of each other.

Do I need both files for a small business website?

Most small business sites need a working robots.txt before they need anything else. That single file tells search and AI crawlers which areas to leave alone, and it is read by every major bot. llms.txt is optional. It helps most for sites with deep documentation, knowledge bases, or product references. For a typical service-business site with a homepage, services, FAQ, and contact, the gain from adding llms.txt is small. The gain from a clean robots.txt and good schema is larger.

Do AI search engines actually read llms.txt files in 2026?

Use is mixed. Anthropic, Stripe, Cloudflare, Vercel, Mintlify, Supabase and many documentation sites publish llms.txt files. The IDE coding assistants (Cursor, Continue, Cline) and some MCP integrations actively use them. The major AI search platforms (OpenAI, Google, Anthropic crawlers) have not, as of early 2026, said they read llms.txt at scale. The file does no harm, costs about thirty minutes to write, and is well worth doing for doc-heavy sites. Treat it as a low-risk addition rather than a high-impact fix.

Where should each file live on my website?

Both files live at the root of your domain. robots.txt sits at yourdomain.com/robots.txt. llms.txt sits at yourdomain.com/llms.txt. Both should return a plain text file with a 200 status. If your site runs on a CMS or static site builder, your dev (or a built-in setting) places the file. Subdomains get their own files. blog.yourdomain.com/robots.txt is a separate file from yourdomain.com/robots.txt.

If I block GPTBot in robots.txt, will I still appear in ChatGPT?

Maybe. OpenAI has several user agents that do different jobs. GPTBot collects training data. OAI-SearchBot fetches pages for ChatGPT search-style answers. Blocking GPTBot only blocks training-data collection. If you also want to be excluded from ChatGPT search results, you would need to block OAI-SearchBot as well. Anthropic and Google make a similar split: training crawlers and live-fetch crawlers. Block on purpose, not by default. The trade-off is visibility against training data.
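
As a sketch of that split, a robots.txt that opts out of OpenAI training but stays available for ChatGPT's live answers could be as short as this. OAI-SearchBot is simply not mentioned, so it keeps its default access:

User-agent: GPTBot
Disallow: /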

Is llms.txt the same as the file that controls Google search?

No. The file that historically controls Google search behaviour is robots.txt, with rules per user-agent like Googlebot. To opt out of Google using your content for AI model training, Google introduced a separate user-agent called Google-Extended that you reference inside robots.txt. llms.txt is a different file altogether and is not part of Google's documented stack. The right way to manage Google's AI use of your content is still robots.txt with a Google-Extended rule, not llms.txt.
