The robots.txt Mistakes That Quietly Tank Rankings | Recon
Eight robots.txt configurations that silently exclude pages from search results, and the diagnostic sequence for catching them before traffic disappears.
A misconfigured robots.txt can deindex an entire site in days. Most major outages we've audited can be traced to a one-line change in robots.txt that nobody flagged at the time. The pattern repeats: the change is deployed on a Wednesday, traffic starts dropping on Friday, and the cause is identified the following Tuesday after a week of speculation.
The danger is in how subtle the failure modes are. The site "works fine" — pages render, links work, the CMS is happy. Search engines just stop coming. By the time the traffic drop is visible in analytics, weeks of indexing damage have accumulated.
This post covers the eight robots.txt mistakes we've seen take down client sites, the diagnostic for each, and the safer patterns.
The most catastrophic failure is a staging robots.txt that reaches production. The staging environment correctly disallows all crawling:
User-agent: *
Disallow: /
These lines get deployed to production with a CMS update, a site migration, or a hosting move. The result: every search engine is told to crawl nothing. Within 24–72 hours, GSC starts reporting "Blocked by robots.txt" for high-value pages. Within a couple of weeks, those pages lose their snippets and rankings, and most of the site effectively disappears from search results.
The fix: production robots.txt should never contain Disallow: /. Add a deploy-time check that fails the build if it does. The Meta Tag Analyzer flags this on every audit, but the damage often happens before the next audit runs.
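The deploy-time check could look something like the sketch below. It is a minimal illustration, not a full robots.txt parser: it only looks for a Disallow: / rule inside a group that applies to User-agent: *, and the function names are our own.

```python
def blocks_everything(robots_txt):
    """Return True if a group that applies to all crawlers ('*') disallows '/'."""
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # becomes True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:   # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


def check_file(path):
    """Return an exit code for CI: 1 if the file blocks the whole site, else 0."""
    with open(path, encoding="utf-8") as fh:
        return 1 if blocks_everything(fh.read()) else 0
```

In a build pipeline, the last step would be something like sys.exit(check_file("public/robots.txt")), where the path is a placeholder for wherever the production file lives. A stricter variant could flag Disallow: / in any group.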
A related mistake is disallowing the directories that hold CSS, JavaScript, or images. The intent was bandwidth conservation. The effect: Google can't render the page properly. Google's renderer needs CSS to compute layout, JS to run components, and images to evaluate above-the-fold content. Without them, the rendered DOM Google indexes is essentially blank or unstyled.
The downstream impact: Google may judge the page as not mobile-friendly, miss content that only appears after JavaScript runs, and evaluate a rendering that doesn't match what users see.
The fix: allow Google to crawl all assets needed for rendering. Modern sites should not disallow CSS, JS, or image directories. The Mobile-Friendly Tester catches this by attempting to render the page as Google would and reporting blocked resources.
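A quick way to test a robots.txt against the assets a page loads is Python's standard-library parser. The sketch below is illustrative only: the asset URLs and directory names are invented, and the stdlib parser matches plain path prefixes, so it will not reproduce Google's wildcard handling.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt, asset_urls, agent="Googlebot"):
    """Return the asset URLs that the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


# Hypothetical example: a robots.txt that blocks two asset directories.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /assets/css/
Disallow: /js/
"""

EXAMPLE_ASSETS = [
    "https://example.com/assets/css/site.css",
    "https://example.com/js/app.js",
    "https://example.com/images/hero.jpg",
]
```

Calling blocked_assets(EXAMPLE_ROBOTS, EXAMPLE_ASSETS) returns the stylesheet and script URLs; any non-empty result means the renderer is missing something it needs.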
Many sites add Disallow: /search/ to prevent search-result pages from being crawled. The intent is correct — internal search-result pages create infinite combinations and clog the crawl budget.
The mistake: disallowing them in robots.txt doesn't deindex them. If Google has already indexed them (linked from another site, or previously crawlable), robots.txt prevents Google from recrawling, but the index entries stay. The result: stale search-result pages can persist in Google's index indefinitely, because Google can no longer fetch them to refresh or remove them.
The fix: use an X-Robots-Tag: noindex response header or <meta name="robots" content="noindex"> on internal search-result pages instead of a robots.txt Disallow, not in addition to it. Google has to be able to crawl the page to see the noindex; once it does, it removes the page from the index cleanly.
The pattern that works:
# robots.txt — let Google crawl them so it can see the noindex
# (don't add Disallow: /search/)
<!-- On every internal search-result page -->
<meta name="robots" content="noindex, follow">
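To confirm that search-result pages really carry the directive, a small check like the following can be run against fetched HTML and response headers. It is a sketch under simplifying assumptions: headers are passed as a plain dict with the exact key "X-Robots-Tag", and crawler-specific forms such as "googlebot: noindex" are not handled.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html, headers=None):
    """Return True if the page signals noindex via meta tag or X-Robots-Tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    header = (headers or {}).get("X-Robots-Tag", "")
    header_directives = {d.strip().lower() for d in header.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives
```

Remember that the signal only works if the same URL is not disallowed in robots.txt.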
A per-crawler configuration can also backfire. The intent: "let Googlebot crawl everything except admin, block all other crawlers entirely". It is written as a User-agent: Googlebot group that disallows only the admin path, followed by a User-agent: * group with Disallow: /. The Googlebot-specific group is correct.
The bug: the User-agent: * group doesn't add to the Googlebot group; it applies only to crawlers that don't have a group of their own. Bingbot, for example, falls under * and is blocked entirely. The site is invisible to Bing, while the admin assumes Bing is fine because the file has no rule that names it.
The fix: write an explicit group for each user-agent you care about. Don't rely on * for crawlers you actually want to allow.
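The group-selection behavior is easy to demonstrate with Python's standard-library parser. The robots.txt strings below are our reconstruction of the configuration described above, with /admin/ as a placeholder path, and the fixed version is only one possible way to write it.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the problem configuration; "/admin/" is a placeholder.
BROKEN = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /
"""

# One possible fix: name every crawler you want, keep "*" as the catch-all.
FIXED = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, url):
    """Return True if the named crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With BROKEN, allowed(BROKEN, "Bingbot", "https://example.com/blog/") is False even though nothing in the file mentions Bing; with FIXED it is True, and unnamed crawlers stay blocked.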
Another conflict: the robots.txt lists its sitemap in a Sitemap: line and then disallows the path where the sitemap lives. Google can't fetch a blocked sitemap, so it ignores it and crawls whatever it can find through links. The site loses all the discovery benefit of having a sitemap.
The fix: never disallow the sitemap URL. The Page Speed Grader checks sitemap accessibility and flags conflicts.
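This conflict can be detected mechanically. The sketch below uses the standard-library parser (site_maps() requires Python 3.8 or later); the example paths are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemaps(robots_txt, agent="Googlebot"):
    """Return sitemap URLs declared in robots.txt that the same file disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    sitemaps = parser.site_maps() or []   # site_maps() returns None when none are listed
    return [url for url in sitemaps if not parser.can_fetch(agent, url)]


# Hypothetical example: the sitemap lives under a disallowed directory.
CONFLICTING = """\
User-agent: *
Disallow: /sitemaps/

Sitemap: https://example.com/sitemaps/sitemap.xml
"""
```

An empty list means every declared sitemap is fetchable under the file's own rules.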
Sites often disallow query parameters to prevent duplicate content:
User-agent: *
Disallow: /*?
The intent: block any URL with a query string. The unintended effect: pagination (/blog?page=2), filters (/products?category=shoes), and tracking parameters (/landing?utm_source=email) all get blocked. Pagination and filter pages stop getting crawled and indexed; tracking-parameter URLs (which should canonicalize back to the clean URL) get blocked entirely instead of canonicalizing.
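To see how broadly a pattern like /*? matches, here is a small stand-alone matcher. It is a simplified sketch of the wildcard rules most major crawlers follow ('*' matches any run of characters, a trailing '$' anchors the end), not a reference implementation, and the helper names are ours.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path_and_query):
    """Return True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path_and_query) is not None


URLS = ["/blog?page=2", "/products?category=shoes", "/landing?utm_source=email", "/blog"]
BLOCKED = [u for u in URLS if is_blocked("/*?", u)]
```

Here BLOCKED ends up containing every URL with a query string, including the pagination and filter pages; only /blog survives.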
The fix: handle parameter URLs with rel="canonical" tags pointing to the clean version, not with robots.txt disallow. Google's canonical handling is reliable for parameter normalization; robots.txt is the wrong tool.
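As an illustration of the canonical approach, the sketch below computes a clean URL by dropping tracking parameters and builds the matching tag. Which parameters count as tracking is an assumption here (anything starting with utm_); real sites will have their own list, and the function names are hypothetical.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters with these prefixes never change page content.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url):
    """Return the URL with tracking parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Return the <link rel="canonical"> tag to emit in the page head."""
    return '<link rel="canonical" href="{}">'.format(html.escape(canonical_url(url), quote=True))
```

Content-bearing parameters such as page or category are kept on purpose, so paginated and filtered URLs remain distinct, crawlable pages.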
For pagination specifically: don't block paginated URLs. Use rel="next" and rel="prev" if needed (Google deprecated these in 2019 but they remain useful signals to other engines). Let Google crawl every page.
Some sites list their private directories under Disallow. The intent: hide sensitive directories from search engines. The unintended effect: announce to the world exactly where the sensitive content lives. robots.txt is publicly readable. Anyone (including attackers) can fetch https://example.com/robots.txt and read the list of directories the site doesn't want crawled, which is also the list of directories likely to contain something worth attacking.
The fix: protect sensitive directories at the server level with authentication (HTTP basic auth at minimum), IP allow-listing, or both, and turn off directory listing. robots.txt is not a security tool and should never be the only barrier to a sensitive URL. The Security Headers Checker audits HTTPS and security headers but cannot fix this, because it is a directory-access-control problem.
A subtle one. A site launches a new section (/services/new-product-line/) and adds a Disallow rule for it "until the content is finalized". The team forgets to remove the rule. The new section never gets indexed.
The fix: never use robots.txt as a "draft" mechanism. Use noindex meta tags on draft pages, or keep them behind authentication entirely. If a page is publishable enough to be on the production domain, it's publishable enough to be crawlable.
When traffic drops unexpectedly, robots.txt is the first place to look. The sequence:
1. Fetch https://[client-domain]/robots.txt directly. Read the entire file. Look for any Disallow: / lines, any blocked asset directories, any per-user-agent conflicts.
2. Run the Google Search Console URL Inspection tool on a representative page that's losing traffic. The tool reports whether crawling of the URL is blocked by robots.txt.
3. Check the GSC Page Indexing report for "Blocked by robots.txt" entries. A sudden spike indicates a recent robots.txt change.
4. Check git log -- robots.txt (or equivalent) for recent changes. The deploy that broke things will be in the history.
5. Open the robots.txt report in Search Console (under Settings), which replaced the standalone robots.txt Tester that Google retired in 2023. It shows the version of the file Google last fetched and any fetch or parse problems; compare it against what step 1 returned.
If steps 1–5 don't surface the cause, the problem isn't robots.txt; it's somewhere else (meta robots tags, canonical issues, server-side blocking, manual penalty). But for sudden, broad traffic drops, robots.txt is the right first guess perhaps 60% of the time.
In short: no Disallow: / on production, no blocked CSS/JS/image directories, no Disallow on internal search (use noindex), explicit per-user-agent rules where needed, no Disallow on the sitemap URL, no parameter disallow patterns (use canonicals instead), no sensitive directory paths listed (use auth), no leftover draft blocks.
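The history check in the sequence above can be partly automated by diffing the previous and current versions of the file and listing only the Disallow rules that were introduced. This is a sketch; how the two versions are obtained (git show, a backup, a cached copy) is left open, and the function name is ours.

```python
import difflib


def added_disallows(old_robots, new_robots):
    """Return Disallow lines present in the new file's diff but not in the old one."""
    diff = difflib.unified_diff(
        old_robots.splitlines(), new_robots.splitlines(), lineterm="", n=0
    )
    return [
        line[1:].strip()
        for line in diff
        # Keep added lines, skipping the '+++' file header.
        if line.startswith("+") and not line.startswith("+++")
        and line[1:].strip().lower().startswith("disallow")
    ]


# Hypothetical before/after pair: a deploy appended a site-wide block.
BEFORE = "User-agent: *\nDisallow: /admin/\n"
AFTER = "User-agent: *\nDisallow: /admin/\nDisallow: /\n"
```

For the pair above, added_disallows(BEFORE, AFTER) reports the single new rule, which is usually enough to identify the deploy that caused the drop.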
Eight things. Most sites have one or two going wrong. Catch them before the traffic drop, not after.