If you use the robots.txt to block search engine crawlers from a directory or a specific page, the content of that page or directory will not be crawled or indexed. In certain cases, however, Google will still show the URL of a page that is blocked through the robots.txt in the SERPs.
- Why do I find my page in the search results even though it is blocked through the robots.txt?
- When does a blocked page appear in the SERPs?
- Google is increasingly paying attention to user signals – an example
- How to definitively keep content from showing up on the search result pages
- Uncrawled URLs in search results
You can block the directory “a-directory” and the page “a-page.html” for web crawlers by adding the following lines to the site's robots.txt:
User-agent: *
Disallow: /a-directory/
Disallow: /a-page.html
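To check that these rules block what you expect, you can run them through a robots.txt parser. The following is a minimal sketch using Python's standard library module urllib.robotparser. The helper name is_crawl_allowed and the example URLs on www.domain.com are assumptions for illustration. Note that Python's parser may not interpret every rule exactly the way Google's crawler does.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, held in a string for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /a-directory/
Disallow: /a-page.html
"""


def is_crawl_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules in ROBOTS_TXT allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder domain.
    for url in (
        "http://www.domain.com/a-directory/file.html",
        "http://www.domain.com/a-page.html",
        "http://www.domain.com/another-page.html",
    ):
        status = "allowed" if is_crawl_allowed(url) else "blocked"
        print(f"{url}: {status}")
```

A result of "blocked" only means that crawlers are asked not to fetch the URL. It says nothing about whether the URL can appear in the search results.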
Why do I find my page in the search results even though it is blocked through the robots.txt?
In certain cases, Google will show a page that is blocked through the robots.txt in the SERPs (Search Engine Results Pages).
In these cases it is important to know that the crawler does respect the robots.txt and has not added the content of the blocked page to its index. Google therefore has no information about the content of this page.
When does a blocked page appear in the SERPs?
If the blocked page has a lot of incoming links with an unambiguous link text, then Google may consider the page relevant enough to show its URL in the search results for queries that match these link texts. The content of that URL, however, is still unknown to Google, as it is unable to crawl or index the page.
You can usually recognise a page in the SERPs that is blocked through the robots.txt by its missing snippet (for example, the description).
Google is increasingly paying attention to user signals – an example
We use the robots.txt to block access to our page http://www.domain.com/grandmas-cakerecipe.html. Google’s crawlers honour our request to not crawl and index the contents of the page. Google therefore has no idea what content is in the file grandmas-cakerecipe.html.
Let us say that this page contains a world class recipe and we get a lot of incoming links from other pages, many of which use the link text “Grandma’s World Class Pie Recipe”. In such cases, our blocked page http://www.domain.com/grandmas-cakerecipe.html could appear in the search engine result pages (SERPs) for the query “Grandma’s World Class Pie Recipe”, even though we block crawlers through the robots.txt.
How to definitively keep content from showing up on the search result pages
The robots.txt is not guaranteed to keep your page out of the search results.
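To make sure that a page is definitely kept out of the search results, you should use the robots meta element with the value NOINDEX. For the crawler to be able to read this instruction, the page must not be blocked through the robots.txt at the same time.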
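As a rough illustration, the sketch below shows what such a robots meta element can look like in the head of a page, and how you might check for it with Python's standard html.parser module. The sample page, the class RobotsMetaParser and the function has_noindex are hypothetical names chosen for this example. The check only looks at the HTML source and does not replicate how Google evaluates the directive.

```python
from html.parser import HTMLParser

# Hypothetical page source containing the robots meta element.
SAMPLE_PAGE = """\
<html>
  <head>
    <title>Grandma's recipe</title>
    <meta name="robots" content="noindex">
  </head>
  <body>...</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives of all <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute may hold several comma-separated directives.
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta element with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    print(has_noindex(SAMPLE_PAGE))
```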