Crawling and Indexing for extensive websites

As soon as websites exceed the size of a typical private homepage, a number of new challenges arise. One of them is getting the existing content into the Google index as completely and as up to date as possible. While this may sound easy, very large websites are prone to unknowingly making grave mistakes, because their content is scattered across numerous databases and comes from a number of different suppliers.

Even Google has limits on the resources it can devote to capturing and storing website content. It therefore applies individual limits per domain: how many URLs are crawled per day, and how many of these pages are allowed into the Google index? Extensive websites can quickly run into these limits, which is why it is important to use the available resources as intelligently and productively as possible. In this blog post, I want to give you some background information on the topic and introduce specific methods for controlling crawling and indexing, along with their pros and cons.

Crawling-Budget & Index-Budget

Even though these two terms are closely related, there are some important differences between them. To understand them better, let's first look at the schematic (and simplified) structure of an internet search engine:

[Figure: The crawl, index, algorithm process that Google uses.]

In order to have any chance at all of being considered for a search phrase by the search engine's ranking algorithm, a piece of content on a domain first needs to be discovered and captured by the crawler and then added to the index.

Google has designed the Googlebot's behaviour around two goals: discovering new content rapidly, and reliably identifying and collecting content that is hidden deep within a site. How much effort Google puts into these goals depends on the domain's crawl-budget. Google has resisted calls to treat all domains equally and instead assigns each domain its own crawl-budget. This crawl-budget determines how often the Googlebot crawls the first few levels of a domain and how deep a regular “deep crawl” goes.

We see something similar with the index-budget: it determines the maximum number of URLs that can be added to the Google index. Keep in mind that only URLs which are crawled regularly will stay in the index.

Your enemies: web developers, JavaScript and general chaos

It could all be so easy. In theory, every piece of content you have should have a unique, logical, easy-to-understand URL – one that stays exactly the same for decades to come.

Sadly, this utopia does not survive contact with the real world: web developers decide to create a third print version of a page, the Googlebot learns a bit more JavaScript and suddenly discovers completely new URLs, and the website gets its third CMS relaunch in two years, leaving the original URL concept in tatters. All of this ends the same way: Google crawls unnecessary URLs and wastes the domain's crawl-budget, which is then missing elsewhere – especially on comprehensive projects. This can be the reason why a domain does not take up as much space in the Google index as it could, and therefore stays below its maximum longtail potential.

Panda, duplicate-content fright and index hygiene

It should be clear by now that it is imperative to specifically control crawling and indexing for extensive domains. Good index hygiene also brings a few additional advantages. While Google has been trying to reassure everyone for years that duplicate content does not pose a problem, reality kindly begs to differ. Bringing order and system to your crawling will enable you to notice duplicate-content problems early on and take the necessary countermeasures. Having fewer but high-quality pieces of content in the index may also help you against one of Google's furry terrors: Google Panda.

Choose your weapons: robots.txt, noindex, canonical

So much for the theory; now we move on to practice: how do I keep my domain clean? Luckily, you have a large and versatile arsenal of tools at your disposal to reach this goal. I want to quickly show you the most important ones and discuss their advantages and disadvantages.

The robots.txt file

Instructions in the robots.txt file are the oldest instrument for keeping search engines away from specific parts of your site. While the syntax was pretty simple in the beginning, there have been numerous extensions, mostly thanks to Google, which let you cover almost all bases. The advantage of the robots.txt: the Googlebot will not visit the prohibited content at all, which means no crawl-budget is used. The disadvantage: if Google is convinced that the content is important nonetheless (because there are many external links to that URL, for example), the URL will still show up in the SERPs (Search Engine Result Pages) – just without a title and snippet.
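To make this concrete, here is a minimal robots.txt sketch; the directory and parameter names are hypothetical placeholders, and the wildcard pattern in the last rule is one of the extensions that Google supports:

```
# Applies to all crawlers
User-agent: *
# Hypothetical print-version directory
Disallow: /print/
# Hypothetical internal search results
Disallow: /search
# Wildcard pattern (a Google-supported extension)
Disallow: /*?sessionid=
```

Remember that these rules only stop crawling, not indexing – a blocked URL can still end up in the SERPs if enough external links point to it.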

Noindex instruction

The noindex instruction always refers to a specific URL. It can either be part of the HTML source code of a page, as a meta tag, or it can be sent in the HTTP header. The latter is especially interesting for other file formats, such as PDF and Word documents. In order to see the noindex instruction at all, the Googlebot first has to process the URL; this uses up crawl-budget but no index-budget. Noindex is the only reliable method to ensure that a URL will not appear in the SERPs under any circumstances. Keep in mind that Google needs to read the instruction in order to follow it, which means you must not also block the URL through the robots.txt file.
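For reference, a quick sketch of both variants (the response shown is a hypothetical example). The meta tag goes into the page's <head>:

```
<!-- Variant 1: meta tag in the HTML <head> -->
<meta name="robots" content="noindex">
```

For PDFs, Word documents and other non-HTML files, the same instruction can be sent as an HTTP response header:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```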

Canonical information

First off: there are very few legitimate applications for the canonical information. If a web developer suggests using a canonical, it is often because the actual problem is not going to be solved and only its repercussions are to be mitigated through the canonical-tag. Unlike robots.txt instructions and noindex, the canonical-tag is not a binding instruction for Google, but merely a hint as to which URL houses the actual content. Google will often follow this hint but, as so often, not always. Pages with a canonical-tag use up crawl-budget so that Google can find the tag, and will likely use index-budget as well, so that Google can compare the content on the page with other pieces of content in the index. In conclusion: keep your hands off the canonical-tag whenever possible.
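If you do need it, this is what the hint looks like (the URLs are hypothetical): it sits in the <head> of the duplicate page and points to the URL that should be treated as the original:

```
<!-- In the <head> of the duplicate, e.g. a print version -->
<link rel="canonical" href="https://www.example.com/article/original/">
```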

Monitoring is a must

For the crawling of large and dynamically grown sites there is only one constant: every mistake that can somehow happen will definitely happen at some point. This is why it is imperative to regularly monitor the most important parameters. Google already helps you quite a bit here through the Search Console: the number of crawled and indexed pages should be a mandatory indicator. A soft spot for reading server logfiles and some fluency with shell tools can also be helpful. Finally, I would like to note that our Toolbox can also be of help for certain monitoring tasks.
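As a small example of the logfile approach, here is a sketch that assumes a combined-format access log at the hypothetical path /var/log/nginx/access.log (the field positions may differ in your setup). It lists the URLs that the Googlebot requests most often:

```
# Top 20 URLs requested by Googlebot, by request count
grep Googlebot /var/log/nginx/access.log \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn | head -20
```

Keep in mind that the user-agent string can be spoofed; for a serious analysis you would additionally verify that the requests really come from Google, for example via a reverse DNS lookup.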