In many cases where a website suffers from internal duplicate content, URL parameters are to blame for the majority of the duplicates. What are URL parameters, and why can they cause duplicate content?
What are parameters?
Parameters are instructions that are attached to the URL of a page to influence its content in a certain way. One of the most common use cases is the sorting and filtering of product results in an online shop. The filters can be colours, sizes and other product features, for example.
An example of this could be the following URL:
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather
With the parameters mentioned above (colour=black, size=42, material=leather), only black leather shoes in size 42 will now be displayed.
Parameters can be recognised by the fact that the first parameter in a chain is introduced with a question mark (?) and every further parameter is appended with an ampersand (&). Moreover, parameters can be combined in any order.
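As an illustration, here is a minimal TypeScript sketch using the standard URL/URLSearchParams API (available in browsers and Node.js); the shop URL and parameter names are taken from the example above:

```ts
// Build the example URL step by step. The API automatically inserts the
// "?" before the first parameter and "&" between all further ones.
const url = new URL("https://shop.domain.com/shoes/mensshoes/brand.html");
url.searchParams.set("colour", "black");
url.searchParams.set("size", "42");
url.searchParams.set("material", "leather");

console.log(url.toString());
// https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather
```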
Other common use cases for parameters are internal searches, session IDs and displaying the print version of a page.
Why can parameters create duplicate content?
The concept of a URL is that it is always a unique address of a resource on a server. An example of this is the difference between
https://www.domain.com/hello-world.html
and
https://www.domain.com/Hello-world.html
From a purely technical point of view, these are two different URLs, as a distinction is made between upper and lower case in the path.
For our example shop page, this means that we could arrange our filters in a different order, but would still find black leather shoes in size 42 under all of these URLs:
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&material=leather&size=42
https://shop.domain.com/shoes/mensshoes/brand.html?size=42&colour=black&material=leather
https://shop.domain.com/shoes/mensshoes/brand.html?size=42&material=leather&colour=black
https://shop.domain.com/shoes/mensshoes/brand.html?material=leather&colour=black&size=42
https://shop.domain.com/shoes/mensshoes/brand.html?material=leather&size=42&colour=black
For Google, each of these URLs is unique, but the content is always the same: a classic case of duplicate content.
The number of possible URLs is the factorial (n!) of the number of filters used. If we were to select a further filter, e.g. type=loafer, we could already create 24 (1 × 2 × 3 × 4) different URL orderings that all deliver the same results. With 5 filters, there would be 120 URLs with identical content.
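To make the factorial growth tangible, here is a small TypeScript sketch that enumerates every possible ordering of a set of filters (the URL and filters are the ones from the example above):

```ts
// Return every ordering (permutation) of the given items.
function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items];
  return items.flatMap((item, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)]).map(
      (rest) => [item, ...rest],
    ),
  );
}

const base = "https://shop.domain.com/shoes/mensshoes/brand.html";
const filters = ["colour=black", "size=42", "material=leather"];

for (const order of permutations(filters)) {
  console.log(`${base}?${order.join("&")}`); // prints the 6 (= 3!) URLs above
}
// A fourth filter would yield 24 URLs, a fifth 120, all with identical content.
```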
What can I do about this?
There are various ways to avoid duplicate content caused by parameters. They differ from each other in the effort required and in how they can be implemented. We would like to present a small selection.
Please note that all of these adjustments intervene, to a greater or lesser extent, in how the page works. Their implementation should therefore be carefully thought through.
Many of the options are also very technical and require help from developers and IT resources.
Option 1: Do not use unnecessary parameters
This is a technically more complex, but nevertheless cleaner option. In many cases, parameters can be avoided completely: session IDs can be saved via cookies, and print versions of a page can be created with CSS, without a new URL being needed in either case.
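As a hedged illustration of the session ID case, here is a minimal Node.js/TypeScript sketch (the cookie name sid and the port are made up) that hands out the session ID via a cookie instead of a ?sessionid=... URL parameter, so that every visitor requests the same URL:

```ts
import { createServer } from "node:http";
import { randomUUID } from "node:crypto";

const server = createServer((req, res) => {
  const cookies = req.headers.cookie ?? "";
  // If the visitor has no session yet, start one via a cookie. The ID then
  // travels in the Cookie header instead of the URL, so no duplicate
  // ?sessionid=... variants of the page can arise.
  if (!/(?:^|;\s*)sid=/.test(cookies)) {
    res.setHeader("Set-Cookie", `sid=${randomUUID()}; Path=/; HttpOnly`);
  }
  res.end("Hello");
});

server.listen(8080);
```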
Option 2: Sort parameters
This option is especially suitable if you offer many filters on the page.
You provide your system with a fixed parameter sequence, and the system assembles URLs with newly added parameters according to that sequence, so that there can only ever be a maximum of one URL per filter combination.
For our example above, this could look like this: the parameter order should always be colour > size > material > type.
So if for the URL
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&material=leather
the “size” filter is then additionally selected, the system automatically creates the URL
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather
If it is not clear during implementation which parameters may occur, another possibility is to sort the parameters alphabetically.
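Here is a minimal sketch of the alphabetical variant: the standard URLSearchParams.sort() method sorts all parameters by name, so every ordering of the same filters collapses onto a single URL. (A fixed business-specific order such as colour > size > material > type would instead require sorting against a predefined list.)

```ts
// Normalise a URL by sorting its query parameters alphabetically by name.
function canonicaliseParams(href: string): string {
  const url = new URL(href);
  url.searchParams.sort();
  return url.toString();
}

console.log(
  canonicaliseParams(
    "https://shop.domain.com/shoes/mensshoes/brand.html?material=leather&size=42&colour=black",
  ),
);
// https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&material=leather&size=42
```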
Option 3: Explain your parameters to Google
Google offers the option of categorising URL parameters via the Search Console.
Google has written its own help document on this, which you should take to heart. The URL parameter tool can be a double-edged sword: used incorrectly, it can lead to Google no longer indexing pages that actually belong in the index.
If you use this method and have created an Onpage project for your domain, you should also exclude the relevant parameters in the crawl settings.
Option 4: rel="canonical"
In many cases, this option is an easy strategy to implement, even if it is not the cleanest. The reason is that plug-ins exist for the vast majority of content management systems that enable the setting of rel="canonical" tags. This means the changes do not have to be implemented by IT. In addition, this mark-up can be read by all major search engines without any problems.
Here you select one canonical (original) version for the relevant filter combination, and all other URLs are marked with a canonical tag pointing to it.
So if we have selected
https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather
as the canonical version, the five other URLs with the identical filters all get the same canonical tag in the <head> section of the HTML source code:
<link rel="canonical" href="https://shop.domain.com/shoes/mensshoes/brand.html?colour=black&size=42&material=leather">
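If the canonical version is chosen programmatically, the tag could be generated along the lines of the following sketch, which reuses the parameter sorting from Option 2 and assumes, purely for illustration, that the alphabetically sorted URL is the chosen canonical version:

```ts
// Emit the canonical link tag for any ordering of the same filters.
function canonicalTag(href: string): string {
  const url = new URL(href);
  url.searchParams.sort(); // assumption: the sorted variant is canonical
  return `<link rel="canonical" href="${url.toString()}">`;
}
```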
Option 5: Noindex
The second approach, which can be implemented via plug-ins in most content management systems, is to set a noindex in the <head> of the page.
<meta name="robots" content="noindex">
This tells Google (and other search engines) that this document should not be included in the index. You can therefore consider which pages are important for users but have no place in the Google index.
If no further instruction is attached to the robots meta element, the crawler automatically assumes that it is allowed to follow all links – even if the document itself is not transferred to the index.
<meta name="robots" content="noindex, follow">
For our example filters, this would mean that Google follows the links on the filtered pages, but does not have the filter pages themselves in the index.
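A small TypeScript sketch of such a rule, assuming for illustration that every URL carrying query parameters is a filter page that should stay out of the index:

```ts
// Decide per URL which robots meta tag to emit.
function robotsMeta(href: string): string {
  const hasParams = [...new URL(href).searchParams].length > 0;
  return hasParams
    ? '<meta name="robots" content="noindex, follow">'
    : '<meta name="robots" content="index, follow">';
}

console.log(
  robotsMeta("https://shop.domain.com/shoes/mensshoes/brand.html?colour=black"),
);
// <meta name="robots" content="noindex, follow">
```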
Canonical and Noindex
Google advises not to combine these two specifications, as they send contradictory signals: the canonical tag declares a URL to be a duplicate of an indexable original, while noindex asks for the page to be kept out of the index entirely.
Conclusion
URL parameters can quickly lead to a confusing amount of duplicate content. Dealing with parameters is therefore not always trivial and, depending on the content management system used, it may be that many settings cannot be made without programming knowledge.
In these cases, Google offers a practical way to define the parameters of a domain more precisely with the Google Search Console. However, it is necessary to familiarise yourself with the URL parameter tool, as incorrect settings can lead to problems.
It is therefore easier in most cases to work with either the rel="canonical" tag or the noindex robots meta tag. However, please do not use them together! It is very easy to confuse Google with them.