The document type declaration at the beginning of a document indicates what type of document it is. It specifies the version of the markup language, in this case HTML, that is used, and it is how the browser knows how to render a website correctly.
If the declaration is missing, this is not a disaster, because many browsers can handle it. However, it can lead to errors in the display: if the doctype declaration is missing, the visitor’s browser automatically switches to quirks mode. This is a compatibility mode designed to render obsolete and invalid code as well as possible. So it is not a matter of ensuring functionality, but of enabling the intended rendering.
If the document type declaration is missing, it is quite straightforward to add it afterwards: it is simply inserted as the very first line of the document, above the opening html element.
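In HTML5 the declaration is simply <!DOCTYPE html>; a minimal sketch of how the start of such a document might look (title and content are placeholders):
<!DOCTYPE html>
<html>
<head>
  <title>Example page</title>
</head>
<body>
  <p>Page content</p>
</body>
</html>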
Pages without an HTML lang attribute (and also without an hreflang attribute) do not contain any reference to the language version of the website in their code. This can have an impact in two main areas: search engine optimisation and the use of screen readers.
With an attribute like lang="de", for example, a website would indicate that it is written in German. This allows search engines to deduce for which language and country searches the website is relevant. Screen readers can also determine the correct pronunciation from it.
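Set on the root element, this looks as follows:
<html lang="de">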
However, according to Google’s John Mueller, anyone doing international SEO should ignore the lang attribute entirely and rely exclusively on hreflang to signal language versions. With hreflang, references to the respective other language versions, and to the page itself, can be included in the source code, indicating unambiguously which version can be found where.
The reason for the lang attribute’s low importance is that it is frequently used incorrectly because templates are simply copied. The hreflang attribute is used correctly far more often and is therefore respected by Google.
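Placed in the head of each language version (including a reference to the page itself), this might look as follows for a hypothetical site with a German and an English version (example.com is a placeholder domain):
<link rel="alternate" hreflang="de" href="https://www.example.com/de/" />
<link rel="alternate" hreflang="en" href="https://www.example.com/en/" />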
Character encoding in HTML is controlled in the head of the document. A meta tag tells the browser the correct encoding:
<meta charset="UTF-8">
This is a classic example of a sensible character encoding. UTF-8 (Unicode) has become widely accepted for global character encoding in recent years and is now considered the standard. Its first 128 characters are identical to ASCII, so it keeps the memory requirement low for English and many Western languages and can, at least in part, be edited even in text editors that are not UTF-8 capable.
Partly for this reason, UTF-8 is the standard encoding in Germany. However, there are regions and applications where other encodings, such as UTF-16, are more common.
Why is the right encoding so important? Umlauts or accented characters are quickly displayed incorrectly on a website: question marks, boxes or other characters then appear that have nothing to do with the character originally intended. Many site owners then resort to replacing these characters with letter codes, the so-called named characters.
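For example, instead of writing the word Käse directly, the umlaut would be replaced by its named character reference:
K&auml;se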
But this replacement is only a workaround, which becomes unnecessary once the character encoding is set correctly. The problem: if the character encoding is not declared, the browser has to work out for itself which encoding is being used. If it fails to do so, umlauts and special characters are no longer readable.
An unambiguous encoding such as UTF-8, on the other hand, assigns each Unicode character a unique byte sequence of up to four bytes. If the browser knows which encoding it is dealing with, it can assign and display the characters without any problems.
Disallow
The robots.txt file is regarded as the first important signal. In it you can specify which subpages Google should not crawl at all; Google will then not send any bots there and will not capture the content. If you want to make sure that your content does not show up in Google, this is the best choice.
Create a text file with the name robots.txt.
Insert the following code into it:
User-agent: *
Disallow: /
With the asterisk after “User-agent” you address all search engine bots. If you only want to ban Google’s bots from your site, you have to name them individually after “User-agent”; with “Googlebot” you cover all of them at once. If you only want to exclude specific Google bots, you will probably find them in the following list (a short example follows the list):
- “Googlebot-Image/1.0” for Google Image Search.
- “Googlebot-Video/1.0” for videos.
- “Googlebot-Mobile/2.1” for mobile devices.
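A minimal sketch of a robots.txt that addresses only Google’s image crawler and keeps it away from the entire site; all other crawlers are unaffected by this group:
User-agent: Googlebot-Image
Disallow: /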
With the path after “Disallow” you specify what Google should not crawl. The single slash in this example blocks the entire site. However, you can also enter subfolders or individual pages there if you only want to hide parts of your site from Google.
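If, for example, only an internal area and a single page are to be excluded, the file might look like this (the paths are purely hypothetical placeholders):
User-agent: *
Disallow: /intern/
Disallow: /alte-seite.html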
Once you have entered all the information, you upload the robots.txt to the root directory of your domain. Google will then find it and will not crawl the parts of your site that you specified.
Use of HTML tags
If you don’t want to hide your entire website from Google, but only want Google not to index certain subpages, the noindex tag is the best option. You then simply include the following meta tag in the source code of the respective page:
<meta name="robots" content="noindex" />
However, while the disallow directive is a very strict instruction to Google’s crawlers, the noindex tag is perceived not as a ban but rather as advice. That is why noindex is usually used more for search engine optimisation than to prevent indexing: Google usually still finds these pages.
Hiding the content via a password query
If you protect an area of a website or even the whole website with a password, Google can’t crawl the content either. Disadvantage: Everyone who visits the website then needs a password to view the content.
This variant is also technically much more complicated. However, if you want to make sure that your content is protected from unauthorised access, it is the best choice. Most SEOs additionally set the login page to Disallow in the robots.txt to protect sensitive data.
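Assuming the login page lives under a path such as /login/ (a placeholder), the corresponding robots.txt entry would look like this:
User-agent: *
Disallow: /login/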
When a designer does not yet have the text that will later be placed in a certain position in a magazine or on a website, they use a so-called dummy text, in other words a placeholder. With this placeholder, designers can determine how the text is distributed on the page, check the space required for the font, and assess readability.
Letters and word lengths closely follow natural Latin, which ensures that the dummy text and the final text are unlikely to differ much visually. In addition, Lorem Ipsum is incomprehensible and meaningless, a garbled form of Latin, so the viewer is not distracted by the placeholder text. This is all the more true because Lorem Ipsum is now probably the best-known dummy text in Germany, and people stop reading after the first two words when they encounter the placeholder again.
In contrast to other dummy texts, however, Lorem Ipsum is less suitable for comparing fonts. For this, “The quick brown fox jumps over the lazy dog”, a pangram containing every letter of the English alphabet, is much better suited.