Log files are a wealth of information that can help you improve your SEO strategy. To use them properly, you need to check that the data is accurate and complete. And this is not always as obvious as you may think!
In this article, I will detail some typical mistakes caused by default configurations and incomplete log formats.
HTTP vs HTTPS
Google made many announcements about HTTPS in 2018, and migrating from HTTP to HTTPS has become a primary task for many SEO experts. For this type of project, log files are an efficient way to make sure that the migration goes smoothly.
First of all, I would advise setting up a robust log file collection, combined with appropriate segmentation in an SEO crawler such as OnCrawl. This will allow you to monitor how your redirects are picked up and how crawl budget progressively shifts from the HTTP version to the HTTPS one.
But in most cases, the default log format does not let you distinguish HTTP from HTTPS. Why? Because it lacks a field that explicitly identifies the protocol requested.
Concretely, this field could be the port (80 for HTTP, 443 for HTTPS), the scheme (HTTP or HTTPS) or even the SSL/TLS protocol version (e.g. TLSv1.2).
One significant consequence is seeing two different status codes for a single URL. When an organic visitor or Googlebot requests an HTTP page (http://www.site.com/a.html) that is correctly 301-redirected to its HTTPS equivalent (here https://www.site.com/a.html), you will find two entries for /a.html: the first with a 301 and the second with the final status code.
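To make the ambiguity concrete, here is a minimal Python sketch. It assumes a hypothetical log layout (common log format with the port appended as the last field) and made-up sample lines; your own format will differ, so treat the regular expression as an illustration rather than a ready-made parser.

```python
import re
from collections import defaultdict

# Assumed layout (hypothetical): common log format with the port as the last field.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<port>\d+)$'
)

# Standard ports only; anything else is reported as "unknown".
PORT_TO_SCHEME = {"80": "http", "443": "https"}


def statuses_by_url(lines):
    """Map (scheme, path) to the list of status codes logged for it."""
    result = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        scheme = PORT_TO_SCHEME.get(match["port"], "unknown")
        result[(scheme, match["path"])].append(int(match["status"]))
    return dict(result)


# Illustrative lines only (documentation IP range, invented sizes).
SAMPLE = [
    '192.0.2.10 - - [10/Oct/2018:13:55:36 +0200] "GET /a.html HTTP/1.1" 301 0 80',
    '192.0.2.10 - - [10/Oct/2018:13:55:37 +0200] "GET /a.html HTTP/1.1" 200 5120 443',
]

if __name__ == "__main__":
    for key, codes in statuses_by_url(SAMPLE).items():
        print(key, codes)
```

Without the trailing port, both lines would collapse into a single /a.html entry with two conflicting status codes.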
Before an HTTPS migration, make sure that your log files contain all the information required for effective monitoring.
In some cases, the port is already present in your logs. However, you can’t be sure that it is the right one.
For example, in the log format options for Apache servers, the port can be logged in three ways (canonical, local or remote), which can sometimes give different results.
In other cases, the only available logs may come from an internal layer of your infrastructure that exchanges unencrypted traffic with the other layers. It is worth checking that the logged port matches the one actually used by visitors.
The IP address
As with the port, the IP address written in your logs may be wrong, or at least different from the one you expect.
Following the layered infrastructure principle described above, the IP address in your logs could be, for example, that of your cache server requesting the targeted page or resource, rather than that of the user who made the request.
However, the IP address can be the only element that lets you separate the crawl of the “real” Googlebot from the crawl of competitors or other tools that browse your site with a falsified User-Agent. For your SEO strategy, it is worth making sure that you are working with the right information.
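This is why the logged IP needs to be the visitor's: one common check, which Google itself describes, is a reverse DNS lookup followed by a forward lookup. The sketch below is a minimal Python version of that idea. The function name and the injectable lookup parameters are my own, the accepted domain suffixes are limited to googlebot.com and google.com, and the default forward lookup only handles IPv4.

```python
import socket

# Domains accepted for the reverse DNS name (assumption: limited to these two).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The lookup functions can be replaced (e.g. for tests or caching);
    by default they use the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex is IPv4 only; IPv6 would need getaddrinfo instead.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The name must resolve back to the same IP, otherwise the
        # reverse record could simply have been forged.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (no record, DNS failure).
        return False
```

DNS lookups are slow, so in practice you would probably run this once per distinct IP and cache the answer rather than call it for every log line.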
The host (or vhost)
Some servers host several websites at the same time. In many configurations, each site (host or vhost) writes to its own log files.
These files are generally recognizable by their name (which includes the vhost) or by the name of the directory in which they are stored.
However, the server configuration sometimes causes logs from different sites to be written to a common file.
In that case, it is important, if not mandatory, to add a field that identifies the website concerned on every log line. Without it, you cannot attribute the activity recorded in the logs to a specific website. An alternative would be, for example, to log the absolute URL in the path field (the field indicating the requested URL) instead of the relative URL that is usually found there.
This would also be a good alternative to logging the port or any other field that identifies the protocol.
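As an illustration, here is a short Python sketch that counts hits per site in a shared log file. It assumes each line starts with a host:port field, similar to the vhost_combined format shipped with some Apache installations; the sample lines and the function name are invented for the example.

```python
from collections import Counter


def hits_per_site(lines):
    """Count log lines per (host, port), assuming a leading 'host:port' field."""
    counts = Counter()
    for line in lines:
        first_field = line.split(" ", 1)[0]
        host, sep, port = first_field.rpartition(":")
        if not sep or not port.isdigit():
            continue  # line does not start with host:port, ignore it
        counts[(host.lower(), int(port))] += 1
    return counts


# Illustrative lines only (hypothetical hosts under site.com).
SHARED_LOG = [
    'www.site.com:443 192.0.2.10 - - [10/Oct/2018:13:55:36 +0200] "GET /a.html HTTP/1.1" 200 5120',
    'blog.site.com:80 192.0.2.11 - - [10/Oct/2018:13:55:40 +0200] "GET / HTTP/1.1" 301 0',
    'www.site.com:443 192.0.2.10 - - [10/Oct/2018:13:56:02 +0200] "GET /b.html HTTP/1.1" 404 312',
]

if __name__ == "__main__":
    for (host, port), hits in hits_per_site(SHARED_LOG).items():
        print(host, port, hits)
```

With the host and port on every line, the same file can be split by website and by protocol without guessing.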
The server time
Make sure that your server’s clock is correct and set to the local time zone.
As a rule, every log line includes the date and time, which makes it easy to locate a given event in time.
But sometimes the time in the logs does not match the actual time of the incident, by a smaller or larger margin.
This may seem like a minor detail, but when you are trying to identify the precise time and cause of a spike in errors, for example, it is better to be able to spot the relevant lines quickly and easily.
Don’t forget to check this configuration point with your hosting provider.
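To check for such a gap yourself, you can convert log timestamps to UTC and compare them with a time you know for certain. The sketch below is a minimal Python example; it assumes the usual Apache-style timestamp with an explicit offset (for example 10/Oct/2018:13:55:36 +0200), and the helper names are my own.

```python
from datetime import datetime, timezone

# Assumed timestamp layout: day/month/year:time followed by a UTC offset.
# Note: %b parses English month abbreviations under the default C locale.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def to_utc(log_timestamp):
    """Parse a log timestamp and return it as an aware datetime in UTC."""
    local = datetime.strptime(log_timestamp, LOG_TIME_FORMAT)
    return local.astimezone(timezone.utc)


def drift_seconds(log_timestamp, reference_utc):
    """Seconds between the logged time and a trusted UTC reference time."""
    return (to_utc(log_timestamp) - reference_utc).total_seconds()


if __name__ == "__main__":
    logged = "10/Oct/2018:13:55:36 +0200"
    # Hypothetical trusted time for the same event, e.g. from a monitoring tool.
    reference = datetime(2018, 10, 10, 11, 55, 0, tzinfo=timezone.utc)
    print(to_utc(logged).isoformat(), drift_seconds(logged, reference))
```

A drift of a few seconds is usually harmless; a drift of exactly one or several hours suggests a time zone setting rather than a clock problem.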
Keep in mind that log files are rarely set up perfectly without action on your part.
Any change to the way your logs are written will not be retroactive. Therefore, the sooner you optimize their format, the sooner your log analysis will be effective and useful for steering your SEO strategy.