The webinar “A results-focused approach to log file analysis” is part of the SEO in Orbit series and aired on May 7th, 2019. In this episode, Ian Lurie and OnCrawl’s Alice Roussel discussed use cases of log file analysis that can be applied directly to your SEO strategy for measurable results.

SEO in Orbit is the first webinar series sending SEO into space. Throughout the series, we discussed the present and the future of technical SEO with some of the finest SEO specialists and sent their top tips into space on June 27th, 2019.


Presenting Ian Lurie

Ian Lurie is a digital marketing consultant with 25 years of experience. Earlier in his career, he founded Portent, a Clearlink Digital Agency. Portent provides paid and organic search and social media, content, and analytics services to B2B and B2C brands including Patagonia, Princess, Linode, and Tumi.

Ian’s professional specialties and favorite topics are marketing strategy, search, history, and all things nerdy. He spends far too much time poring over Amazon search patents, Google rankings and natural language processing theory. His random educational background includes a B.A. in History from UC San Diego and a degree in Law from UCLA. Ian recently wrote about log file analysis on the Portent blog. You can find Ian on Twitter at @ianlurie

This episode was hosted by Alice Roussel, SEO Operations Manager at OnCrawl. Alice puts her years of experience as a Technical SEO Manager on the agency side to good use; she provides daily support and hands-on training for clients so they can boost their ability to find actionable takeaways in crawl data. Passionate about data analysis and how it can be put to work for SEO, she uses technical skills to make a real difference. Her ideal day involves reading Google patents and running log analyses. You can also find her on Twitter, or blogging at https://merci-larry.com

What is a log file?

A log file is a text file stored on a server. Log files can log all sorts of different information.

Any time a client–whether a browser, a bot, or any other type of client–requests any type of file from a website, the server adds a line to the log file.

Each line includes certain information, which will vary depending on the configuration of the site and the server. Usually, you’ll find:

  • Referrer: the thing (item, website…) that was clicked or viewed that allowed the visitor to find their way to the current item on the website
  • User-Agent: an identification of the type of user or bot
  • IP address: the numeric address of the client
  • Date and time of the visit
  • URL of the page or file
  • Status code: what happened when the file was requested.
  • Size of the file: how much data was transferred when the page or file was requested. Depending on the server configuration, this is usually recorded in bytes.
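
As an illustration, the fields listed above can be pulled out of a single log line with a short Python script. This is only a sketch: it assumes the Apache/Nginx "combined" log format, and the sample line, the LOG_PATTERN name and the parse_line function are invented for this example. Your server's format may differ.

```python
import re
from datetime import datetime

# Assumption: Apache/Nginx "combined" log format. Adjust the pattern
# if your server is configured to write a different format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Invented sample line, for illustration only.
sample = (
    '66.249.66.1 - - [07/May/2019:10:15:32 +0000] '
    '"GET /blog/log-file-analysis HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["url"], entry["status"], entry["size"], entry["user_agent"])
```

Lines that do not match the assumed format return None rather than raising an error, which makes it easier to spot a format mismatch on a real file.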

Log files as a “single source of truth”

Ian credits a co-worker at Portent for the phrase “single source of truth”: the one defined place where you know you’re going to get the most accurate measure.

When it comes to measuring web traffic, log files are the single source of truth. If something on a web server is accessed, it’s going to appear in a log file. Any file that is hit by any visiting client will be recorded, assuming the server is correctly configured.

JavaScript-based analytics pixels, such as Google Analytics, can be inaccurate for many reasons:

  • They may be inaccurately configured
  • Bots may not fire them
  • Browsers that don’t support JavaScript won’t fire them
  • Slow page load can prevent them from firing

Regardless of what happens with the analytics pixel, the hit is still recorded in the log file.

Under-use of log files

Ian has to spend a lot of time persuading clients to allow SEOs to have access to their log files. Based on how difficult it is for SEOs to get access to this data, we can say that log files are badly under-used. This is also reinforced by Ian’s experience discussing with clients what type of work other agencies have done; log file analysis is never cited. There are a lot of insights to be gained from log file analysis that are not being taken advantage of today.

Log files have been around for over 20 years, but SEO has never paid much attention to them.

Use cases for log files in SEO:

– Adjusting the distribution of crawl budget

Ensuring that Googlebots spend time on your key pages and adjusting the distribution of your crawl budget is one of the principal uses of log files.
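
One rough way to see how crawl budget is distributed is to count Googlebot hits per top-level section of the site. The sketch below is written in Python and assumes the "combined" log format, with the user-agent as the last quoted field. The function name and the sample lines are invented for illustration, and matching on the string "Googlebot" does not prove that a hit really came from Google.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the request is a quoted field
# and the user-agent is the last quoted field on the line.
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (?P<url>\S+) [^"]*"')
USER_AGENT = re.compile(r'"(?P<ua>[^"]*)"\s*$')


def googlebot_share_by_section(lines):
    """Return the share of Googlebot hits for each top-level site section."""
    counts = Counter()
    for line in lines:
        ua = USER_AGENT.search(line)
        req = REQUEST.search(line)
        if not ua or not req or "Googlebot" not in ua.group("ua"):
            continue
        path = req.group("url").split("?")[0]
        section = "/" + path.strip("/").split("/")[0]
        counts[section] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {section: hits / total for section, hits in counts.items()}


# Invented sample data, for illustration only.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
PREFIX = "66.249.66.1 - - [07/May/2019:10:15:32 +0000] "
sample_lines = [
    PREFIX + f'"GET /blog/post-a HTTP/1.1" 200 5120 "-" "{BOT}"',
    PREFIX + f'"GET /blog/post-b HTTP/1.1" 200 4096 "-" "{BOT}"',
    PREFIX + f'"GET /share?id=1 HTTP/1.1" 200 300 "-" "{BOT}"',
    PREFIX + f'"GET /blog/post-a HTTP/1.1" 200 5120 "-" "{HUMAN}"',
]
shares = googlebot_share_by_section(sample_lines)
for section, share in sorted(shares.items()):
    print(f"{section}: {share:.0%}")
```

A section that takes a large share of bot hits but holds little meaningful content is a candidate for a closer look.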

– Finding hidden links causing crawl waste and ranking drops

A few years ago, Ian had a client with a big site on which every page had a “forward this to a friend” link and a “request more information” link. When the site was relaunched by the client, the rankings plunged and it was difficult to diagnose the problem since everything looked fine. On-page SEO and the Google site index indications were great.

In the log files, though, it was clear that Googlebot was hitting hundreds of thousands of pages that weren’t visible on the site. It turned out that the client thought they’d removed these links from the pages, but they’d just made them invisible. Google was spending 90% of its time crawling these invisible links, and only 10% of its time and energy on pages with meaningful content.

Because these links weren’t visible, without the log files, it would have taken weeks or months to diagnose and correct the ranking drop.

– Checking that bots hit the right pages

Ian usually looks for quality rather than quantity. It doesn’t matter whether search engine bots hit a site more or less often after an SEO analysis and changes. It’s more useful to look at the following points in order to tell whether we have the desired bot behavior:

  • Do search engine bots access the pages SEOs want them to access?
  • Do they access files and pages we don’t want them to need to access?
  • Are there certain pages that search engine bots are overlooking and not accessing?
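
These three questions map onto simple set comparisons. The Python sketch below assumes you already have two lists: the URLs you want crawled (for example from a sitemap or your own crawl) and the URLs Googlebot requested according to the logs. Both lists and the compare_crawl function are hypothetical examples.

```python
def compare_crawl(wanted_urls, crawled_urls):
    """Compare the URLs we want crawled with the URLs bots actually requested."""
    wanted, crawled = set(wanted_urls), set(crawled_urls)
    return {
        "crawled_as_intended": sorted(wanted & crawled),
        "crawled_but_unwanted": sorted(crawled - wanted),
        "wanted_but_not_crawled": sorted(wanted - crawled),
    }


# Invented sample data, for illustration only.
wanted = ["/", "/products/tents", "/products/boots", "/blog/log-files"]
crawled = ["/", "/products/tents", "/forward-to-friend?id=17", "/request-info?id=17"]

report = compare_crawl(wanted, crawled)
for label, urls in report.items():
    print(label, urls)
```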

– Checking the number of each type of HTTP status

Check how many responses of each type are actually served to bots:

  • 404
  • 302
  • 5xx
  • 200

Google Search Console doesn’t always provide the most accurate data here, so log files can give you a much more accurate measure.
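
Counting the status codes served to a bot takes only a few lines of Python. This is a sketch that assumes the "combined" log format, where the status code follows the quoted request. The function name and sample lines are invented, and filtering on the string "Googlebot" is a simplification.

```python
import re
from collections import Counter

# Assumption: the three-digit status code directly follows the quoted request,
# as in the "combined" log format.
STATUS = re.compile(r'" (?P<status>\d{3}) ')


def status_counts_for_bot(lines, bot_token="Googlebot"):
    """Count HTTP status codes on lines whose text contains bot_token."""
    counts = Counter()
    for line in lines:
        if bot_token not in line:
            continue
        match = STATUS.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Invented sample data, for illustration only.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
PREFIX = "66.249.66.1 - - [07/May/2019:10:15:32 +0000] "
sample_lines = [
    PREFIX + f'"GET / HTTP/1.1" 200 5120 "-" "{BOT}"',
    PREFIX + f'"GET /blog HTTP/1.1" 200 4096 "-" "{BOT}"',
    PREFIX + f'"GET /old-page HTTP/1.1" 404 512 "-" "{BOT}"',
    PREFIX + f'"GET /promo HTTP/1.1" 302 0 "-" "{BOT}"',
    PREFIX + f'"GET /cart HTTP/1.1" 500 0 "-" "{HUMAN}"',
]
counts = status_counts_for_bot(sample_lines)
for status, hits in sorted(counts.items()):
    print(status, hits)
```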

– Confirming whether Google follows or supports directives

By using log files to monitor Googlebot behavior, we’ve seen changes in how Google supports certain directives. Rel=canonical and rel=next/prev are big examples. We can see how Google works its way through pages, which allows us to see whether these directive declarations are actually working on a site.

It’s also worthwhile to check whether Google’s actually obeying noindex and nofollow declarations. This can tell us whether these directives are working, whether they have previously worked, or if Google has started to ignore them.

[Note: Since this episode aired, Google has announced that nofollow would be considered a hint rather than a directive. Using log files to confirm whether or not the hint is followed on your website will become increasingly important as this begins to influence ranking and, later, indexing.]

When Google recommends that SEOs use certain directives or strategies, Ian is often sceptical, as with rel=next/prev. While Ian’s advice is to follow Google’s recommendations, he also relies on log files to check whether this makes a difference in Googlebot behavior on the websites he’s working on. In the specific case of rel=next/prev, looking at Googlebot behavior on paginated pages is nothing new. We can keep an eye on this behavior to see if there are changes:

  • Are Googlebots suddenly getting caught up in pagination tunnels where they weren’t stuck before?
  • Are they starting to crawl pages in a pagination cycle, but no longer in order? (This would indicate that rel=next/prev is no longer working because Google is not treating “next” as “next” and is not following the sequence despite having identified the paginated series.)

– Monitoring delays between publishing and first organic traffic

For online publishers, this is a very common use: measuring the time it takes from when the page is put online, to the time when Google starts crawling it, to the time when it starts showing up in the rankings.

Log files are a great way to know and understand whether a page has been “viewed” by Google. They can also be a way to reassure publishers, particularly those with large sites, who are worried when content is published but not yet receiving organic traffic. Log files can help by determining whether or not the lack of traffic is because Google has not looked at the page yet. Depending on the answer, the solution will be different.
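
The first part of that measurement, the delay between publication and the first Googlebot hit, can be sketched in Python as follows. The publication times would come from your CMS; here they are invented, as are the function names and sample lines. The sketch assumes the "combined" log format and timezone-aware publication times.

```python
import re
from datetime import datetime, timezone

# Assumption: "combined" log format, with the timestamp in square brackets
# immediately before the quoted request.
LINE = re.compile(r'\[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<url>\S+) ')


def first_googlebot_hits(lines):
    """Map each URL to the earliest time a Googlebot user-agent requested it."""
    first = {}
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = LINE.search(line)
        if not match:
            continue
        seen = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        url = match.group("url")
        if url not in first or seen < first[url]:
            first[url] = seen
    return first


def crawl_delays(publish_times, lines):
    """Return the publish-to-first-crawl delay per URL, or None if not crawled yet."""
    first = first_googlebot_hits(lines)
    return {
        url: (first[url] - published) if url in first else None
        for url, published in publish_times.items()
    }


# Invented sample data, for illustration only.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_lines = [
    f'66.249.66.1 - - [07/May/2019:10:15:32 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'66.249.66.1 - - [07/May/2019:11:40:00 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "{BOT}"',
]
publish_times = {
    "/news/a": datetime(2019, 5, 7, 9, 0, tzinfo=timezone.utc),
    "/news/b": datetime(2019, 5, 7, 9, 30, tzinfo=timezone.utc),
}
delays = crawl_delays(publish_times, sample_lines)
for url, delay in delays.items():
    print(url, delay if delay is not None else "not crawled yet")
```

A None value points to the first diagnosis above (Google has not looked at the page yet), while a page that was crawled but still receives no traffic calls for a different fix.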

– Benefiting from up-to-date or real-time information

Log files can also be useful because Google Search Console may not be completely up to date. We can find out within minutes, if we’re analyzing log files in real time, whether or not Googlebots have hit a particular resource.

– Examining correlations with indexing or page performance

There’s not a perfect correlation between when a page is hit by a bot and when the page will first appear in the index. However, Ian has consistently seen that if a page is hit by a bot, it will show up in the SERPs a “short time” later.

– Observing advantages of site authority in indexing time

A “short time” can be anywhere from 30 minutes to 30 days. And in general, the more popular and acknowledged a site is as a source of authority, the quicker its pages will show up in the Google index. (Don’t yell at @IanLurie on Twitter for saying this.)

If you’re a top-20 news site and a page hit by Googlebot doesn’t show up in the index within about an hour, it’s time to worry. On the other hand, if you’re a less well-known publication and Googlebot lands on the page, you should worry if it’s been a week and your page hasn’t shown up in the index.

In short, there’s not a fixed amount of time required to get a page indexed, but you can use log files to determine what is normal for your site and to adjust your expectations accordingly.

How to improve indexing speed:

– Technical SEO

Ian will always lean towards technical SEO first in getting a page to be indexed and rank faster. Technical improvements that affect indexing speed might include, for example, site performance: the more quickly Google can crawl on your site, the more pages it will likely put into its index.

– EAT

EAT (Expertise-Authority-Trustworthiness) has an impact today on how quickly you show up in the index.

– Site architecture and hierarchy

Pages need to be linked as high up in the site hierarchy as possible. Looking at evidence in log files, it’s pretty clear–at least for the first few times a site is crawled–that Googlebot crawls a site starting at the top of the site hierarchy, usually at the home page, and working its way down.

– Demonstrate page importance through linking structure

The internal links inside your site should indicate the importance of a page. To get a page to rank more quickly, multiple pages on the site need to link to it, and it’s even better if they do so from primary or secondary navigation.

Setting up log formats for SEO

Log files aren’t standardized. There are a lot of different formats, of which W3C is likely the most common.

Regardless of the format, the most important thing is to make sure that your server is configured to store the right data in those log files. When a server is first set up, it may not be automatically configured to store the referrer, the response code, or the user-agent.

Once you understand what to look for in a log file, it’s pretty easy to interpret, no matter what the format is. It’s much more important to make sure that the right data is present. You’ll want at least the following information:

  • Referrer
  • User-Agent
  • Date and time
  • Response code

Avoiding levels of interpretation of data

SEOs should use log files because they are the single source of truth. If you are an SEO and you want an accurate look at how search engine bots are crawling your site, there’s no other way to do it.

This is also a way to remove levels of abstraction introduced by tools, no matter how useful, like Google Search Console. The Search Console is Google’s interpretation of what they saw when they visited your site. You want to see how they visited the site, without interpretation.

With this in mind, log files have so many use cases for which we often depend on interpretive tools:

  • Finding broken links
  • Finding redirect chains
  • Finding temporary redirects
  • Etc.

There’s no predefined set of use cases that make log files useful. What makes them useful is the fact that they provide an unedited, unbiased, raw set of data on how bots are crawling your site.

Using GREP to sift through large files

Ian’s favorite technical trick is the Linux command line tool, GREP, because it allows you to sift through a large file really quickly. This can be useful because log files can quickly become enormous. On a large site, in just a day or two, you can end up with files of millions of lines. And most desktop tools can’t handle a file when it gets that large.

GREP will let you do things like:

  • Filter all image requests
  • Filter all non-bot requests
  • Only look at Googlebot requests

Here’s a GREP primer, if you’re not familiar with it.
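
If you would rather stay in a scripting language, the same kind of line-by-line filtering can be sketched in Python. This is not GREP itself, only an imitation of the idea: the file is read one line at a time, so memory use stays low even on very large logs. The predicates, the bot heuristic and the sample lines are assumptions made for this example, and the sketch expects the "combined" log format with the user-agent as the last quoted field.

```python
import re

# Assumptions: "combined" log format; user-agent is the last quoted field.
IMAGE_REQUEST = re.compile(
    r'"(?:GET|HEAD) \S+\.(?:png|jpe?g|gif|webp|svg)(?:\?\S*)? ', re.IGNORECASE
)
USER_AGENT = re.compile(r'"([^"]*)"\s*$')
# Rough heuristic only: many bots include one of these words in their user-agent.
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def user_agent(line):
    match = USER_AGENT.search(line)
    return match.group(1) if match else ""


def is_image_request(line):
    return IMAGE_REQUEST.search(line) is not None


def is_googlebot(line):
    return "Googlebot" in user_agent(line)


def is_bot(line):
    return BOT_HINT.search(user_agent(line)) is not None


def grep(lines, predicate, invert=False):
    """Yield matching lines lazily; invert=True keeps non-matching lines, like grep -v."""
    return (line for line in lines if predicate(line) != invert)


# Invented sample data, for illustration only.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
PREFIX = "203.0.113.7 - - [07/May/2019:10:15:32 +0000] "
sample_lines = [
    PREFIX + f'"GET /img/logo.png?v=2 HTTP/1.1" 200 900 "-" "{HUMAN}"',
    PREFIX + f'"GET /robots.txt HTTP/1.1" 200 120 "-" "{HUMAN}"',
    PREFIX + f'"GET /blog/post-a HTTP/1.1" 200 5120 "-" "{BOT}"',
]

without_images = list(grep(sample_lines, is_image_request, invert=True))
bots_only = list(grep(sample_lines, is_bot))
googlebot_only = list(grep(sample_lines, is_googlebot))
print(len(without_images), len(bots_only), len(googlebot_only))

# With a real file, pass the open file object instead of a list, for example:
# with open("access.log") as handle:  # hypothetical file name
#     for line in grep(handle, is_googlebot):
#         print(line, end="")
```

The bot check looks only at the user-agent field, so a human request for /robots.txt is not mistaken for a bot hit.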

“Why should I give you my log files?”

This is Ian’s least favorite question. It’s difficult to understand the opposition to providing SEOs with a log file.

There are no security issues unless your server is configured very badly.

It’s easy to do: just zip up the file and send it.

Different Googlebots to monitor in log files

Googlebot with a smartphone user-agent is one of the bots that Ian follows attentively at the moment. This is a big one right now.

Googlebot-image is another one that Ian looks at a lot. There’s been a lot of evidence recently that image crawls interfere with or impact overall crawls on a site.
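
To follow these bots separately, hits can be grouped by user-agent. The Python sketch below relies on simple substring checks against user-agent strings that Google has published for its crawlers. The labels and the function name are invented for this example, and since user-agents can be spoofed, the result is an indication rather than proof.

```python
from collections import Counter


def googlebot_type(user_agent):
    """Return a rough label for a Googlebot user-agent, or None for other clients."""
    if "Googlebot-Image" in user_agent:
        return "googlebot-image"
    if "Googlebot" not in user_agent:
        return None
    if "Googlebot/" not in user_agent:
        # Other specialised crawlers, such as video or news.
        return "googlebot-other"
    if "Mobile" in user_agent:
        return "googlebot-smartphone"
    return "googlebot-desktop"


# Sample user-agent strings, for illustration only.
sample_user_agents = [
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

labels = [googlebot_type(ua) for ua in sample_user_agents]
hits_by_bot = Counter(label for label in labels if label is not None)
print(hits_by_bot)
```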

Arguments to get access to clients’ log data

If you’re not trying to be diplomatic, you can just ask: “Why not?” and then attempt to help clients understand why their issue shouldn’t be a problem.

It also comes back to the idea of a single source of truth. You can go through a site by hand, crawl it with your own crawler, look at the Google index and at the Google Search Console. But the only way to get an extremely accurate view of how Googlebots see your site is by looking at the log files.

It’s faster, easier, and more accurate than trying to work without them.

Log files and security

Log files don’t pose a security risk, particularly because SEOs don’t need access to your server to look at them: they can be provided by someone within your company.

Your log files should only be showing things that Googlebot and people are actually accessing: these are things that are already accessible to the public. So there’s no security concern there unless your website is so badly configured that Googlebot and public users are finding their way to things that they shouldn’t be seeing. In this case, you probably want your SEO to see this–and tell you!

While giving access to the file on your server might pose a potential security risk, the file itself and its contents do not.

Log files and images

Ian pays particular attention to images in log files because it’s not unusual to see images being indexed in Google Images that shouldn’t be. A lot of sites also still use “invisible” images for their layout, and a lot of images are now lazy-loaded as you scroll down a page. Log files will help identify whether or not Googlebot is having trouble with lazy-loaded images.

Image size can also be an issue, but in this case the logs are not always the best place to obtain that data; you may want to use a crawler of your own to figure that out.

When to use real-time log monitoring

There can be a benefit to using real-time monitoring with log files. It will depend on the size of the site, for one. For example, a site with a few thousand visits per day won’t get a lot out of real-time monitoring. However, if you have a few million visits per day from people, there’s a good chance you have a lot of traffic from Googlebot as well, and you’ll likely want to analyze that as quickly as possible.

In general, the bigger your site is and the more often you add new content, the more you can see benefits from real-time or near-real-time log file analysis.

Top tip

“Log files are the single source of truth in terms of web traffic.”

SEO in Orbit went to space

If you missed our voyage to space on June 27th, catch it here and discover all of the tips we sent into space.