[Webinar Digest] SEO in Orbit: Unlocking the secrets of indexing - Oncrawl


The webinar Unlocking the secrets of indexing is a part of the SEO in Orbit series, and aired on June 12th, 2019. In this episode, Kevin Indig shares his thoughts on getting pages indexed, how the pages indexed for a site influence site-wide rankings, and what pages shouldn’t be indexed. What is the right approach towards this intermediary step between getting pages discovered and getting them to appear on SERPs?

SEO in Orbit is the first webinar series sending SEO into space. Throughout the series, we discussed the present and the future of technical SEO with some of the finest SEO specialists and sent their top tips into space on June 27th, 2019.


Presenting Kevin Indig

Kevin Indig has helped startups acquire +100M users over the last 10 years. He is VP SEO & CONTENT @ G2, a mentor for Growth @ GermanAccelerator, and ran SEO @ Atlassian and Dailymotion previously. His specialty is user acquisition, brand building, and user retention. Companies Kevin worked with include eBay, Eventbrite, Bosch, Samsung, Pinterest, Columbia, UBS, and many others. He also runs the curated technical marketing newsletter, Tech Bound.

This episode was hosted by Rebecca Berbel, the Content Manager at Oncrawl. Fascinated by NLP and machine models of language in particular, and by systems and how they work in general, Rebecca is never at a loss for technical SEO subjects to get excited about. She believes in evangelizing tech and using data to understand website performance on search engines.

Definitions

One of the reasons it’s important to talk about indexing is that it’s a complex topic. Many SEOs struggle with indexing and how to influence it.

It’s time for another SEO quiz.
You create a new page. Which of the following will keep it out of Google’s index?
A. Meta robots noindex
B. Robots.txt block
C. Giving the page meta noindex *and* blocking it in robots.txt
— Will Critchlow (@willcritchlow) June 9, 2019

– Crawling

Crawling in simple terms is the technical discovery process of search engines understanding a web page and all of its components.

This helps Google find all of the URLs that it can then go back and render, and then index and eventually rank.

– Google’s 3-step process

Crawling is part of Google’s 3-step process that leads up to being able to create search results:

Crawling
Rendering
Indexing

These are technically different processes, handled by different programs, or parts of the search engine.

Ranking is potentially a fourth step in this process.

– Indexing

Indexing is the process of Google adding URLs to its long “list” of possible results. If Kevin has to avoid the word “index” in a definition of indexing, he’d prefer to talk about a metaphorical “list”: Google has a “list” of URLs that it can use to rank and show as best results to users.

– Log files

Web servers keep a record every time anyone or anything asks for a page or a resource on the server.

Kevin is really passionate about log files as a source of truth when it comes to understanding how Google crawls and renders your site.

In the logs, we can find server information as to how often Google visits your site and what it does there, in very plain and simple terms. Log files contain individual records of each visit to the site.

You can get a ton of information from log files:

Specific status code errors
Problems with crawling
Problems with rendering
How much time Googlebot spends on your site
Which Googlebots come to your site. For example, with the Mobile First index, the main Googlebot used for indexing has recently been updated.
Whether your technical site structure is something that Google follows, or if you have something there that can be optimized.
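
As a rough illustration of how log files can answer some of these questions, here is a minimal Python sketch. It assumes logs in the common "combined" format; the sample lines, IP addresses, paths, and function names are invented for the example. It also identifies Googlebot by user-agent string alone, which can be spoofed, and it lumps every non-mobile Googlebot variant together as "desktop", so treat the result as a first approximation rather than a verified count.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "method path protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def googlebot_summary(lines):
    """Count status codes and Googlebot variants for hits whose user agent names Googlebot."""
    statuses = Counter()
    agents = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None or "Googlebot" not in record["agent"]:
            continue
        statuses[record["status"]] += 1
        # Simplification: any Googlebot user agent containing "Mobile" is the smartphone bot.
        kind = "smartphone" if "Mobile" in record["agent"] else "desktop"
        agents[kind] += 1
    return statuses, agents

# Hypothetical sample data
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Jun/2019:06:25:14 +0000] "GET /products/widget HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Jun/2019:06:25:20 +0000] "GET /old-page HTTP/1.1" 404 312 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 '
    '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Jun/2019:06:26:01 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

statuses, agents = googlebot_summary(SAMPLE_LOG)
print(dict(statuses))  # one 200 and one 404 from Googlebot
print(dict(agents))    # one desktop hit and one smartphone hit
```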

Ways to check indexing

– Not recommended: “site:” queries

When Kevin started out in SEO about 10 years ago, he would see what pages on his site were indexed by running “site:” searches on Google. While he still uses this sometimes, it’s no longer a reliable way to find out whether a URL is indexed.

More recently, he asked John Mueller about this strategy; he verified that this is no longer a recommended way to check what Google has or hasn’t indexed.

– Recommended: Search Console URL Inspection

John Mueller instead recommends using the URL Inspection Tool in the Search Console to check what has been indexed.

The cached page is not always representative of what’s indexed, and it’s generally only the static HTML that was fetched (if there’s JavaScript on it, it usually doesn’t run within the cached hosting). I’d focus more on the URL inspection tool.
— John (@JohnMu) May 8, 2019

– Recommended: XML sitemaps and the Coverage Report

One way to check a batch of your URLs is to submit an XML sitemap in the Search Console, then look at that sitemap in the Search Console’s Coverage Report.

Importance of distinguishing between crawl-render-index

As mentioned, there’s a 3-step process in which Google crawls, renders, and indexes a page. It’s very important to distinguish between each of these steps. As the web becomes more sophisticated, Google has had to adapt, separating and improving these processes individually.

Different Googlebots

Multiple Googlebots are used by Google to crawl and render websites. You have different types of resources: images, videos, news, text… Google uses different Googlebots to understand each type of content.

About a month before this webinar, Google announced that Googlebot is now “evergreen”: its rendering engine has been upgraded to run the latest version of Chromium.

This is important, as crawling and rendering are necessary steps that lead up to indexing.

Changing priorities in Google’s process

For indexing purposes, Google used to crawl with the desktop Googlebot. That has been changed; they now use the smartphone Googlebot for indexing purposes.

Mobile-First indexing will be imposed starting in July 2019 for all new sites, and is coming up for all known existing sites if they haven’t already been switched.

Crawl: ways Google finds URLs to index

To be able to index a page, Google has to crawl it.

As the first step in the process leading up to indexing, to make sure your pages get indexed correctly and quickly, you need to make sure that your crawling is “safe and sound”.

There are basically three ways Google finds URLs:

Links: this is what the whole PageRank patent was based on–finding new sites through hyperlinks
XML site maps
Past crawls

– How Google prioritizes URLs (Crawl budget)

Google prioritizes which sites it crawls, and how often. This is often referred to as “crawl budget”.

There was an article in the Google Webmaster blog about crawl budget that gave a few ideas as to how Google prioritizes which sites to crawl.

– Popularity: backlinks and PageRank

One of the points established by this article is that PageRank is a main driver behind indexing speed and volume for a website.

Backlinks, of course, are a major component of PageRank, and therefore have an influence on crawl rate and indexing.

– Status codes

Status codes are also taken into account. For example, if you have a lot of 404 pages on your site, that will likely lead Google to reduce the frequency of crawls.

Redirect chains and loops are another example.
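
To make the idea of redirect chains and loops concrete, here is a small Python sketch that walks a redirect map and labels each starting URL. The redirect map, the paths, and the function name are hypothetical; in practice the map would come from your own crawl or server configuration.

```python
def follow_redirects(start, redirects):
    """Follow redirects from `start` and return (path_taken, outcome).

    outcome is "loop" if a URL repeats, "chain" if more than one hop
    is needed to reach the final URL, and "ok" otherwise.
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, "loop"
        seen.add(current)
    hops = len(path) - 1
    outcome = "chain" if hops > 1 else "ok"
    return path, outcome

# Hypothetical redirect map: source path -> target path
REDIRECTS = {
    "/a": "/b",
    "/b": "/c",      # /a -> /b -> /c is a chain
    "/x": "/y",
    "/y": "/x",      # /x -> /y -> /x is a loop
    "/old": "/new",  # a single, direct redirect
}

for url in ("/a", "/x", "/old"):
    print(url, follow_redirects(url, REDIRECTS))
```

Chains can usually be fixed by pointing every source directly at the final URL; loops have to be broken by hand.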

– Site hygiene

If your site is organized in a way that wastes a lot of crawl budget, Google might reduce how much time it spends on your site.

– Page speed and server response time

Crawl budget is also impacted by page speed and server response time. Google doesn’t want to DDoS your site; if it sees that your server has a hard time providing pages and resources at the rate it requests them, it will adjust to what your server can handle in terms of crawling.

Rendering: Caffeine update

The Caffeine update that came out a few years ago was basically an update to Google’s rendering structure.

Indexing: Different clusters for content types

There are different archives of indexes that Google uses to return different results. It’s reasonable to imagine that there are different clusters in the index for news results, and another for image results, etc.

Ranking: Separate algorithms

Finally, indexed URLs are ranked–but this is a totally different algorithm.

Improving indexing speed

Both getting pages indexed faster and getting more pages indexed are heavily influenced by PageRank and therefore by backlinks. But the strategies for improving each one are different.

If you want pages to get indexed faster, you want to optimize the first two steps (crawling and rendering). This will include components like:

Internal linking
Sitemaps
Server speed
Page speed

Improving number of pages indexed

If you want to get more pages indexed, that’s where the crawling aspect is more important. You will want to make it easier for Google to find all of your pages. This is simple on a small website with a thousand URLs, but is much harder on a larger site with millions of URLs.

For example, G2 has a ton of pages of different page types. Kevin’s SEO team wants to make sure that Google is able to find all pages, no matter the crawl depth and no matter how many pages of that type exist; this is a major challenge that has to be approached from different angles.

Variation in crawl rates according to page profile

Based on the type of page, Kevin often finds different crawl rates by Google. This often depends on the URL’s backlink profile and internal linking. This is where he finds the most use of log files.

He segments his site by page type in order to understand where the site lacks crawl efficiency or where crawl efficiency is too high.
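
One simple way to segment crawl data by page type is to map URL patterns to labels and count Googlebot hits per label. The sketch below is only an illustration: the URL patterns, the page-type names, and the sample paths are assumptions, and a real site would need its own rules.

```python
import re
from collections import Counter

# Hypothetical page types, each identified by a URL pattern
PAGE_TYPES = [
    ("product", re.compile(r"^/products/")),
    ("category", re.compile(r"^/categories/")),
    ("blog", re.compile(r"^/blog/")),
]

def page_type(path):
    """Return the first page type whose pattern matches the path, else "other"."""
    for name, pattern in PAGE_TYPES:
        if pattern.search(path):
            return name
    return "other"

def hits_by_page_type(paths):
    """Count crawled paths (for example, Googlebot hits from log files) per page type."""
    return Counter(page_type(path) for path in paths)

# Hypothetical paths requested by Googlebot
CRAWLED = [
    "/products/widget",
    "/products/gadget",
    "/categories/tools",
    "/blog/indexing-tips",
    "/products/widget",
    "/about",
]

print(hits_by_page_type(CRAWLED))
```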

Relation between crawl rate, indexing speed, and rank

Kevin has observed definite correlations between crawl rate, indexing speed, and rank for each type of page. This has been true not only across the sites that he has worked with, but also in correspondence with other SEOs in the industry.

Without positing a causality between crawl, indexing, and ranking, similar elements that drive indexing also appear to be taken into account when it comes to ranking a page. For example, if you have a ton of backlinks to a certain page template for a given type of page (example: landing pages), what you will find in your log files is that if Google has a higher crawl rate on these pages across your site, Google also indexes these pages faster and usually ranks these pages higher than other pages.

It’s hard to make universal statements that are valid for all sites, but Kevin encourages everyone to check their log files to see if this is also true on their own site. Oncrawl has also found this to be the case across many different sites they have analyzed.

This is part of what he tried to outline with the TIPR model of internal linking that he came up with.

Measuring crawl rate

To measure crawl rate, you want to answer the question: how often does a given Googlebot come to visit a certain URL?

How you “slice and dice” this is another question. Kevin likes to look at the number of Googlebot hits on a weekly basis. You can also look at it on a daily or a monthly basis.

– Focusing on before/after

More important than the period you use is looking at changes in the crawl rate. You should look at the rate before you make changes and after they are implemented.
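
A before/after comparison on a weekly basis could be sketched as follows. The hit dates and the choice of which weeks count as "before" and "after" are made up for the example; in practice the dates would come from Googlebot entries in your log files for a given URL or page type.

```python
from collections import Counter
from datetime import date

def weekly_hits(hit_dates):
    """Count hits per ISO week, keyed by (ISO year, ISO week number)."""
    counts = Counter()
    for day in hit_dates:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts

def average_weekly_rate(counts, weeks):
    """Average number of hits per week over the given list of (year, week) keys."""
    return sum(counts.get(week, 0) for week in weeks) / len(weeks)

# Hypothetical Googlebot hit dates for one URL
HITS = [
    date(2019, 6, 3), date(2019, 6, 5),                      # ISO week 23
    date(2019, 6, 11), date(2019, 6, 12),                    # ISO week 24
    date(2019, 6, 18), date(2019, 6, 19), date(2019, 6, 20), # ISO week 25
    date(2019, 6, 24), date(2019, 6, 25), date(2019, 6, 27), # ISO week 26
]

counts = weekly_hits(HITS)
# Assumption: a change was deployed between week 24 and week 25
before = average_weekly_rate(counts, [(2019, 23), (2019, 24)])
after = average_weekly_rate(counts, [(2019, 25), (2019, 26)])
print(before, after)  # 2.0 3.0
```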

– Focusing on differences between page types

Another key to measuring crawl rate is looking at where the gaps are on your site. On a page-type level, where are the differences between crawl rates? Which page types are crawled a ton? Which page types are hardly crawled?

– Common observations in crawl behavior

Some interesting observations that Kevin has made in the past include:

Most crawled URL: robots.txt
Most time spent on a URL/group of URLs: XML sitemaps, especially when they get a bit larger

Digging through log files to find differences in crawl behavior between page types is super eye-opening. Look for what URLs are crawled on a daily basis vs which URLs are crawled on a monthly basis. This can tell you a lot about how efficient the structure of your site is for crawling (and indexing–even though there’s a step in between).

Distribution of crawl budget based on business model

To improve crawl efficiency, the strategy is usually to reduce the attention Google gives to some types of pages and redirect it to pages that are more important to the website.

The way you want to handle this will depend on how conversions are handled on the site. Kevin distinguishes two basic site models: centralized and decentralized business models:

Decentralized models can convert users on any page. A good example is Trello: you can sign up on any page. All of their page types are relatively similar. Because no page is more valuable than another for signups, the objective might be to have an even crawl rate across the whole site: you want all types of pages to be crawled at roughly the same rate.
Centralized models might be something like Jira. Jira doesn’t have a single page type that we can replicate a million times: there are only a few landing pages where people can sign up. You want to make sure that your crawl budget on a site like this is concentrated around your points of conversion (your landing pages).

How you want your crawl budget distributed comes back to the question of how your site makes money, and which types of pages play the most important role in that.
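
To check whether crawl budget is concentrated where the site converts, one option is to express Googlebot hits per page type as a share of all hits. The sketch below uses invented numbers and page-type names; which page types count as conversion points is an assumption you would set per site.

```python
def crawl_share(hits_by_type):
    """Return each page type's share of total crawl hits (values sum to 1)."""
    total = sum(hits_by_type.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in hits_by_type.items()}

def conversion_share(hits_by_type, conversion_types):
    """Share of crawl hits that land on the page types where users convert."""
    shares = crawl_share(hits_by_type)
    return sum(shares.get(name, 0.0) for name in conversion_types)

# Hypothetical weekly Googlebot hits per page type on a centralized site
HITS_BY_TYPE = {"landing": 50, "blog": 30, "docs": 20}

print(crawl_share(HITS_BY_TYPE))
print(conversion_share(HITS_BY_TYPE, ["landing"]))  # 0.5
```

On a centralized site, a low conversion share suggests crawl budget is being spent elsewhere; on a decentralized site, the per-type shares themselves show how even the crawl is.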

Addressing crawl waste

To keep Googlebots from spending crawl budget on pages that are less important to conversions, there are several methods.

The best way to skip crawling is robots.txt:

In 99.99999% of the cases, Google respects robots.txt directives.
Robots.txt can help block crawling on large sections of your site with thin or duplicate content (Classic examples: user profiles on a forum; parameter URLs…)

There are legitimate cases where you might want a page to not be indexed, but to still help with crawling. Kevin would consider some hub pages to fall into this category. This is where he would use a meta noindex.

He recognizes that John Mueller has said that meta noindex tags are eventually treated as nofollow, but Kevin has so far never seen this happen on the ground. He admits that this might be because it takes a very long time to happen (over a year, or longer). Instead, he tends to find Googlebots to be “greedy” and to search out and follow as many links as they can.

Kevin’s advice is to use robots.txt, and to use it to its full extent. You can use wildcards and some very sophisticated techniques to shield certain things from being crawled.
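
To get a feel for how wildcard rules behave before deploying them, you can test paths against patterns locally. The Python sketch below imitates the wildcard behavior Google documents for robots.txt paths (`*` matches any sequence of characters, `$` anchors the end of the URL, the longest matching rule wins, and an allow rule wins a tie). It is a simplified approximation, not Google's parser, and the rules and paths are hypothetical.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule using * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def is_blocked(path, disallow_rules, allow_rules=()):
    """Return True if the longest matching rule for `path` is a disallow rule."""
    def longest_match(rules):
        # Empty rules are skipped: an empty Disallow line blocks nothing.
        lengths = [len(rule) for rule in rules if rule and rule_to_regex(rule).match(path)]
        return max(lengths, default=-1)
    return longest_match(disallow_rules) > longest_match(allow_rules)

# Hypothetical rules for thin or duplicate sections of a site
DISALLOW = ["/users/*/profile", "/*?sort=", "/*.pdf$"]
ALLOW = ["/users/*/profile/public"]

for path in (
    "/users/42/profile",
    "/users/42/profile/public",
    "/products?sort=price",
    "/files/guide.pdf",
    "/products/widget",
):
    print(path, is_blocked(path, DISALLOW, ALLOW))
```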

The rule of thumb to follow is that the thinner the content, the more likely it is to be a candidate to exclude from crawling.

Pages excluded from crawling through robots.txt are still indexable by Google if they have internal links or backlinks pointing to them. If this happens, the description text in the search results will show that Google was unable to crawl the page due to a restriction in robots.txt. Generally, though, these pages don’t rank highly unless they’ve only recently been excluded in robots.txt.

Indexing issues due to similar pages

– Canonical errors

Programmatically, canonical declarations are extremely easy to get wrong. Kevin has seen the case a few times where the canonical has had a semicolon (;) instead of a colon (:) and then you run into tons of problems.

Canonicals are super sensitive in some cases and can lead Google to distrust all of your canonicals, which can then be a huge problem.

One of the most common problems with canonicals, though, is forgotten canonicals.

– Site migrations

Site migrations are often a source of problems with canonicals; Kevin has seen issues where the site has just forgotten to add the new domain to the canonicals.

This is extremely easy to forget, particularly when your CMS needs a manual (rather than programmatic) adjustment to make the change during a migration.

The default setting is that a page’s canonical should point to itself, unless there’s a specific reason to point to another URL.

– HTTP to HTTPS

This is another common canonical error that prevents the right URL from being indexed. The wrong protocol is sometimes used in the canonical.
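
The canonical errors described in this section (malformed URLs, old domains left over after a migration, the wrong protocol) can be caught with a simple automated check. The following Python sketch uses only the standard library; the host name, URLs, and function names are assumptions for illustration, and it treats any canonical that is not self-referencing as worth a look, even though pointing elsewhere is sometimes intentional.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class CanonicalParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")

def canonical_issues(page_url, html, expected_host):
    """Return a list of problems found with the page's declared canonical."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return ["missing canonical"]
    issues = []
    target = urlparse(parser.canonical)
    if target.scheme != "https":
        issues.append("canonical does not use https")
    if target.netloc != expected_host:
        issues.append("canonical points to another host")
    if not issues and parser.canonical != page_url:
        issues.append("canonical is not self-referencing")
    return issues

# Hypothetical page with a semicolon typo in the canonical URL
PAGE_URL = "https://www.example.com/page"
BROKEN_HTML = '<head><link rel="canonical" href="https;//www.example.com/page"></head>'

print(canonical_issues(PAGE_URL, BROKEN_HTML, "www.example.com"))
```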

– Finding source of error when Google ignores the declared canonical

Google will sometimes choose its own canonical. When they mistrust your declared canonical, there is usually a root cause.

Kevin suggests avoiding situations where you might be sending two conflicting signals to Google:

Look into your XML sitemaps
Crawl your own site and search for faulty canonicals
Look at parameter settings in your Search Console to find conflicting settings
Don’t use noindex and canonicals at the same time
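
If you already have crawl data for your own site, a check for conflicting signals along these lines can be automated. The sketch below is one possible reading of the list above: the `Page` record, its field names, and the two sitemap-related rules are assumptions made for illustration, not rules stated in the webinar.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    """Hypothetical record for one crawled URL."""
    url: str
    canonical: Optional[str]  # declared canonical URL, if any
    noindex: bool             # True if the page carries a meta noindex
    in_sitemap: bool          # True if the URL is listed in an XML sitemap

def conflicting_signals(page):
    """Return descriptions of signals on this page that contradict each other."""
    conflicts = []
    if page.noindex and page.canonical is not None:
        conflicts.append("noindex combined with a canonical")
    # Assumption: a sitemap should list only URLs you want indexed as-is.
    if page.in_sitemap and page.noindex:
        conflicts.append("noindex page listed in sitemap")
    if page.in_sitemap and page.canonical not in (None, page.url):
        conflicts.append("sitemap URL canonicalizes elsewhere")
    return conflicts

# Hypothetical crawl results
PAGES = [
    Page("https://www.example.com/a", "https://www.example.com/a", False, True),
    Page("https://www.example.com/b", "https://www.example.com/a", True, True),
]

for page in PAGES:
    print(page.url, conflicting_signals(page))
```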

Types of pages that contribute to index bloat

In SEO ten years ago, you wanted to send as many pages as possible to be indexed: the more pages indexed, the better.

Today, that’s no longer the case. You only want the highest quality stuff in your shop. You don’t want any sub-par content in the index.

“Index bloat” is usually used to describe a page type that provides no value. This often comes back to any sort of thin content, particularly cases where you multiply or amplify the number of existing pages without providing substantial value on each new page.

Classic cases where you might want to look at how many of a specific type of page are indexed, and whether they provide additional value include:

Parameters
Pagination
Forums
Directory-related pages or doorway pages
Extensive local (city) pages that don’t differentiate between services or content
Faceted navigations

How indexing affects a site as a whole

You don’t want to have subpar pages indexed today because they affect how Google sees and rates your site as a whole.

Much of this comes back to crawl budget. While Gary Illyes and John Mueller have often said that most sites don’t need to worry about crawl budget, the audience for the type of discussion we’re having today is larger sites, where it makes a big difference.

You want to make sure that Google only finds high-quality content.

Like the relationship Kevin observes between crawl rate, indexing, and ranking, he also observes that paying attention to the quality of indexed pages seems to pay off for the entire site. While it’s difficult to make universal statements, it seems that Google has some kind of site quality metric that is dependent on the indexed pages for that site. In other words, if you have a lot of low-quality content that is indexed, it seems to hurt your site.

This is where index bloat is detrimental: it’s a way to dilute or lower your overall site quality “score” and it wastes your crawl budget.

XML sitemaps for quick indexing

Kevin’s opinion is that as Google has gotten smarter, the number of “hacks” has shrunk over time.

However, on the subject of indexing, he’s found that one way to get something indexed quickly is to use an XML sitemap.

Recently G2 migrated to a new domain. They have one page type that takes a long time to be recrawled, so in Google’s index you still saw the old domain in the snippets for pages of this type. When Kevin saw that the 301 redirects were not taken into account because they hadn’t been crawled yet, he put all of the pages of this type into an XML sitemap and provided the sitemap to Google in the Search Console.

This strategy can also be used if there’s a big technical change on the site that Kevin wants Google to understand as quickly as possible.
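
Building a dedicated sitemap for one batch of URLs is straightforward to script. Here is a minimal Python sketch using the standard library; the URLs and the function name are hypothetical, and it only writes the required `loc` element of the sitemaps.org format.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes special characters such as & when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages of the slow-to-recrawl page type
URLS = [
    "https://www.example.com/reviews/product-a",
    "https://www.example.com/reviews/product-b?tab=1&sort=new",
]

print(build_sitemap(URLS))
```

The resulting file can then be submitted in the Search Console like any other sitemap.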

Growing prominence of technical SEO

Technical SEO has gained prominence over the last three years. Many times, technical SEO questions highlight areas that are really underrated.

Often you hear that content and backlinks are the only things you need to take care of. While Kevin believes these are super impactful fields of SEO, he thinks they can have even more impact if you’ve gotten your technical SEO right.

Q&A

– Bing and indexing 10,000 URLs/day

Bing offers webmasters the ability to directly submit up to 10,000 URLs per day through their webmaster tools for faster indexing.

Kevin believes this is a direction in which Google may also be headed. Even Google, as one of the most valuable companies in the world, has to safeguard their resources. This is one of the reasons why, if you waste their crawl resources, they will adjust accordingly.

Whether or not this sort of feature is worthwhile for webmasters will also depend on the size of your site. The number of sites that would benefit from being able to submit so many URLs per day is limited–probably in the thousands or tens of thousands. Kevin presumes that for these sites, Google already dedicates significant resources. It seems that for the largest sites on the web, Google does a decent job of indexing them, with the usual exceptions, of course.

It’s likely much easier for Bing to implement something on this scale: for one thing, their market share is a lot smaller, so the demand for this feature is less. Their index size is also likely a lot smaller, so they probably stand to benefit more.

– When Google ignores robots.txt

Google only very rarely ignores robots.txt.

Sometimes we assume that Google is ignoring robots.txt because, as discussed above, Google can index pages that are blocked by robots.txt when it finds them in other ways, such as internal links or backlinks.

You might also be able to get Google to ignore directives in your robots.txt if your syntax in the robots.txt file is incorrect:

Erroneous characters
Use of tags that don’t or shouldn’t work, such as noindex directives

[Note: Kevin cites a case study that found that Google respected noindex directives presented in the robots.txt file. However, shortly after this webinar aired, Google announced the end of tacit support for this directive in robots.txt files, effective September 1, 2019.]

However, Google is one of the companies that holds their bots to a high standard and doesn’t ignore robots.txt.

Top tip

“PageRank is the main driver behind indexing speed and volume.”

SEO in Orbit went to space

If you missed our voyage to space on June 27th, catch it here and discover all of the tips we sent into space.

Rebecca Berbel

Rebecca is the Product Marketing Manager at Oncrawl. Fascinated by NLP and machine models of language in particular, and by systems and how they work in general, Rebecca is never at a loss for technical SEO subjects to get excited about. She believes in evangelizing tech and using data to understand website performance on search engines. She regularly writes articles for the Oncrawl blog.
