Page Importance is an algorithm that computes a score Google uses to rank URLs during crawl sessions. Page Importance orders the URLs to be explored so as to optimize the crawl budget allocated to each site.
To better distribute its crawl expenses, Google needs to prioritize the pages it fetches. Following this logic of cost optimization, the Mountain View firm has published many patents about crawl scheduling. These patents help us better understand how your pages are categorized and put the crawl budget concept in perspective, a concept that plays out differently for each site depending on page types and related on-site and off-site metrics.
Google crawl frequency by page group. Each section does not have the same importance in Google's eyes.
Why can’t Google crawl all web pages?
According to InternetLiveStats, there are today more than 1.2 billion websites, each of which can have anywhere from a few pages to millions of pages to index. Counting all resources, from images to CSS files, that Google tries to analyze and understand, this represents a huge amount of data to process. It is clear that, even with hundreds of data centers, Google needs to make choices in its exploration. These choices rely on algorithms and a set of metrics that are important to know and master in order to leverage your SEO efforts.
Google would need to analyze 4 million pages per second to cover its entire index (estimated at 130 trillion pages) in one year, and that would allow for just one single update per page per year.
That is simply impossible, and one single update per page per year would be unproductive for Google!
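The figure above is easy to verify with back-of-the-envelope arithmetic, assuming the 130 trillion page estimate and one fetch per page per year:

```python
# Back-of-the-envelope check: pages per second needed to refresh
# an index of ~130 trillion pages once per year.
INDEX_SIZE = 130e12                     # ~130 trillion pages (estimate above)
SECONDS_PER_YEAR = 365 * 24 * 3600      # 31,536,000 seconds

pages_per_second = INDEX_SIZE / SECONDS_PER_YEAR
print(f"{pages_per_second:,.0f} pages per second")  # roughly 4.1 million
```

And that rate buys only a single yearly visit per page, with no budget left for discovering new URLs.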
Fetching all pages to keep its index fresh and exhaustive, and so to give the best possible answer, means returning to the same pages several times a day. Logically, this is expensive in processing time and energy. Like every company, Google knows that optimizing operating costs is essential to profitability. Planning and prioritizing crawls according to page importance is therefore critical.
Types of Google crawls and how its crawlers work
We know that Google does not explore every type of page the same way. For example, the homepage, RSS feeds and section pages are real reservoirs of freshness; Google visits them frantically. Product pages and articles, on the other hand, are knowledge sources. Google evaluates their quality and visits them at a frequency based on a score computed from a set of data: the page importance score.
Google knows each page's depth, update frequency, internal popularity, volume of content and quality of HTML semantics. It then adjusts the distribution of its crawl budget across these pages to discover new documents or to update its index.
Crawl rate by depth
An online media site, for instance, classified as such by Google, will see the visit frequency of some of its pages increase. Hot content is naturally placed at depth 1 (the homepage) and depth 2 (section headers). Most of the crawl budget's resources will be spent on these pages, and then on newly discovered URLs.
Then, depending on content richness, HTML semantics, number of links, loading speed (related to the weight of resources and the server's capabilities) and other factors such as PageRank and InRank (an OnCrawl metric), bots are sent to specific pages.
How does the Google crawl really work? #RTFM
Google's crawl is a set of simple steps that operate recursively for each site. Its goal is to fill its index precisely and exhaustively. Each crawl is simply the unstacking of a list of URLs to fetch in order to check for updates. This list of URLs is built beforehand and needs to be optimized to avoid fetching less important documents.
According to the following diagrams, published in the Google Search Appliance documentation (source), Google can answer a query correctly and quickly only if it builds a search index of your pages from a crawl. This method is presumably also used for indexing the web.
Before anyone can use the Google Search Appliance to search your enterprise content, the search appliance must build the search index, which enables search queries to be quickly matched to results. To build the search index, the search appliance must browse, or “crawl” your enterprise content, as illustrated in the following example.
1- Identifies all the hyperlinks on the page. These hyperlinks are known as "newly discovered URLs."
2- Adds the hyperlinks to a list of URLs to visit. The list is known as the "crawl queue."
3- Visits the next URL in the crawl queue.
Left: simplified visualization of Google's crawl algorithm
Right: the complete algorithm
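The three steps above (visit, discover links, enqueue) can be sketched as a minimal in-memory simulation. The link graph here is made up to stand in for real fetched pages:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: url -> links found on it.
LINKS = {
    "/": ["/news", "/products"],
    "/news": ["/news/article-1", "/"],
    "/products": ["/products/item-1"],
    "/news/article-1": [],
    "/products/item-1": [],
}

def crawl(start_url):
    """Unstack a crawl queue: visit a URL, discover its links, enqueue new ones."""
    crawl_queue = deque([start_url])
    visited = []
    while crawl_queue:
        url = crawl_queue.popleft()          # 3- visit the next URL in the queue
        if url in visited:
            continue
        visited.append(url)
        for link in LINKS.get(url, []):      # 1- identify hyperlinks on the page
            if link not in visited:
                crawl_queue.append(link)     # 2- add newly discovered URLs
    return visited

print(crawl("/"))
```

The real scheduler does not simply take the next URL in discovery order, of course: as the rest of the article explains, the queue is ordered by importance.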
How does Google pick the important pages to read during a crawl session?
Reading Google's patents, we notice that many publications related to crawlers take crawl scheduling into account. The planning of Google's machine resources is thus based on data processing algorithms, which we are going to analyze in this article.
Our research is based on three important terms that come up in the analysis of these patents: the "Crawl Budget", the "Crawl Scheduling" and the "Page Importance".
The first is not "official" according to Google, although the company has recently explained the concept in an official post. The other two terms are cited in the patents themselves (source and source), which let Google be more efficient in its exploration of the web.
Excerpt from the page importance patent, where the importance score and crawl planning concepts come together.
Google makes choices to plan its exploration. This is where the Page Importance algorithm comes in. It helps select the most relevant URLs and plan crawl sessions for each site. It thus reduces the number of irrelevant pages to explore, better optimizing Google's expenses while maintaining the index's quality and freshness.
Google uses (like OnCrawl) a fairly simple discovery/crawl/indexing method. It tries to browse a website completely, within the limits of the server's ability to respond (the "host load"), and to detect the most important pages. This strategy is based on a set of algorithms that compile on-site data. Content is indexed, and Google revisits the pages that are most essential to users, that best match high-interest queries, that were published most recently, that are updated most often, or that have the best quality.
Newly discovered URLs are fetched, but since their content is colder, once processed these pages receive more sporadic visits.
How can Google estimate the importance of fetching resources?
During its exploration, Google uses key metrics to evaluate the importance of a page, or of a group of pages, compared to another.
Here is the list of factors taken into account:
- Position of the page in the site's tree structure (depth);
- PageRank;
- Type of page or type of file;
- Inclusion of the URL in the sitemap.xml;
- InRank (internal PageRank);
- Number and variation of internal links;
- Relevance, quality and size of content;
- Update frequency;
- Source code and overall website quality.
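To make the idea concrete, here is a toy scoring function combining the factors above. The weights, the formula, and the normalized values are illustrative assumptions, not Google's actual model:

```python
# Illustrative only: a toy page importance score combining the factors listed
# above. The weights and the formula are assumptions, not Google's real model.
FACTOR_WEIGHTS = {
    "depth": -0.3,            # deeper pages score lower
    "pagerank": 0.25,
    "in_sitemap": 0.1,
    "inrank": 0.2,
    "internal_links": 0.15,
    "content_quality": 0.2,
    "update_frequency": 0.2,
    "html_quality": 0.1,
}

def importance_score(page):
    """Weighted sum of factor values (each in [0, 1], except depth in levels)."""
    return sum(weight * page.get(factor, 0.0)
               for factor, weight in FACTOR_WEIGHTS.items())

home = {"depth": 1, "pagerank": 0.9, "in_sitemap": 1, "inrank": 1.0,
        "internal_links": 1.0, "content_quality": 0.6,
        "update_frequency": 1.0, "html_quality": 0.8}
deep_product = {"depth": 5, "pagerank": 0.2, "in_sitemap": 0, "inrank": 0.1,
                "internal_links": 0.05, "content_quality": 0.7,
                "update_frequency": 0.1, "html_quality": 0.8}

print(importance_score(home) > importance_score(deep_product))  # homepage wins
```

Whatever the real weights are, the takeaway is the same: each factor is an input you can act on.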
As an SEO, you already know these optimizations. Combined with the page importance concept, they can play their full role.
How are page importance and crawl budget related?
Page importance is a score Google uses to rank the URLs to fetch, the most important ones first. Page importance helps determine which URLs are scheduled during crawl sessions, within each website's crawl budget.
The crawl budget, as seen in Google Search Console, is a macro view of the crawl. Its exploration curve includes hits on CSS and JS resources, as well as pages returning 4xx errors or 3xx redirections, and it covers all crawlers (web, but also AdWords, AdSense, images, news and video). This information is therefore too broad to be clear. Log analysis is the only way to know how Google is really behaving. Data about crawl frequency helps you check whether your page importance scores line up with your money pages.
Crawl budget is the result of the host load and the crawl/URL scheduling. In other words, it is the limit on the number of hits Google allocates per day to explore a site's relevant pages. Page importance thus helps optimize your crawl budget.
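Put together, the two ideas reduce to a simple selection problem: given an importance score per URL and a daily hit budget set by the host load, keep the top-scored URLs. The scores and the budget below are made-up numbers for illustration:

```python
import heapq

# Hypothetical importance scores; DAILY_BUDGET stands in for the host load limit.
scores = {
    "/": 0.95,
    "/news": 0.90,
    "/products": 0.70,
    "/products/item-1": 0.40,
    "/legal/terms": 0.05,
}
DAILY_BUDGET = 3  # max hits the crawler will spend on this site today

def schedule_crawl(url_scores, budget):
    """Keep only the `budget` most important URLs for today's crawl session."""
    return heapq.nlargest(budget, url_scores, key=url_scores.get)

print(schedule_crawl(scores, DAILY_BUDGET))
# ['/', '/news', '/products']
```

Low-importance pages such as "/legal/terms" simply never make the cut, which is exactly why raising a money page's importance score matters.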
How to maximise your page importance?
Optimise load time
The first lever is the "host load". Decreasing your response time, using CDNs and caching servers, and returning 304s on unchanged resources drastically reduce load times and maximize your host load.
The faster a page loads, the higher its crawl frequency.
Make sure indexing bots don't run into obstacles as they explore your site by monitoring the status codes they receive.
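The 304 mechanism mentioned above relies on standard HTTP conditional requests: the bot sends the date of its cached copy, and if the resource hasn't changed, the server answers 304 with no body. A minimal sketch of the server-side check (the page content and date are placeholders):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(resource_mtime, if_modified_since=None):
    """Return (status, body) for a conditional GET.

    If the client's If-Modified-Since date is at least as recent as the
    resource's last modification, answer 304 with no body: the bot keeps
    its cached copy and your host load is spared."""
    if if_modified_since is not None:
        client_date = parsedate_to_datetime(if_modified_since)
        if client_date >= resource_mtime:
            return 304, b""
    return 200, b"<html>...full page...</html>"

# Placeholder modification date for a resource.
mtime = datetime(2017, 5, 1, 12, 0, tzinfo=timezone.utc)

status, _ = respond(mtime, if_modified_since=format_datetime(mtime))
print(status)  # 304: nothing to re-download
```

Most web servers handle this for static resources out of the box; the point is to check, in your logs, that 304s are actually being served.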
Optimise internal linking
Pages essential to your business need to receive links from your most important pages, such as the homepage. Every page that receives a link from the homepage (a page at depth 2) is among your most important pages. Build evergreen links pointing to your pages with high potential.
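The effect of such links can be illustrated with a toy internal PageRank computation, in the spirit of OnCrawl's InRank (the graph and the damping factor are standard PageRank assumptions, not OnCrawl's exact formula):

```python
# Toy internal PageRank: a page linked from the homepage accumulates more
# weight than a page buried deeper. Made-up graph, standard damping of 0.85.
LINKS = {
    "home": ["money-page", "category"],
    "category": ["money-page", "deep-page"],
    "money-page": ["home"],
    "deep-page": [],
}

def internal_pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:                 # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

ranks = internal_pagerank(LINKS)
print(ranks["money-page"] > ranks["deep-page"])  # two internal links beat one
```

The money page, linked from both the homepage and the category, ends up with more internal weight than the page linked only once.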
Foster different types of documents
Using PDFs that link to your important pages can be a point of improvement, but it requires careful duplicate content management to avoid copying your site's content. We noticed that pages containing HTML tables of data are crawled more often. That is a point of improvement for e-commerce vendors, who can play with HTML semantic quality to boost the crawl of product pages.
Keep your sitemap.xml updated
Keeping your sitemap.xml files updated is an often neglected task. However, listing your pages in these documents can help you maximize their importance.
Pages included in the sitemap.xml vs. pages outside it but still in the site architecture.
Reduce irrelevant links
Being able to create pages that don't carry too many links is the most important lever. Get rid of mega-menus and footer links beyond the homepage. These heavily duplicated blocks of links reduce the power each link passes; they are often badly optimized and drag down the InRank of a silo's pages. To maximise the power of each link, pages need to have a reduced number of outbound links.
Knowing the linking contribution of one group of pages to another helps optimise the overall internal linking.
Optimise content and volume
Creating pages with rich content and semantic data is a top-priority optimisation. The longer an article is, the higher the importance score Google gives it. The same goes for category pages, which should not be mere pages full of links but should carry a substantial volume of text.
The more text a page has, the more information it gives users, and the more often Google will come by.
When optimizing a website's visibility, it is crucial to understand the algorithms of the search engine you are targeting. We know that the combined analysis of logs and crawls helps OnCrawl users compare the metrics mentioned above with Google's real behavior on a website.
A log analysis lets you track the visit frequency of Google's bots on your pages and identify barriers to the site's exploration. You can monitor the returned status codes and the weight of pages in real time.
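Extracting that crawl frequency from raw server logs is straightforward. A minimal sketch, with made-up log lines in the standard combined log format:

```python
import re
from collections import Counter

# Sample access log lines (combined log format, made-up data).
LOG_LINES = [
    '66.249.66.1 - - [10/May/2017:06:12:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/May/2017:06:12:05 +0000] "GET /news HTTP/1.1" 200 8300 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/May/2017:09:40:11 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '203.0.113.7 - - [10/May/2017:09:41:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

LINE_RE = re.compile(r'"GET (?P<url>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_hits(lines):
    """Count Googlebot hits per URL. A real pipeline should also verify the
    bot with a reverse DNS lookup, since the user agent can be spoofed."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = LINE_RE.search(line)
        if match:
            hits[match.group("url")] += 1
    return hits

print(googlebot_hits(LOG_LINES))  # Counter({'/': 2, '/news': 1})
```

Aggregated by day and by page group, these counts are exactly the crawl frequency curves discussed throughout this article.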
Crawling your website gives you a set of pages and metrics to monitor for page importance: depth, inclusion of URLs in the sitemap.xml, InRank, number and quality of links, and the HTML and semantic quality of each page.
Finally, the combined analysis helps you easily monitor crawl frequency by page importance metric. Accessing the data becomes easier, and you save time.
Don’t waste time, monitor your page importance with OnCrawl!