A well-organized site enables search engines to crawl more efficiently, ensuring that valuable content is discovered, indexed, and ranked in search results.
By contrast, a complex or poorly structured site can impede this process, wasting crawl resources allocated to the website (commonly referred to as crawl budget) and diminishing the site’s visibility online.
Your website’s architecture can either facilitate or hinder Google’s ability to allocate crawl resources effectively.
Crawl budget, or as I prefer to call it, crawl resources, refers to the number of pages Google will crawl on a specific website within a given timeframe.
This budget is not infinite, which is why understanding its dynamics is critical to understanding how Google discovers new content (URLs) and content updates.
Factors such as site speed, the freshness of content, the quality of the content, and the site’s authority can influence how Google assigns crawl resources.
The relationship between quality and crawl resources is, in my opinion, an often overlooked and less discussed area of SEO. We know that quality thresholds exist for indexing, and we can also see from tests and years of looking at data that Google can perform a form of “fingerprinting” on a website’s URL structures.
What is URL fingerprinting?
URL fingerprinting is a process used by Google to analyze and categorize web pages based on their URL structure.
This method allows Google to identify patterns that suggest the potential quality, relevance, and uniqueness of content.
By examining the structural elements of a URL, including path directories, query parameters, and naming conventions, Google’s algorithms can infer the likelihood of a page containing valuable or duplicative content.
This assessment plays a pivotal role in determining whether a page is worth crawling, indexing, and ultimately, ranking in search results.
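Google has not published how this analysis works, but the intuition can be sketched in a few lines of Python. The snippet below is a minimal sketch: the `url_template` helper and its ID-detection heuristic are my own illustrative assumptions, not Google’s method. It reduces URLs to structural templates by collapsing ID-like path segments and dropping query parameter values, so that URLs generated from the same page template group together:

```python
from collections import Counter
from urllib.parse import urlparse
import re

def url_template(url: str) -> str:
    """Reduce a URL to a structural template: ID-like path segments are
    collapsed and query parameter values are dropped, so URLs generated
    from the same page template group together."""
    parsed = urlparse(url)
    segments = [
        "{id}" if re.fullmatch(r"\d+|[0-9a-f-]{8,}", seg, re.I) else seg
        for seg in parsed.path.strip("/").split("/") if seg
    ]
    param_names = sorted(p.split("=")[0] for p in parsed.query.split("&") if p)
    template = "/" + "/".join(segments)
    return template + ("?" + "&".join(param_names) if param_names else "")

urls = [
    "https://example.com/products/12345?utm_source=news",
    "https://example.com/products/67890?utm_source=mail",
    "https://example.com/blog/how-to-improve-crawlability",
]

# Count how many URLs collapse onto each structural template; a template that
# suddenly accounts for thousands of near-identical URLs is the kind of pattern
# a crawler could learn to treat with suspicion.
print(Counter(url_template(u) for u in urls))
```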
This kind of pattern-based assessment shows up a lot on websites that suddenly publish a large number of URLs with programmatic content and, more recently, with large-scale AI-generated or AI-assisted content.
Google’s use of URL fingerprinting
Google’s primary goal in indexing content is to enhance the user experience by delivering relevant, high-quality search results.
URL fingerprinting serves as a filter to achieve this goal, helping to screen out low-quality content before it consumes valuable crawl resources.
For instance, Google might identify URL patterns associated with dynamically generated pages that typically offer little unique value (e.g., session IDs, tracking parameters) and deprioritize their crawling.
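While we can’t observe how Google scores these patterns, you can audit your own inventory for them. This sketch assumes a list of URLs exported from a crawl or sitemap, and the list of “low-value” parameters is an assumption you would adapt to your own site:

```python
from collections import Counter
from urllib.parse import urlparse, parse_qsl

# Parameters that typically identify sessions or tracking rather than distinct
# content. This list is an illustrative assumption; adjust it for your site.
LOW_VALUE_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source",
                    "utm_medium", "utm_campaign", "gclid", "fbclid"}

def audit_params(urls):
    """Count query parameters across a URL inventory and flag the ones that
    usually signal duplicative, crawl-wasting URL variants."""
    counts = Counter(
        key.lower()
        for url in urls
        for key, _ in parse_qsl(urlparse(url).query)
    )
    flagged = {k: v for k, v in counts.items() if k in LOW_VALUE_PARAMS}
    return counts, flagged

urls = [
    "https://example.com/category?sort=price&sessionid=abc123",
    "https://example.com/category?sort=price&utm_source=newsletter",
]
all_params, suspicious = audit_params(urls)
print(suspicious)  # e.g. {'sessionid': 1, 'utm_source': 1}
```

Parameters flagged this way are usually better handled with canonical tags, platform-level parameter handling, or robots.txt rules, so they stop inflating the URL inventory Google has to evaluate.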
URL fingerprinting also ties into your website’s perceived inventory.
If you go from being a 2,000 URL website to a 3,000 URL website overnight, you’ve significantly increased the crawl resources you’re asking of Google. If Google starts to crawl these new URLs and identifies a percentage of them as low quality, it may preemptively withdraw or deprioritize resources from crawling the remaining URLs on the basis that they may be of similarly low quality.
The symptom of this is the appearance of two common Google Search Console index statuses:
- Crawled – currently not indexed
- Discovered – currently not indexed
Crawled – currently not indexed
When Google Search Console reports a URL as “Crawled – currently not indexed,” it indicates that Google’s crawler (Googlebot) has visited and crawled that specific page, but has chosen not to include it in the search index. This is more often than not down to:
- Content quality: The content might not meet Google’s quality guidelines. It could be seen as thin, duplicate, or lacking in value to users.
- Technical issues: There might be technical issues with the page that prevent it from being indexed, such as improper use of noindex directives or other signals that discourage indexing (a quick way to check for these is sketched after this list).
- Staleness: URLs can drop out of the index if freshness is an important and highly weighted factor in what Google perceives as quality for the search terms and user objectives the URL is targeting.
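For the technical issues point above, a quick diagnostic is to fetch the URL yourself and check for noindex signals in both the X-Robots-Tag HTTP header and the robots meta tag. This is a minimal sketch using the requests library and Python’s built-in HTML parser, not a reproduction of how Google processes pages:

```python
import requests
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content of any <meta name="robots"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())

def noindex_signals(url: str) -> dict:
    """Fetch a URL and report whether the HTTP header or the robots meta tag
    carries a noindex directive."""
    resp = requests.get(url, timeout=10)
    parser = RobotsMetaParser()
    parser.feed(resp.text)
    return {
        "status_code": resp.status_code,
        "header_noindex": "noindex" in resp.headers.get("X-Robots-Tag", "").lower(),
        "meta_noindex": any("noindex" in d for d in parser.directives),
    }

# Example usage with a hypothetical URL:
# print(noindex_signals("https://example.com/some-page"))
```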
Discovered – currently not indexed
This status indicates that Google is aware of the URL (it has been discovered, likely through sitemaps or links from other pages), but has not yet crawled or indexed the page. From experience, this is likely down to:
- Crawl budget constraints: If a site has a large number of pages, Google might prioritize which pages to crawl based on factors like site structure, page importance, or freshness. As a result, some discovered pages might wait longer to be crawled and indexed (a log-based way to see where Googlebot spends its crawl activity is sketched after this list).
- Low priority: Google may assess the priority of crawling certain pages over others based on various signals. If a page is deemed low priority, it may remain in the “discovered” state for some time. This can mean that the page itself has been processed and judged low priority, or that the URL path it sits on has been judged low priority.
- Temporary technical issues: Occasionally, temporary issues (such as server unavailability or errors) can delay the crawling process, leaving pages in the discovered but not crawled state.
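One practical way to see where Google is actually spending its crawl resources is to count Googlebot requests per site section in your server access logs. The sketch below assumes an Apache/Nginx “combined” log format; the regular expression and section grouping are simplifications to adapt to your own logs, and in production you would also verify Googlebot via reverse DNS since user agents can be spoofed:

```python
from collections import Counter
from urllib.parse import urlparse
import re

# Rough pattern for one line of an Apache/Nginx "combined" access log;
# adjust it to match your own log format.
LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[^"]+".*"(?P<agent>[^"]*)"$')

def googlebot_hits_by_section(log_lines):
    """Count Googlebot requests per top-level path section, a quick way to see
    where crawl activity is actually being spent."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = urlparse(match.group("path")).path
        section = "/" + (path.strip("/").split("/")[0] if path.strip("/") else "")
        counts[section] += 1
    return counts

# Example usage with a (hypothetical) access log file:
# with open("access.log") as f:
#     print(googlebot_hits_by_section(f).most_common(10))
```

Sections that absorb heavy Googlebot attention while URLs elsewhere sit in “Discovered – currently not indexed” can indicate that crawl resources are being drawn away by lower-value parts of the site.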
Wrapping up
The architecture and organization of your website play crucial roles in the efficiency and effectiveness of search engine crawling.
A well-structured site can greatly enhance the allocation of crawl resources, ensuring that valuable content is readily discovered, indexed, and ranked.
By comparison, a poorly organized site can squander these resources, leading to diminished online visibility.
Understanding the concept of crawl budget—or crawl resources—and the factors influencing it, such as site speed, content freshness and quality, and site authority, is critical for optimizing how Google discovers and values your content.