The feature discussed in this article is getting a new look! This article will be updated in June 2019 to reflect changes in OnCrawl.
Is Google flagging duplicate content on your site and not accepting the canonical URLs you’ve put in place? This can often happen when you’ve correctly dealt with only part of a set of pages that can be seen as duplicates.
With this walk-through of how OnCrawl processes and reports on pages that can be classified as duplicate content, we hope you’ll gain a new way of looking at how to manage duplicate content on your website. It’s quick to set up: all you need is a crawl. And it’s quick to analyze: you only need to be able to tell stoplight colors apart and judge the relative sizes of rectangles.
In addition to flagging on-page duplicate issues, such as reused titles, descriptions, and H1s, OnCrawl measures the level of similarity of all of the pages it crawls. OnCrawl uses the Simhash algorithm to establish page similarity, the same way Google does.
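To make the idea of a Simhash comparison more concrete, here is a minimal sketch in Python. It is not OnCrawl's or Google's implementation: the word-level tokenization, the MD5-based token hash, the 64-bit fingerprint size and the function names `simhash` and `similarity` are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit Simhash fingerprint of the text."""
    weights = [0] * BITS
    # Assumed tokenization: lowercase words. Real tools may use shingles.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int) -> float:
    """Share of identical bits between two fingerprints (0.0 to 1.0)."""
    hamming_distance = bin(a ^ b).count("1")
    return 1 - hamming_distance / BITS
```

Two pages with nearly identical text tend to produce fingerprints that differ in only a few bits, so their similarity score is close to 1.0.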
Pages that are somewhat similar are grouped together. We call this group a cluster of pages with duplicate content. Within a cluster, all pages are similar to one another. Here is a representation of one cluster with 3 pages in it:
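The grouping step can also be pictured in code. The sketch below is one simple way to build clusters in which every page is similar to every other page; the greedy approach, the `cluster_pages` name, the 0.8 threshold and the sample scores are assumptions for illustration, not a description of how OnCrawl does it.

```python
def cluster_pages(urls, similarity, threshold=0.8):
    """Greedily group URLs so that all pages in a cluster are similar.

    `similarity` is any function taking two URLs and returning a
    score between 0.0 and 1.0 (hypothetical input for this sketch).
    """
    clusters = []
    for url in urls:
        for cluster in clusters:
            # Join a cluster only if similar to every page already in it.
            if all(similarity(url, member) >= threshold for member in cluster):
                cluster.append(url)
                break
        else:
            clusters.append([url])
    # A page on its own is not a duplicate-content cluster.
    return [cluster for cluster in clusters if len(cluster) > 1]


# Made-up pairwise scores for four pages.
scores = {
    frozenset(("/a", "/b")): 0.95,
    frozenset(("/a", "/c")): 0.90,
    frozenset(("/b", "/c")): 0.92,
}


def lookup(u, v):
    return scores.get(frozenset((u, v)), 0.0)


clusters = cluster_pages(["/a", "/b", "/c", "/d"], lookup)
```

With these sample scores, `/a`, `/b` and `/c` end up in one cluster of 3 pages, and `/d` is left out because it is not similar to any of them.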
Then, OnCrawl presents all of the clusters in the same chart. The bigger the rectangle in this graphic, the more pages the cluster contains:
Finally, OnCrawl matches the cluster to your use of canonical URLs. Canonical declarations are one way of signaling to Google that you’re aware that content is similar, and which of the similar pages is intended to be the most authoritative version.
Each cluster is colored based on the analysis:
This lets you quickly judge whether your strategy for managing duplicate content is adequate or not.
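If you want to reason about the canonical matching outside of the chart, the check can be sketched as a small function. The status labels below are hypothetical and do not correspond to OnCrawl's color categories; the function name and the input format (a mapping from page URL to its declared canonical, or `None`) are also assumptions.

```python
def canonical_status(cluster_canonicals):
    """Summarize how consistently a cluster declares canonical URLs.

    `cluster_canonicals` maps each page URL in the cluster to the
    canonical URL it declares, or None if it declares nothing.
    """
    declared = [c for c in cluster_canonicals.values() if c]
    if not declared:
        return "no canonical declared"
    if len(declared) < len(cluster_canonicals):
        return "canonical missing on some pages"
    if len(set(declared)) == 1:
        return "all pages share one canonical"
    return "conflicting canonicals"


status = canonical_status({
    "/shoes?color=red": "/shoes",
    "/shoes?color=blue": "/shoes",
    "/shoes": "/shoes",
})
```

In this made-up example every page points to `/shoes`, so the cluster is reported as sharing one canonical.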
The slider at the top of the graphic helps you narrow down the view to only clusters of a specific size.
You can also use a slider to hide clusters whose average similarity falls outside the range you’re interested in. For example, you can concentrate only on clusters with a similarity of more than 80%.
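The same two filters are easy to reproduce on exported data. This is a minimal sketch, assuming each cluster is available as a list of URLs plus an average similarity between 0.0 and 1.0; the `Cluster` class and the `filter_clusters` parameters are hypothetical names, not part of any OnCrawl export format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    urls: List[str]
    avg_similarity: float  # 0.0 to 1.0


def filter_clusters(
    clusters: List[Cluster],
    min_size: int = 2,
    max_size: Optional[int] = None,
    similarity_above: Optional[float] = None,
) -> List[Cluster]:
    """Keep clusters within a size range and above a similarity level."""
    kept = []
    for cluster in clusters:
        size = len(cluster.urls)
        if size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        # Strict comparison: "more than 80%" means avg_similarity > 0.8.
        if similarity_above is not None and cluster.avg_similarity <= similarity_above:
            continue
        kept.append(cluster)
    return kept
```

For instance, `filter_clusters(clusters, similarity_above=0.8)` keeps only the clusters whose average similarity is more than 80%.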
Most sites will need to use a combination of the following three strategies to effectively manage duplicate content. Here are a few clues to spotting a well-implemented strategy.
Managing duplicate content by differentiating between pages: page content is modified so that the pages no longer appear similar.
Managing duplicate content by using canonical declarations: canonical URLs are declared for all similar pages and only the canonical URL is indexed.
Managing duplicate content by blocking duplicate pages from crawling and indexing: instructions to robots, particularly the meta robots noindex tag, are used to prevent indexing of duplicate pages.
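To check which of these signals a given page actually sends, you can read its canonical declaration and meta robots tag straight from the HTML. The sketch below uses only Python's standard `html.parser` module; the `IndexingSignals` class, the `indexing_signals` function and the sample markup are illustrative assumptions, and the sketch ignores signals sent through HTTP headers or robots.txt.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collect the canonical URL and noindex directive from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            directives = [d.strip() for d in content.split(",")]
            if "noindex" in directives:
                self.noindex = True


def indexing_signals(html):
    """Return (canonical_url_or_None, noindex_flag) for an HTML document."""
    parser = IndexingSignals()
    parser.feed(html)
    return parser.canonical, parser.noindex


page = (
    '<html><head>'
    '<link rel="canonical" href="https://example.com/shoes">'
    '<meta name="robots" content="noindex, follow">'
    '</head><body></body></html>'
)
canonical, noindex = indexing_signals(page)
```

A page that declares a canonical URL pointing elsewhere and also carries noindex, as in this made-up sample, is sending two different strategies at once, which is worth a closer look.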
If your strategy for managing duplicate content is not working, here’s one way of using the chart to find a place to start correcting it:
Duplicate pages are one of the core issues in SEO: pages that are too similar can compete against one another for rankings on the same query, or duplicate pages may be left out of the index in favor of a single authoritative version chosen by you (or, increasingly, by Google). You can prevent these issues, and increase the chances that Google accepts your canonical declarations, with the help of simple visual aids like this chart.
OnCrawl helps you to:
Not an OnCrawl user yet? It’s a perfect time to start your free trial, gain insights from real data from your website, and benefit from expert help from the Customer Success Managers at OnCrawl.