Oncrawl R&D: advanced analysis for unique vs duplicate content

March 27, 2019 - 5 min reading time - by Rebecca Berbel

Oncrawl is excited to present our duplicate content lab. Our R&D team is working on a new way of looking at duplicate and unique content on your site that will give you a more accurate way of approaching your editorial strategy.

Why focus on unique vs duplicate content?

Content is still one of the three most important ranking factors, and Google encourages websites to deliver insightful, unique and descriptive content to their visitors to offer the best user experience.

But not all content on a page carries the same weight. Google has always been fairly good, and is getting even better, at separating boilerplate (structural content such as your header, footer, navigational menus, and other repeating content) from the meat of the page.

In short, Google generally ignores the text of your template and ranks only your main content. This is why, instead of examining page word count, Oncrawl’s new experimental lab breaks content down into blocks.

Our data: what are content blocks?

Once we are done crawling your website, each web page is split into smaller blocks of text. A block of content is a set of words that occur together within a single HTML node, such as anchor text, a paragraph, or an item in a bulleted list.
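As a rough illustration of what splitting a page into blocks can look like, here is a minimal Python sketch built on the standard library's HTML parser. It is not Oncrawl's implementation: the `BLOCK_TAGS` set, the `BlockExtractor` class and the `extract_blocks` helper are hypothetical names chosen for this example, and the choice of which tags count as a block is an assumption.

```python
from html.parser import HTMLParser

# Assumption: the HTML nodes we treat as one block each.
BLOCK_TAGS = {"p", "li", "a", "h1", "h2", "h3", "td"}


class BlockExtractor(HTMLParser):
    """Collects the text found directly inside each block node."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []  # open block tags, each with its own text buffer

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append((tag, []))

    def handle_data(self, data):
        # Text goes to the innermost open block, if there is one.
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, parts = self._stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:
                self.blocks.append(text)


def extract_blocks(html):
    """Return the list of text blocks found in an HTML string."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = "<ul><li>First item</li><li>Second   item</li></ul><p>Some paragraph text.</p>"
print(extract_blocks(sample))
```

In this sketch, anchor text nested inside a paragraph becomes its own block, and the surrounding paragraph keeps only its remaining words. A real crawler would also need to handle malformed markup and rendered content.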

For each block, we calculate a uniqueness quotient and an occurrence ratio across the whole of your website. We continue to rely on algorithms that Google is known to use, notably Simhash, which allows us to compute degrees of similarity between blocks.
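To make the idea of a degree of similarity more concrete, here is a minimal Simhash sketch in Python. It illustrates the general technique only, under several assumptions that do not come from Oncrawl: tokens are lowercase whitespace-separated words, each token is hashed with MD5, every token carries the same weight, and the fingerprint is 64 bits wide. How the uniqueness quotient is derived from such fingerprints is not specified in this article, so it is not shown.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a Simhash fingerprint over lowercase word tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Share of matching bits between two fingerprints, from 0.0 to 1.0."""
    return 1 - hamming_distance(a, b) / bits


first = simhash("Free delivery on all orders over 50 euros")
second = simhash("Free delivery on all orders over 80 euros")
print(similarity(first, second))
```

Texts that share most of their words tend to produce fingerprints that differ in only a few bits, which is what makes a graded similarity score possible instead of a strict identical-or-not comparison.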

Using blocks of content, we can then identify the main content on a page: the content that is the least duplicated. This helps Oncrawl provide answers to the following questions:

  • How much text on my site is unique page content?
  • How much is boilerplate?
  • If we exclude boilerplate and template text, is my content too thin?
  • Which pages would most benefit from copywriting efforts?
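The idea that main content is the least duplicated content can be sketched in a few lines of Python. This is a simplified illustration, not Oncrawl's method: it assumes blocks are compared by exact text match (rather than by similarity), that the site is available as a dictionary of URL to list of blocks, and that the occurrence ratio is the fraction of pages on which a block appears. The names `occurrence_ratios` and `main_content` are hypothetical.

```python
from collections import Counter


def occurrence_ratios(pages):
    """Map each block text to the fraction of pages it appears on."""
    counts = Counter()
    for blocks in pages.values():
        counts.update(set(blocks))  # count a block once per page
    total = len(pages)
    return {block: n / total for block, n in counts.items()}


def main_content(blocks, ratios):
    """Return the blocks of one page that are the least duplicated site-wide."""
    if not blocks:
        return []
    lowest = min(ratios[b] for b in blocks)
    return [b for b in blocks if ratios[b] == lowest]


site = {
    "/apples": ["Home Products Contact", "All about apples"],
    "/pears": ["Home Products Contact", "All about pears"],
    "/apples-2": ["Home Products Contact", "All about apples", "Apple recipes"],
}
ratios = occurrence_ratios(site)
print(main_content(site["/apples-2"], ratios))
```

On this toy site, the menu text appears on every page and is never selected, while the block that appears on a single page is picked as that page's main content.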

Our data: charts and data on content blocks

Because content blocks allow us to focus on unique content only, you can now look at a page’s uniqueness in relation to other pages on your website, and find the pages that contain too little unique content.

Data Explorer

In the Data Explorer, you can now examine the number of words and percent of words on the page per type of block:

  • Unique blocks
  • Blocks occurring on up to 25% of pages on the site
  • Blocks occurring on 25% – 50% of pages on the site
  • Blocks occurring on 50% – 75% of pages on the site
  • Blocks occurring on over 75% of pages on the site
  • Blocks occurring on all pages on the site

These metrics are also available for segmenting your pages.
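The block types listed above can be pictured as buckets based on how many pages a block occurs on. The Python sketch below shows one way to assign a block to a bucket and to total the words per bucket for a page. The function names, the bucket labels and the handling of exact boundaries (for example, whether a block on exactly 50% of pages falls in the lower or upper range) are assumptions made for this example.

```python
def bucket(page_count, total_pages):
    """Name the bucket for a block found on page_count of total_pages pages."""
    if page_count == 1:
        return "unique"
    if page_count == total_pages:
        return "all pages"
    ratio = page_count / total_pages
    if ratio <= 0.25:
        return "up to 25%"
    if ratio <= 0.50:
        return "25-50%"
    if ratio <= 0.75:
        return "50-75%"
    return "over 75%"


def words_per_bucket(blocks, page_counts, total_pages):
    """Return {bucket: (word count, percent of page words)} for one page."""
    totals = {}
    for text in blocks:
        name = bucket(page_counts[text], total_pages)
        totals[name] = totals.get(name, 0) + len(text.split())
    page_words = sum(totals.values())
    return {
        name: (words, round(100 * words / page_words, 1))
        for name, words in totals.items()
    }


# Hypothetical data: on how many of the site's 8 pages each block occurs.
page_counts = {
    "Home Products Contact": 8,
    "Free shipping on all orders": 4,
    "Hand-made oak table with drawers": 1,
}
result = words_per_bucket(list(page_counts), page_counts, total_pages=8)
print(result)
```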

Crawl report metrics

In the crawl report, a new dashboard is available in the sidebar: Text block analysis. The charts in this dashboard give you an overview of how your site’s content breaks down by uniqueness quotient.

These charts can also be used in custom dashboards.

Which pages still have thin content once we remove templates and boilerplate content? Check the number of pages with under 300 words in unique blocks, regardless of the total number of words on the page. These pages have very little main content to offer, even when their total word count is more than 1200 words:
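If you export per-page word counts, the thin-content check described here comes down to a simple filter. The sketch below assumes a hypothetical data shape (a dictionary with `unique_words` and `total_words` per URL) that is not an Oncrawl export format; the 300 and 1200 word thresholds are the ones mentioned in the text.

```python
def thin_pages(stats, unique_threshold=300):
    """URLs with fewer than unique_threshold words in unique blocks."""
    return sorted(
        url for url, s in stats.items()
        if s["unique_words"] < unique_threshold
    )


def thin_but_long(stats, unique_threshold=300, total_threshold=1200):
    """Thin pages whose total word count still exceeds total_threshold."""
    return sorted(
        url for url, s in stats.items()
        if s["unique_words"] < unique_threshold
        and s["total_words"] > total_threshold
    )


# Hypothetical per-page word counts.
stats = {
    "/category/tables": {"unique_words": 120, "total_words": 1500},
    "/blog/oak-care": {"unique_words": 450, "total_words": 600},
    "/contact": {"unique_words": 80, "total_words": 400},
}
print(thin_pages(stats))
print(thin_but_long(stats))
```

The second function isolates the pages that look substantial by total word count but offer little main content.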

Compare word count in unique blocks to overall page word count. Some pages with a low word count may still contain significantly more unique content than much longer pages, such as the pages in the first column on this site:

Evaluate the uniqueness per page by examining the portions of words per page that are found in each type of block. This helps answer questions such as:

  • On average across the site, how much of a page is boilerplate (orange and red)?
  • On average across the site, how much page content is duplicated (greens)?

Understand how many words are unique per page, and how that distribution plays out across other pages. This provides answers to questions such as:

  • How many pages have unique or nearly unique content?
  • How many pages contain more than 1200 words of unique or nearly unique content?
  • On how many pages does boilerplate or template text account for more than 30% of the page text?
  • How many edge cases (pages with over half of their content in very similar blocks, or pages with over half of their content in highly unique blocks) exist on the website?
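Questions like these reduce to computing, for each page, the share of its words that falls in a given group of block types, then counting the pages above a threshold. The sketch below is one hedged way to do that in Python. It assumes that "boilerplate" means blocks occurring on over 75% of pages or on all pages, which is an interpretation for this example and not a definition given by Oncrawl; the bucket labels and function names are hypothetical as well.

```python
# Assumption: which buckets count as boilerplate.
BOILERPLATE = ("over 75%", "all pages")


def share(words_by_bucket, buckets):
    """Fraction of a page's words that fall in the given buckets."""
    total = sum(words_by_bucket.values())
    if total == 0:
        return 0.0
    return sum(words_by_bucket.get(b, 0) for b in buckets) / total


def pages_over(site, buckets, threshold):
    """URLs where the given buckets exceed threshold as a share of page text."""
    return sorted(
        url for url, words in site.items()
        if share(words, buckets) > threshold
    )


# Hypothetical word counts per bucket for three pages.
site_words = {
    "/home": {"unique": 100, "all pages": 100},
    "/blog/post": {"unique": 900, "all pages": 100},
    "/contact": {"unique": 20, "over 75%": 30, "all pages": 50},
}
print(pages_over(site_words, BOILERPLATE, 0.30))  # boilerplate above 30%
print(pages_over(site_words, ("unique",), 0.50))  # mostly unique pages
```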

And analyze uniqueness by depth and by page group:

Our data: what is Oncrawl’s content overlay?

This new analysis comes with a visual overlay for each page crawled by Oncrawl.

The content overlay illustrates your content’s uniqueness by highlighting each block of HTML content on your web page using a color corresponding to its uniqueness.

Oncrawl uses the source code viewed by the bot at the time of the crawl, and overlays the uniqueness analysis for each block on the HTML source.

By hovering over a content block, you can view information such as:

  • The full text in the content block
  • The exact frequency of the content across the site
  • The number of times the block is used as anchor text for a link

This analysis can reveal sections of pages where content is copied and pasted, or where editorial policy has used copywriting templates without developing them. Conversely, it can also show how pages with little content manage to include originality without increasing their word count.

Building a content copywriting strategy based on page uniqueness

Go beyond word count when looking into content quality.

Oncrawl’s experimental new metrics are designed to allow deep analysis of editorial strategy:

  • Do you reuse similar content for pages aimed at different user search intent? Is this content sufficiently adapted to the difference in intent?
  • Do pages on your site require large quantities of unique content to rank and perform well, or can short, unique pages achieve the same results?
  • Does repeated content (menus, footers, boilerplate text, disclaimers…) overshadow your main content?
  • Have you used a copywriting template without sufficiently adapting it for individual pages in a group of pages with high similarity, such as for agency/office locations?

Our R&D aims to allow you to explore your content in depth and from a new angle. We hope you will enjoy playing with this new data and that it will help you take your editorial strategy to the next level.

Rebecca Berbel
Rebecca is the Product Marketing Manager at Oncrawl. Fascinated by NLP and machine models of language in particular, and by systems and how they work in general, Rebecca is never at a loss for technical SEO subjects to get excited about. She believes in evangelizing tech and using data to understand website performance on search engines. She regularly writes articles for the Oncrawl blog.