We’ve been pretty busy at Oncrawl for the past few weeks, and I am glad to tell you about something we are really proud of.
We have launched the first-ever semantic duplicate content detector dedicated to SEO. Look further than just duplicate titles and meta descriptions.
As Google has announced several updates, chances are that your site usage metrics and the quality of your content will have a large impact on your SEO. That’s why we adapted our state-of-the-art semantic analyzer to the SEO world. We know that duplicate content can harm your rankings, and today SEO professionals need a better solution than just checking whether their titles or meta descriptions are duplicated. Indeed, it’s easy to have a unique title for each of your pages, and SEOs have been pretty good at this. But it’s a different story when it comes to your core content. Are you sure you don’t have any large string or paragraph that overlaps the content of several pages? Not all of us can answer this question with confidence. So we built this brand new feature, available in the “Content Tab” of your Oncrawl account.
How can you manage to compare each of your web pages with every other one if you have tons of them? It’s impossible for a human. And that’s why we trained our bots! To help you understand how we compute your website data regarding duplicate content, I am happy to share some thoughts from our CTO, Tanguy Moal.
Hey Tanguy, you’ve been working on a duplicate content (DC) detector. Can you explain the scientific approach we set up to detect duplicate content?
We use standard technology to detect near duplicates. As a matter of fact, we took inspiration from Google’s simhash publication. We added some in-house ingredients to the recipe, such as taking 2-grams into account, so we could reduce the false-positive rate.
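To make the idea more concrete, here is a minimal sketch of a simhash fingerprint that mixes single words with word 2-grams. It is not Oncrawl's implementation: the tokenizer, the MD5-based feature hash, the equal feature weights and the function names (`features`, `simhash`, `hamming_distance`) are all assumptions chosen for illustration.

```python
import hashlib
import re


def features(text):
    """Return word unigrams plus word 2-grams (an illustrative feature set)."""
    words = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def simhash(text, bits=64):
    """Compute a simhash fingerprint of `text` as an integer of `bits` bits."""
    weights = [0] * bits
    for feature in features(text):
        # Assumption: MD5 is used only as a convenient, stable feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as near duplicates when the Hamming distance between their fingerprints falls under a chosen threshold. Because 2-grams encode word order, pages that merely share vocabulary share fewer features than pages that share actual phrases, which is one plausible reason the 2-grams help with false positives.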
What are the bad habits you have spotted to date?
Good question indeed. It depends on the analysed website, of course. For online retailers, having a page dedicated to each color or variation of a given product is clearly an issue. Very small pages, and pages where the template (header, footer and sidebars) accounts for more than 75% of the content, are also a bad signal. Even if Google clearly uses HTML pattern analysis to detect your core content, it can be an issue. We are working hard to provide some cool features on that part.
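The 75% template threshold can be illustrated with a rough sketch. It assumes each page has already been split into text blocks, and it treats any block that appears on most pages as template. The function names, the `min_page_fraction` parameter and the word-count measure are hypothetical choices for this example, not a description of how Oncrawl or Google does it.

```python
from collections import Counter


def template_share(pages, min_page_fraction=0.8):
    """Map each URL to the share of its words that sit in template blocks.

    `pages` maps a URL to its list of text blocks. A block counts as
    template when it appears on at least `min_page_fraction` of all pages
    (an assumption made for this sketch).
    """
    block_page_count = Counter()
    for blocks in pages.values():
        block_page_count.update(set(blocks))
    threshold = min_page_fraction * len(pages)

    shares = {}
    for url, blocks in pages.items():
        total_words = sum(len(block.split()) for block in blocks)
        template_words = sum(
            len(block.split())
            for block in blocks
            if block_page_count[block] >= threshold
        )
        shares[url] = template_words / total_words if total_words else 0.0
    return shares


def flag_template_heavy_pages(pages, max_template_share=0.75):
    """Return the URLs whose template share exceeds `max_template_share`."""
    shares = template_share(pages)
    return sorted(url for url, share in shares.items() if share > max_template_share)
```

With a crawl loaded into such a dictionary, `flag_template_heavy_pages(pages)` would list the pages whose unique content is too thin compared with their header, footer and sidebars.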
When Oncrawl spots clusters of duplicate pages, how should you prioritize the work, given that there can be plenty of them?
I think the first thing to check is whether your clusters have a canonical URL set up, and whether it points to the same URL for the whole cluster. Second, you should check whether pages within the clusters are set to noindex. Then I would focus on clusters that contain only 2 pages: this is often caused by a poorly implemented parameter in your URLs (pagination for instance, or a sort parameter, or a trailing “space”…), so correcting it should be easy. Finally, I would check the clusters containing some of your “money pages”, because if you have an issue on them, you are probably losing money every day.
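This checklist can be sketched as a small helper that runs the four checks on one cluster. The `Page` structure, its field names and the `checklist` function are hypothetical and only illustrate the order of the checks. They do not reflect Oncrawl's data model or export format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical record for one crawled page in a duplicate cluster."""
    url: str
    canonical: Optional[str] = None
    noindex: bool = False
    money_page: bool = False


def checklist(cluster):
    """Run the four checks on a cluster (a list of Page objects)."""
    canonicals = {page.canonical for page in cluster}
    return {
        # 1. Every page declares a canonical, and it is the same one.
        "canonical_ok": len(canonicals) == 1 and None not in canonicals,
        # 2. Pages in the cluster that are set to noindex.
        "noindex_urls": [page.url for page in cluster if page.noindex],
        # 3. Two-page clusters often point to a URL parameter problem.
        "is_pair": len(cluster) == 2,
        # 4. Clusters touching revenue-generating pages.
        "has_money_page": any(page.money_page for page in cluster),
    }
```

For example, a cluster made of `/shoes` and `/shoes?sort=price` that both declare `/shoes` as canonical would pass the first check and be reported as a pair, which hints at a sort parameter to clean up.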