We’ve been pretty busy @Oncrawl for the past few weeks, and I am glad to share something we are really proud of.
We have launched the first ever semantic duplicate content detector dedicated to SEO. It looks further than just duplicate titles and meta descriptions.
As Google has announced several updates, chances are that your site’s usage metrics and the quality of your content will have a significant impact on your SEO. That’s why we adapted our state-of-the-art semantic analyzer to the SEO world. We know that duplicate content can harm your rankings, and today SEO professionals need a better solution than just checking whether their titles or meta descriptions are duplicated. Indeed, it’s easy to have a unique title for each of your pages, and SEOs have been pretty good at this. But it’s slightly different when it comes to your core content. Are you sure you don’t have any large string or paragraph that overlaps the content of several pages? Obviously, not all of us can answer this question. So we built this brand new feature, available in the “Content Tab” of your Oncrawl account.
How can you compare each of your webpages to every other one when you have tons of them? It’s impossible for a human, and that’s why we trained our bots! To help you understand how we compute your website’s duplicate content data, I am happy to share some thoughts from our CTO, Tanguy Moal.
Hey Tanguy, you’ve been working on a DC detector, can you explain the scientific approach we set up to detect duplicate content?
We use standard technology to detect near duplicates. As a matter of fact, we took inspiration from Google’s simhash [1] publication. We added some in-house ingredients to the recipe, such as taking 2-grams into account, so we could reduce false positive rates.
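To illustrate the approach Tanguy describes, here is a minimal simhash sketch over word 2-grams. The shingle size default and the MD5-based hashing are assumptions made for this example, not Oncrawl’s actual implementation:

```python
import hashlib

def simhash(text, n=2, bits=64):
    """64-bit SimHash fingerprint computed over word n-grams (shingles)."""
    tokens = text.lower().split()
    shingles = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    votes = [0] * bits
    for shingle in shingles:
        # Hash each shingle and let its bits vote on the final fingerprint.
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits; near duplicates have small distances."""
    return bin(a ^ b).count("1")
```

Two pages are then flagged as near duplicates when the Hamming distance between their fingerprints falls below a threshold; hashing 2-grams instead of single words makes accidental shingle matches, and therefore false positives, rarer.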
What are the bad habits you have spotted to date?
Good question indeed. It depends on the analysed website, of course. For online retailers, having a page dedicated to each color or variation of a given product is clearly an issue. Very small pages, and pages where the template (header, footer and sidebars) accounts for more than 75% of the content, are also a bad signal. Even if Google clearly uses HTML pattern analysis to detect your core content, it can be an issue. We are working hard to provide some cool features on that part.
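A quick way to sanity-check the 75% figure Tanguy mentions is to compare the amount of template text to the full page text. This is a simplified sketch under a big assumption: the template text is taken as given here, whereas isolating it (e.g. from blocks repeated across pages) is the hard part in practice.

```python
def template_ratio(page_text, template_text):
    """Share of a page's text that comes from the template
    (header, footer, sidebars). Assumes the template text has
    already been extracted from blocks repeated across pages."""
    if not page_text:
        return 1.0
    return min(len(template_text) / len(page_text), 1.0)

def is_thin_page(page_text, template_text, threshold=0.75):
    """Flag pages where the template dominates the content."""
    return template_ratio(page_text, template_text) > threshold
```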
When Oncrawl spots clusters of pages with duplicates, how should you prioritize the work, since there can be plenty of them?
I think the main thing to check is whether your clusters have a canonical URL set up, and whether it points to the same URL for the whole cluster. Second, you should check if any pages within the cluster are set to noindex. Then I would focus on clusters that contain only 2 pages: it’s often an issue caused by a URL parameter that is not well implemented (pagination, for instance, or a sort parameter, or a trailing “space”…), so correcting it should be easy. Finally, I would check the clusters containing some of your “money pages”, because if you have an issue on them, you are probably losing money every day.
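Tanguy’s checklist can be turned into a simple triage routine. The page fields below (`canonical`, `noindex`, `money_page`) are hypothetical names chosen for this sketch, not Oncrawl’s actual data model:

```python
def triage_cluster(pages):
    """Summarize a duplicate cluster for prioritization.
    `pages` is a list of dicts with hypothetical keys:
    'url', 'canonical', 'noindex' (bool), 'money_page' (bool)."""
    canonicals = {p.get("canonical") for p in pages}
    return {
        # One canonical URL shared by the whole cluster?
        "consistent_canonical": len(canonicals) == 1 and None not in canonicals,
        # Are some of the duplicates already excluded via noindex?
        "has_noindex": any(p.get("noindex") for p in pages),
        # 2-page clusters often come from a misimplemented URL parameter.
        "likely_url_parameter": len(pages) == 2,
        # Clusters touching money pages should be fixed first.
        "contains_money_page": any(p.get("money_page") for p in pages),
    }
```

Clusters flagged as `contains_money_page` or lacking a consistent canonical would go to the top of the to-do list under this scheme.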