In the middle of a crowded high school hallway, Joey Donner appears before Bianca (who has been seriously crushing on him for about 15 movie-minutes of 10 Things I Hate About You), pulls out two nearly identical photos, and forces her to choose which she prefers.
Joey: [holding up headshots] “Which one do you like better?”
Bianca: “Umm, I think I like the white shirt better.”
Joey: “Yeah, it’s-it’s more…”
Bianca: “Pensive?”
Joey: “Damn, I was going for thoughtful.”
Like Bianca, search engines must make choices – black tee or white tee, to rank or not to rank (#ShakespearePuns!). And according to Introduction to Information Retrieval (chapter 19), “by some estimates, as many as 40% of the pages on the Web are duplicates of other pages” – which adds up to a tremendous amount of wasted storage and crawl overhead for very little return (#AintNoBotGotTimeForThat)!
On the surface, the solution is simple: don’t be Joey Donner, making search engine bots pick between two identical results. Dive deeper into Joey’s psychological state, though, and you’ll see he doesn’t know he’s being redundant. He doesn’t realize that he’s presenting the same photo, and a sticky situation, to Bianca. He is simply unaware. Similarly, duplicate content can sprout from a multitude of unexpected avenues, and webmasters must be vigilant to ensure that it doesn’t interfere with the user’s or the bot’s experience. We must be mindful and purposeful about not being another Joey Donner.
As outlined in Google Webmaster Guidelines, duplicate content is “substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
It’s important to understand the distinction: pages that appear virtually identical may not hurt a user’s experience of the site, but highly similar content will affect how a search engine bot evaluates and ranks those pages.
Duplicate content causes a few issues, primarily around rankings, because it sends confusing signals to the search engine: which version should be indexed and ranked, which version should receive consolidated link signals, and how much crawl budget should be spent re-crawling near-identical pages.
Duplicate content is often created unintentionally (most times we don’t aim to be Joey Donner).
Duplicate content commonly arises, unintentionally, from sources such as URL parameters and faceted navigation, session IDs, HTTP/HTTPS and www/non-www variants, trailing slashes, printer-friendly and paginated versions, syndicated or scraped copy, and staging environments. It is important to note that although all of these elements should be checked, they may not all be causing issues (prioritizing your top duplicate content challenges is vital).
When dealing with cross-domain duplicate content, there is an “auction” to determine the winner (making duplicate content within the SERPs, hypothetically, a winner-take-all situation). Gary Illyes, better known as the House Elf and Chief of Sunshine and Happiness at Google, mentioned that the auction occurs during indexation, before the content gets into the database, and that it’s fairly permanent (so once you’ve won, you supposedly keep an edge). This means that the first site to publish the content should, in theory, be considered the winner.
This, however, is not to say that duplicate content (whether on the same domain or across domains) will never rank. There are cases where Google determines that the original site is less suited to answer a query and selects a competing site to rank instead.
Rankings depend on the nature of the query, the content available on the web to answer that query, your site’s semantic relevance to the topic, and your authority within the space (i.e., duplicate content is more likely to rank for more specific queries, or for queries with little related content available).
So should duplicate content rank? Theoretically, no. If content doesn’t add value for users within the SERPs, it shouldn’t rank.
Focus on what’s best for the user. A grounding question: is this content answering your users’ questions in a way that is meaningful to your site’s overall experience?
If a site must have duplicate content (whether for political, legal, or time-constraint reasons) and it cannot be consolidated, signal to search engine bots how they should proceed using one of the following methods: canonical tags, noindex/nofollow meta robots tags, or a block within the robots.txt file.
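To make those signals concrete, here is a minimal Python sketch (assuming the third-party requests and beautifulsoup4 packages, plus placeholder example.com URLs) that reports which of the three signals a page is already sending:

```python
# Minimal sketch: report which duplicate-handling signals a page already sends.
# Assumes the third-party `requests` and `beautifulsoup4` packages; the URLs
# below are placeholders, not real pages.
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def duplicate_signals(url: str) -> dict:
    """Check for a canonical tag, a meta robots tag, and a robots.txt block."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # e.g. <link rel="canonical" href="https://www.example.com/preferred-page/">
    canonical = soup.find("link", rel="canonical")

    # e.g. <meta name="robots" content="noindex, nofollow">
    meta_robots = soup.find("meta", attrs={"name": "robots"})

    # e.g. a "Disallow: /duplicate-path/" rule in robots.txt
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()

    return {
        "canonical": canonical.get("href") if canonical else None,
        "meta_robots": meta_robots.get("content") if meta_robots else None,
        "blocked_by_robots_txt": not robots.can_fetch("*", url),
    }


print(duplicate_signals("https://www.example.com/some-page/"))
```

Whichever method you choose, keep the signals consistent: a page blocked in robots.txt can never be crawled, so bots will not see a canonical or noindex tag placed on it.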
It’s also important to note that duplicate content, in and of itself, does NOT merit a penalty (note: this does not extend to scraper sites, spam, spun content, or doorway pages), according to Google’s John Mueller in an October 2015 Google Hangout.
OnCrawl – I’d be remiss if I didn’t start with OnCrawl’s duplicate content visualization, because they’re the baddest in the biz (and by this I mean the best). One of my favorite aspects is how OnCrawl evaluates duplicate content against canonicals: if the content sits within a specific canonical cluster/group, the issue can typically be dismissed as resolved. Their reports also take this one step further and can show data segmented by subfolder, which helps to identify the specific areas with duplicate content issues.
Plagiarism Tools – Thank your high school teachers and college professors for the existence of some of the most useful tools for evaluating duplicate content. In trying to catch copy-and-paste students, they ended up with tools that apply just as well to online duplicate content, providing percentages of similarity between documents. A+!
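For a rough sense of what those similarity percentages measure, a back-of-the-napkin check takes only a few lines of Python (a sketch with made-up snippets of copy, not a substitute for a real plagiarism checker):

```python
# Back-of-the-napkin similarity percentage between two blocks of copy.
# The two strings are hypothetical product descriptions.
from difflib import SequenceMatcher

page_a = "Our widget ships free, arrives in two days, and includes a lifetime warranty."
page_b = "Our widget ships free, arrives in three days, and includes a one-year warranty."

similarity = SequenceMatcher(None, page_a, page_b).ratio()
print(f"Similarity: {similarity:.0%}")  # a value near 100% flags near-duplicate copy
```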
Google Searches – Leverage quotes and search operators to find potential duplicate content within your site and across the web; for example, search for an exact sentence in quotes, with or without the site: operator (e.g., site:example.com "an exact sentence from the page"). If Google can’t find it, then the issue can likely be dismissed.
Keyword Density Tools – When comparing content across pages, use density-checker visuals to identify the topicality of each page. If the topic of the page isn’t coming through in the densities, the writing should be reviewed for clarity.
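A quick, do-it-yourself approximation of a density report is sketched below (standard-library Python, with a placeholder stopword list and sample text):

```python
# Approximate a keyword density report: which terms dominate a page's copy?
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def top_terms(text: str, n: int = 5) -> list[tuple[str, float]]:
    """Return the n most frequent non-stopword terms and their density (%)."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    total = len(words) or 1
    return [(term, 100 * count / total) for term, count in counts.most_common(n)]


sample = "Duplicate content confuses crawlers. Duplicate content also dilutes ranking signals."
for term, density in top_terms(sample):
    print(f"{term}: {density:.1f}%")
```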
Google Search Console – Google Search Console offers countless useful tools to support webmasters. Chief among the duplicate content tools is Google’s URL Parameters report, which is designed to help Google crawl your site more efficiently by signaling how it should handle URL parameters.
TechnicalSEO.com’s Mobile-First Index Prep Tool – If you have a separate mobile configuration, this tool is a good place to start a mobile/desktop parity audit and identify any discrepancies.
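For a rough, roll-your-own parity spot check, the following sketch (again assuming the requests and beautifulsoup4 packages, with placeholder desktop and m-dot URLs) compares the visible copy of the two versions:

```python
# Rough mobile/desktop parity check: which words appear in only one version's copy?
# Assumes `requests` and `beautifulsoup4`; both URLs are placeholders.
import requests
from bs4 import BeautifulSoup


def visible_words(url: str) -> set[str]:
    """Fetch a page and return the set of words in its visible text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return set(soup.get_text(" ").lower().split())


desktop = visible_words("https://www.example.com/page/")
mobile = visible_words("https://m.example.com/page/")

print("Only on desktop:", sorted(desktop - mobile)[:20])
print("Only on mobile:", sorted(mobile - desktop)[:20])
```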
Solutions for duplicate content are highly contingent upon the cause; however, here are some tips and tricks. Duplicate content resolution requires a beautiful balance between technical SEO and content strategy.
Print out and label a top user journey to ground yourself in the example. Label each type of content the user could interact with along their journey and the estimated time duration of each step. Note that there may be additional steps and that the path may not always be linear. Add arrows and expand; the goal is to ground yourself in a basic example before diving into complex, involved mappings.
Print out and write out goals, types of content, common psychological traits, content location, and what it would take to push the user to the next stage in their journey. The idea is to identify when users are interacting with certain content, what’s going through their minds, and how to move them along in their journey.
Print out and map the types of content available, plotting them against the matrix’s vital binary points. Once you’ve mapped out all of your content, step back and see whether any areas are missing. Leverage this matrix to prioritize the most important content, whether by conversion potential or by need.
Start by mapping informational-to-transactional intent on the y-axis; map non-brand versus brand on the other axis, as relevant.
Start with users who are more proactive versus reactive on the y-axis, then transition to conversion potential on the other axis. For services, this may look like “Seeks Expert” versus “DIY”.
Google’s Documentation on Duplicate Content
Introduction to Information Retrieval (chapter 19) – Stanford’s introductory book on search engine theory. This chapter covers the theory behind how search engineers might resolve duplicate content issues, including concepts such as fingerprints and shingling (a toy sketch of shingling appears after this list). (pdf version, buy book on Amazon)
Duplicate Content Advice from Google – Hobo Web sifted through Google Hangout notes, Twitter comments, and Google documentation, painting a picture of Google’s position on duplicate content.
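To make the shingling concept referenced above a little more tangible, here is a toy Python illustration: break each document into overlapping k-word shingles and compare the sets with the Jaccard coefficient. (Production systems, as the chapter describes, hash shingles into fingerprints and compare compact sketches of those sets rather than the raw sets.)

```python
# Toy illustration of shingling: overlapping k-word windows compared with the
# Jaccard coefficient. The two documents are sample strings for demonstration.
def shingles(text: str, k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word shingles in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


doc_1 = "which one do you like better i think i like the white shirt better"
doc_2 = "which one do you like better i think i like the black shirt better"

print(f"Jaccard similarity: {jaccard(shingles(doc_1), shingles(doc_2)):.2f}")
```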