
SEO Rewards and Risks – Serving internal search results (or auto-generated content) to search crawlers

May 9, 2018 - 14 min reading time - by Matt Tutt

Ask an SEO what they’d recommend if a client’s website had an internal search function and 9 times out of 10 they’d tell you to “noindex” the search results page. A noindex tag, when placed in the head of a page and respected by search engines, will prevent those internal search pages from being returned as results within Google or any other search engine.
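For reference, the tag itself is a single line placed in the head of the search results template (a minimal sketch – where exactly it lives will depend on your platform):

<!-- Tell crawlers not to index this internal search results page -->
<meta name="robots" content="noindex, follow">

The “follow” value still lets crawlers follow the links on the page, so link equity can continue to flow even though the page itself stays out of the index.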
My own opinion was (and still is) the same – it should *normally* be noindexed. But as with much of SEO, there are nuances and situations where this advice isn’t so black and white, as much as we’d often like it to be.
Up until a few years ago, the Google Webmaster Guidelines stated that search results pages should indeed be blocked to prevent crawlers from accessing them, but this advice appears to have been removed from the most recent version of those guidelines.


Screenshot of the old Google Webmaster Guidelines where they mentioned the use of robots.txt to prevent crawling of search results pages (taken from the WayBack Machine in 2007).
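As a sketch, the rule Google used to recommend would look something like this in robots.txt – assuming, purely for illustration, that your search results live under a /search path or use a ?q= parameter:

User-agent: *
# Block crawling of internal search results (path and parameter are examples)
Disallow: /search
Disallow: /*?q=

It’s worth remembering that robots.txt prevents crawling rather than indexing – a blocked URL can still appear in results if enough external links point at it, which is partly why the noindex tag is the more common recommendation today.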


Many SEOs still believe this is the case, but from my observations in the wild, there are many sites that ignore that advice and happily allow crawlers to access these search pages.
I wanted to look at the possible dangers of allowing this content to be indexed – what it means for your website, and how in fact it can be used to your advantage.

Before I dive into the article, you might like to hear my question being put to John Mueller at Google during his excellent weekly Webmaster Hangouts.

Video of John Mueller of Google reading/responding to my question about serving internal search results to Google

What do other sites do about serving internal search results?

I’ve been researching the area of noindexing internal search for a little while now, more or less since the time that Giphy.com was supposedly penalised for this exact thing. For those who didn’t hear about it, the majority of their organic traffic was going to pages which were actually internal search results.

They famously bragged about “owning” the Happy Birthday SERP in an interview, only to be penalised almost overnight. They’ve since noindexed their search results and started to recover – although it looks like they may have been affected by some more recent Google updates.


Screenshot of Giphy.com’s initial organic drop, followed by a slow rebound, followed by a more recent drop (data taken from SEMRush)


One of the other main reasons I’ve been drawn to this topic recently arose from my own web browsing – I’d noticed many big sites on the web not noindexing their search results, and obviously doing very well out of it judging by Google’s SERPs. See the example of Wayfair.com, the massively popular online furniture retailer.

Organic rankings for Wayfair.com showing only the results which are based on their internal search result pages (data taken from SEMRush)

It seems to be more prevalent in some industries than others, but serving internal search results to search engines seems more common than not. I personally think it’s often because many sites don’t realise they’re doing it, or when they do realise, they’re scared to stop – because it’s bringing in plenty of traffic.
But I’d guess that in the case of Wayfair.com they’ve seen the upside of doing it, so until it stops working (or they’re penalised) they’ll keep at it. As long as they’re aware of it and monitoring it regularly, you could argue that’s fair enough.

Take a look at some more examples below showing sites that don’t noindex their own search results or their user-generated content.

FirstCry.com
FirstCry.com is an online shop for new mums based in India. They have 180,000 pages indexed in Google – which isn’t surprising when a huge number of these are based on internal search queries (wondering if, here, a baby fell on the laptop when mum or dad wasn’t looking?…)


As well as ecommerce sites not making enough use of the essential noindex tag, PeoplePerHour.com, the popular UK-based freelancer site, also has huge issues of its own. Much of its freelance job search function creates its own content, which is happily served up for search engines to consume, whilst there are also issues with the freelancers themselves and the way in which their content is getting indexed.

Freelance erectile dysfunction probably isn’t the kind of quality content you want to be serving people – or search engines, as PeoplePerHour.com are currently doing

With the above “freelance search” function generating an enormous amount of poor quality content, they have a big job on their hands cleaning up the content which is already indexed in Google.

It’s hard to trigger these searches “naturally” (which is lucky for PPH), but again the site:search example above shows that they really should control their auto-generated site content and prevent huge amounts of it ever being indexed

Looking for someone to rip you off? PPH has you covered! (sorry to any of those users shown above!) This additional dynamic search function (which isn’t noindexed) creates a lot of unnecessary indexable content

I wrote in a bit more depth about the SEO issues with PeoplePerHour – the main TL;DR being that, as well as the issues above, they also link out to a huge number of internal search results via their extensive XML sitemaps. So in my eyes they’re 100% responsible for Google indexing all this crap content (I did try to reach out to them a few times to make them aware of this but had no reply, so I consider them fair game).

The canonical tag of the above page points to the URL auto-generated by the search, and the page isn’t noindexed either – which leads to all kinds of content getting indexed
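As a sketch of the problem (URLs hypothetical), the search template effectively echoes its own auto-generated URL back as the canonical, telling Google that every junk search is a unique, indexable page:

<!-- What the search template is effectively doing (self-referencing canonical): -->
<link rel="canonical" href="https://www.example.com/freelance/some-junk-query">

<!-- Safer: noindex the search template instead -->
<meta name="robots" content="noindex, follow">

A self-referencing canonical on a dynamically generated page gives crawlers no reason to treat it as a duplicate of anything, so each new query becomes a fresh candidate for the index.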


Dangers with indexed internal search results or user-generated content

So, aside from the obvious points above, there are a number of reasons why this is usually bad practice. First of all, it’s very likely that your site’s internal search results pages aren’t the optimal pages you’d want returned within a search engine.

Search pages likely have the query still visible, plus any products that match the search query, but there’s probably not much else going on on that particular page – lots of repeated content (header, footer), otherwise known as boilerplate content.

It’s likely that you have far better pages which you’d rather show the user – and Google. Wayfair.com is again an example of a site with this particular problem – check out the below, including their dynamic meta title/description tags!

Admittedly it’s hard to trigger this SERP in the wild without the above search query, but it’s a good example of how internal search doesn’t always work (especially in conjunction with auto-generating meta tags!)


If you were brave enough to click through the above SERP you’d be greeted by the following page – which (IMO) isn’t the kind of content you’d want users or search engines to discover as a home furniture provider…

You can see the use of dynamic keyword insertion in the heading tag (H1), as well as within the description text. This kind of dynamic insertion is quite old-school, but appears to be generating good organic traffic for Wayfair.com

Another risk of allowing indexable search content is that if you’ve got an ecommerce site with lots of products listed, the search results pages may return an enormous volume of results (especially when taking into account product filtering options, which create additional URLs) – or the search page could even return an infinite number of results, creating near-endless loops.

To anyone familiar with crawl budgets, you’ll know that this causes big SEO headaches. Control Googlebot by noindexing the search results pages, to save it wasting its crawls and burning up your web host’s valuable resources. You can use an SEO crawler like Oncrawl to find and diagnose these crawl issues simply by running a full crawl of all areas of your site.

Part of the summary of a site analysis using the cloud-based crawler Oncrawl, showing a site which has largely noindexed its content due to concerns about its lack of quality – better safe than sorry!
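If your search pages sit under a distinct path, you can also apply noindex at the web server level via the X-Robots-Tag HTTP header, rather than editing templates. A minimal nginx sketch, with the /search path being an assumption on my part:

location /search {
    # Every internal search results page gets a noindex header
    add_header X-Robots-Tag "noindex, follow";
    # ...your existing proxy/app configuration for this path...
}

Google treats the X-Robots-Tag header the same way as the equivalent meta robots tag, which makes it handy for pages (or file types) where editing the HTML head isn’t practical.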


Hacking SERPs through internal search

Last of all, and perhaps the biggest reason for avoiding indexable search pages, is the possibility of Google indexing and returning what I like to call “bad SERPs”. These are results which may reflect badly upon your (or your client’s) sites, similar to the Wayfair.com example earlier.

This situation isn’t as common, but it can happen – and it’s a risk whenever your internal search creates indexable dynamic pages based upon a user’s search. Enter a search query like “Matt loves using Oncrawl” and it’ll generate an end URL like “domain.com/?query=matt-loves-using-oncrawl”, which has the potential to get picked up and indexed.
This happened to Spotify recently, and I was relatively surprised that it went largely unnoticed within the SEO community. Google was returning a featured snippet for “Spotify phone number” type searches, where people are obviously looking for a contact number for Spotify (who don’t have one set up).

So – how can Google return a featured snippet showing the contact number for Spotify if no such info exists? Simple – spammers had searched this query within the Spotify website, and had then aimed a number of links at this internal search page (that’s my guess, anyway).

Google somehow decided that this page was likely a good match for users’ search queries and so started to return that specific URL as a featured snippet. The phone number itself was probably taking people to a premium-rate line rather than to anyone at Spotify HQ.

It’s quite rare for internal searches to create their own pages dynamically in this fashion, but if they do, it’s really critical to control those types of pages – they can create a huge amount of unwanted content, diluting overall website quality and wasting crawl budget, as mentioned earlier. Alternatively, it can lead to a bit of a PR nightmare, as in the Spotify situation above!

Improving your internal search strategy

For those sites that do rely on serving many search results to their users – perhaps large ecommerce sites where the search function is used heavily, a photo-sharing platform, or a job platform where jobs update on a very regular basis – thankfully there are some things which can be done to improve the search experience, both for people and for search crawlers.
One example of a site that’s doing a good job of utilising its search function is Airbnb. They do a number of great things to control the use of search on the site – where the search function is a huge part of their booking process.

Airbnb internal search

It’s really difficult to search for something on Airbnb which doesn’t exist, as it auto-prompts you to select a location as you type – as you can see when I search for “Bournemouth”. They’re speeding up the search process whilst leading you to where they think you want to go. This form of “search gatekeeping” improves the UI and gives a better user experience – both areas becoming ever more essential to a good overall SEO strategy.

If I search Google for “Airbnb Bournemouth”, you can clearly see the top two organic results returned are from Airbnb.co.uk and Airbnb.com (side note – using ccTLDs is a great tactic for taking more SERP coverage as a worldwide brand), and they point to what looks like an internal search results page for Bournemouth…

Searching for Airbnb Bournemouth – note the “Top 20 Places to stay…” and the use of the current year (2018) within the title tag to improve CTR by increasing its relevance

But as you’ll see, while it may well appear to be a typical internal search result, it’s actually a bit better than your typical auto-generated content, as highlighted with Wayfair.com earlier.

The landing page for the previous search query – well optimised for this particular search query, and with plenty of relevant listings in Bournemouth (over 20 – the title tag shown earlier was just clickbait!)

I didn’t have the patience to sift through the 100+ XML sitemaps referenced in the sitemap index I found when checking out Airbnb’s expectedly trendy robots.txt file, but I’m pretty confident that if you do, you’ll find some pointing to these internal search pages. Linking to them from an XML sitemap is a clean, reliable and quick way for a massive site like Airbnb to get them picked up and indexed by Google and others.
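For illustration, a sitemap entry of the kind that would push one of these internal search/landing pages into the index looks something like this (the URL here is a hypothetical example, not taken from Airbnb’s actual sitemaps):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- A curated internal search/landing page submitted for indexing -->
  <url>
    <loc>https://www.airbnb.co.uk/s/Bournemouth--United-Kingdom</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>

Submitting a page via a sitemap doesn’t guarantee indexing, but it’s a strong hint – which is exactly why it works so well for pages crawlers might otherwise never reach.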

Airbnb’s trendy robots.txt file, as you’d expect from a company like theirs…

Serving internal search results as a Jobs site

Another site which does a great job of dealing with internal search results is TotalJobs.com. Not only were they ranking #1 in Google when I randomly searched for “SEO Jobs”, but I can see that they’re making tactical use of the search function and the noindex tag.
For example, whilst this search results page for “SEO Jobs” isn’t noindexed – as you’d expect, since it’s a key job topic – a random search I made for nonsense terms which shouldn’t be indexed does indeed have a noindex tag added. A+ to the TotalJobs SEO team!

Good job TotalJobs.com – intelligent use of the noindex tag based on a low (hopefully 0!) search volume job query I’d entered as a test…
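A minimal sketch of the sort of logic this implies, written here as a hypothetical Python/Flask view – the framework, route, threshold and helper function are all assumptions on my part, not TotalJobs’ actual implementation:

from flask import Flask, render_template, request

app = Flask(__name__)
MIN_RESULTS_TO_INDEX = 5  # assumed threshold; tune to your own inventory

def search_jobs(query):
    # Placeholder for your real search backend
    return []

@app.route("/jobs/search")
def job_search():
    query = request.args.get("q", "")
    results = search_jobs(query)
    # Index only searches with enough genuine results; noindex the long tail
    allow_indexing = len(results) >= MIN_RESULTS_TO_INDEX
    return render_template("search.html", query=query,
                           results=results, allow_indexing=allow_indexing)

The search.html template would then emit <meta name="robots" content="noindex, follow"> whenever allow_indexing is false, keeping nonsense and zero-result queries out of the index while leaving valuable head terms like “SEO Jobs” indexable.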

Serving crawlers with the right internal search content

Other things which can be done to help search crawlers find and index the internal search pages that matter to your website include:

Pagination – applying the correct markup (rel="prev", rel="next", etc.) to indicate that content spans more than one page, or exists as part of a series. This markup isn’t visible to an end user, but it does help search engines understand how content is structured, and teaches them more about the site architecture and content context (see the markup sketch after this list).

Breadcrumbs – these can really aid site navigation for users whilst also helping search crawlers understand the priority sections of a site. They often appear below the header content like so:

Home > Category name > Sub-category name > Product.

Categorisation – breaking up key sections of the site by category. This form of taxonomy helps users navigate the site as well as aiding search crawlers. Breaking up your content by category is a vital part of your internal site architecture.

Faceted navigation – this is a tough nut to crack, but ultimately (if used correctly) it can help search crawlers and users to navigate content (usually products) listed on your site by their specific attributes.

For example, on a mobile phone ecommerce store you may include Android as a product attribute, along with price, screen size, and so on. These can all create a number of URL variations which need careful consideration as part of your overall site structure – bearing in mind crawl budget and duplicate content concerns.

You ideally want to consolidate URLs around specific products and categories, instead of creating new URLs for every combination of attributes.
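To make the pagination and breadcrumb points above concrete, here’s a minimal markup sketch for page 2 of a paginated category listing – all URLs and names are hypothetical:

<!-- Pagination markup in the head of page 2 of a category listing -->
<link rel="prev" href="https://www.example.com/phones?page=1">
<link rel="next" href="https://www.example.com/phones?page=3">

<!-- Breadcrumb trail in the page body -->
<nav class="breadcrumbs">
  <a href="/">Home</a> &gt;
  <a href="/phones">Phones</a> &gt;
  <span>Android Phones</span>
</nav>

The rel="prev"/"next" tags tell crawlers that the paginated URLs form a series rather than a set of near-duplicates, while the breadcrumb trail reinforces the category hierarchy on every page that carries it.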

Argos.co.uk big internal structure issues
Argos.co.uk is an example of a site with big internal structure issues – it lacks a clear site structure, and improving its categorisation and faceted navigation would likely yield vast SEO improvements.

Does Google like serving internal search results in 2018?

I believe that with some of the bigger recent search algorithm updates, Google has made more use of advanced machine learning at huge scale, which can easily understand the way search functionality works on most sites – leading to a rise in the number of internal search pages returned as Google results.
In a nutshell, Google now seems happy to return internal search results within their own product, but on the premise that if they see low click dwell time, or lots of “pogo-ing” back to the search results, they will show that particular page fewer times within their results. This is my own theory at least, based upon what I’ve noticed within the SERPs over the past few months.

The above hypothesis makes sense, really – if the results returned by Google match the user’s query and satisfy their search intent, then everyone’s happy.
Because of the confusion amongst the SEO community, I felt the matter of serving internal search results (and the recent positive rankings obtained by sites that make use of them) deserved more research. This post has only really scratched the surface of the subject – there’s plenty more to be covered here!

More fun with internal search

Let’s see if the SEO community can take this further – what crazy internal search results or user-generated content can you find out in the wild? Let us know with a screenshot or a comment below!

Matt Tutt is a technical SEO specialist from the UK, working remotely with a number of brands around the world. He loves diagnosing technical SEO issues on websites and helping to implement solutions to improve site indexing. You can learn more about Matt’s services and his life as a remote SEO consultant on his website or on Twitter.