Crawling, indexing and Python: all you need to know

May 31, 2021 - 21 min reading time - by Daniel Heredia

I would like to start this article with a very simple equation: if your pages are not crawled, they will never be indexed and, therefore, your SEO performance will always suffer (and stink).

As a consequence, SEOs need to find the best way to make their websites crawlable and to provide Google with their most important pages so that they get indexed and start acquiring traffic.

Thankfully, we have many resources that can help us improve our website's crawlability, such as Screaming Frog, Oncrawl or Python. I will show you how Python can help you analyze and improve your crawl friendliness and indexing indicators. Most of the time, these sorts of improvements also lead to better rankings, higher visibility in the SERPs and, eventually, more users landing on your website.

1. Requesting indexing with Python

1.1. For Google

Requesting indexing for Google can be done in several ways, although sadly I am not very convinced by any of them. I will walk you through three different options with their pros and cons:

  1. Selenium and Google Search Console: from my point of view, and after testing it and the rest of the options, this is the most effective solution. However, after a number of submissions, a captcha pop-up may appear and break the automation.
  2. Pinging a sitemap: it definitely helps to get sitemaps crawled on request, but not specific URLs, for instance when new pages have been added to the website.
  3. Google Indexing API: it is not very reliable except for broadcasters and job platform websites. It helps to increase crawl rates, but not to index specific URLs.

After this quick overview of each method, let’s dive into them one by one.

 

1.1.1. Selenium and Google Search Console

Essentially, in this first solution we will access Google Search Console from a browser with Selenium and replicate, in an automated way, the same process that we would follow manually to submit many URLs for indexing with Google Search Console.

Note: Don’t overuse this method and only submit a page for indexing if its content has been updated or if the page is completely new.

The trick to logging into Google Search Console with Selenium is to access the OAuth Playground first, as I explained in this article about how to automate the GSC crawl stats report download.

#We import these modules
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
 
#We install our Selenium Driver
driver = webdriver.Chrome(ChromeDriverManager().install())
 
#We access the OAuth Playground to log into Google services
driver.get('https://accounts.google.com/o/oauth2/v2/auth/oauthchooseaccount?redirect_uri=https%3A%2F%2Fdevelopers.google.com%2Foauthplayground&prompt=consent&response_type=code&client_id=407408718192.apps.googleusercontent.com&scope=email&access_type=offline&flowName=GeneralOAuthFlow')
 
#We wait a bit to make sure rendering is complete before selecting elements with Xpath and introducing our email address.
time.sleep(10)
form1=driver.find_element_by_xpath('//*[@id="identifierId"]')
form1.send_keys("<your email address>")
form1.send_keys(Keys.ENTER)
 
#Same here, we wait a bit and then we introduce our password.
time.sleep(10)
form2=driver.find_element_by_xpath('//*[@id="password"]/div[1]/div/div[1]/input')
form2.send_keys("<your password>")
form2.send_keys(Keys.ENTER)

 

After that, we can access our Google Search Console URL:

driver.get('https://search.google.com/search-console?resource_id=your_domain')

time.sleep(5)
#We introduce the URL that we want to submit for indexing in the URL inspection box
box=driver.find_element_by_xpath('/html/body/div[7]/div[2]/header/div[2]/div[2]/div[2]/form/div/div/div/div/div/div[1]/input[2]')
box.send_keys("your_URL")
box.send_keys(Keys.ENTER)
time.sleep(5)

#We click on the element to request indexing and wait while Google processes the request
indexation = driver.find_element_by_xpath("/html/body/div[7]/c-wiz[2]/div/div[3]/span/div/div[2]/span/div[2]/div/div/div[1]/span/div[2]/div/c-wiz[2]/div[3]/span/div/span/span/div/span/span[1]")
indexation.click()

time.sleep(120)

 

Unfortunately, as explained in the introduction, it seems that after a number of requests Google starts requiring a puzzle captcha to proceed with the indexing request. Since the automated method cannot solve the captcha, this handicaps the solution.
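
There is no clean workaround for the captcha itself, but you can at least make the script fail gracefully. Here is a minimal sketch (assuming the same XPath used above for the indexing element): if the element cannot be found, which is usually what happens when a captcha or a layout change gets in the way, the script stops instead of crashing.

from selenium.common.exceptions import NoSuchElementException

try:
    #Same XPath as in the block above for the indexing request element
    indexation = driver.find_element_by_xpath("<XPath of the indexing request element>")
    indexation.click()
    time.sleep(120)
except NoSuchElementException:
    #The expected element was not found: a captcha has probably appeared, so we stop here
    print("Indexing request element not found; a captcha may be blocking the request.")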

 

1.1.2. Pinging a sitemap

Sitemap URLs can be submitted to Google with the ping method. Basically, you only need to make a request to the following endpoint, passing your sitemap URL as a parameter:

http://www.google.com/ping?sitemap=URL/of/file

This can be automated very easily with Python, as I explained in this article.

import urllib.request

#We ping Google with our sitemap URL; a 200 response means the ping was received
url = "http://www.google.com/ping?sitemap=https://www.example.com/sitemap.xml"
response = urllib.request.urlopen(url)
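
If you maintain several sitemaps, a small loop can ping each of them and print the HTTP status code that Google returns. A minimal sketch, assuming hypothetical sitemap URLs:

import urllib.parse
import urllib.request

#Hypothetical list of sitemap files to ping
sitemaps = [
    "https://www.example.com/sitemap.xml",
    "https://www.example.com/sitemap-news.xml",
]

for sitemap in sitemaps:
    #We URL-encode the sitemap location and append it to the ping endpoint
    ping_url = "http://www.google.com/ping?sitemap=" + urllib.parse.quote(sitemap, safe="")
    response = urllib.request.urlopen(ping_url)
    print(sitemap + ": " + str(response.getcode()))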

 

1.1.3. Google Indexing API

The Google Indexing API can be a good solution to improve your crawl rates, but usually it is not a very effective method to get your content indexed, as it is only supposed to be used if your website has either JobPosting structured data or BroadcastEvent embedded in a VideoObject. However, if you would like to give it a try and test it yourself, you can follow the next steps.

First of all, to get started with this API you need to go to Google Cloud Console, create a project and a service account credential. After that, you will need to enable the Indexing API from the Library and add the email address that is provided with the service account credentials as a property owner in Google Search Console. You might need to use the old version of Google Search Console to be able to add this email address as a property owner.

Once you have followed the previous steps, you will be able to start requesting indexing and deindexing with this API by using the next piece of code:

from oauth2client.service_account import ServiceAccountCredentials
import httplib2
import json

SCOPES = [ "https://www.googleapis.com/auth/indexing" ]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
client_secrets = "path_to_your_credentials.json"

#We authenticate with the service account credentials and build an authorized http object
credentials = ServiceAccountCredentials.from_json_keyfile_name(client_secrets, scopes=SCOPES)
http = credentials.authorize(httplib2.Http())

list_urls = ["https://www.example.com", "https://www.example.com/test2/"]

for url in list_urls:

    #We notify Google that the URL has been updated; use "URL_DELETED" to request deindexing instead
    content = json.dumps({"url": url, "type": "URL_UPDATED"})
    response, content = http.request(ENDPOINT, method="POST", body=content)
    print(response)
    print(content)

 

If you would like to request deindexing, you need to change the request type from “URL_UPDATED” to “URL_DELETED”. The previous piece of code will print the responses from the API with the notification times and their statuses. If the status is 200, the request has been made successfully.
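
For example, a deindexing notification would look like this (a short sketch reusing the ENDPOINT and the authorized http object from the block above, with a hypothetical URL):

#We ask Google to remove a URL from its index instead of updating it
content = '{"url": "https://www.example.com/old-page/", "type": "URL_DELETED"}'
response, content = http.request(ENDPOINT, method="POST", body=content)
print(response)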

 

1.2. For Bing

Very often when we speak about SEO we only think of Google, but we cannot forget that in some markets other search engines are predominant or, like Bing, hold a respectable market share.

It is important to mention from the beginning that Bing already has a very convenient feature in Bing Webmaster Tools that lets you request the submission of up to 10,000 URLs per day in most cases. Sometimes your daily quota might be lower than 10,000 URLs, but you have the option of requesting a quota increase if you need more. You can read more about this on this page.

This feature is indeed very convenient for bulk URL submissions, as you only need to enter your URLs on separate lines in the URL submission tool in the normal interface of Bing Webmaster Tools.

 

1.2.1. Bing Indexing API

The Bing Indexing API can be used with an API key that needs to be passed as a parameter. This API key can be obtained in Bing Webmaster Tools by going to the API access section and generating an API key.

Once you have obtained the API key, you can play around with the API with the following piece of code (you only need to add your API key and your site URL):

import requests 

list_urls = ["https://www.example.com", "https://www.example.com/test2/"]

for y in list_urls:

    url = 'https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlbatch?apikey=yourapikey'
    myobj = '{"siteUrl":"https://www.example.com", "urlList":["'+ str(y) +'"]}'
    headers = {'Content-type': 'application/json; charset=utf-8'}


    x = requests.post(url, data=myobj, headers=headers)
    print(str(y) + ": " + str(x))

 

This will print each URL and its response code on every iteration. In contrast to the Google Indexing API, this API can be used for any sort of website.
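
Since the SubmitUrlbatch endpoint accepts a list of URLs, you could also send all of your URLs in a single request rather than one request per URL. A minimal sketch, assuming the same placeholder API key and site URL as above:

import json
import requests

api_url = 'https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlbatch?apikey=yourapikey'

#We send the whole list of URLs in one batch request
payload = json.dumps({
    "siteUrl": "https://www.example.com",
    "urlList": ["https://www.example.com", "https://www.example.com/test2/"]
})
headers = {'Content-type': 'application/json; charset=utf-8'}

response = requests.post(api_url, data=payload, headers=headers)
print(response.status_code)
print(response.json())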


2. Sitemaps analysis, creation and upload

As we all know, sitemaps are very useful for providing search engine bots with the URLs that we would like them to crawl. In order to let search engine bots know where our sitemaps are, they should be uploaded to Google Search Console and Bing Webmaster Tools and referenced in the robots.txt file for the rest of the bots.

With Python we can work on three main aspects related to sitemaps: analyzing them, creating them, and uploading them to and deleting them from Google Search Console.

2.1. Sitemap importation and analysis with Python

Advertools is a great library created by Elias Dabbas that can be used for sitemap importation as well as many other SEO tasks. You will be able to import sitemaps into DataFrames just by using:

from advertools import sitemap_to_df

sitemap_to_df('https://example.com/robots.txt', recursive=False)

This library supports regular XML sitemaps, news sitemaps and video sitemaps.

On the other hand, if you are only interested in importing the URLs from the sitemap, you can also use the requests library and BeautifulSoup.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/your_sitemap.xml")
xml = r.text

#We parse the XML and extract the text of every <loc> element (the "xml" parser requires the lxml package)
soup = BeautifulSoup(xml, "xml")
urls = soup.find_all("loc")
urls = [x.text for x in urls]

 

Once the sitemap has been imported, you can play around with the extracted URLs and perform a content analysis as explained by Koray Tuğberk in this article.
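
As a quick first look before any deeper content analysis, you can group the extracted URLs by their first path segment with pandas to see how the sitemap is distributed across the sections of the site. A minimal sketch, assuming the urls list built above:

import pandas as pd
from urllib.parse import urlparse

#We build a DataFrame from the list of URLs extracted from the sitemap
df = pd.DataFrame(urls, columns=["url"])

#We take the first path segment of each URL as a rough "section" of the website
df["section"] = df["url"].apply(lambda u: urlparse(u).path.strip("/").split("/")[0] or "home")

#We count how many sitemap URLs fall into each section
print(df["section"].value_counts())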

2.2. Sitemaps creation with Python

You can also use Python to create a sitemap.xml from a list of URLs, as JC Chouinard explained in this article. This can be especially useful for very dynamic websites whose URLs change rapidly; together with the ping method explained above, it can be a great way to provide Google with new URLs and get them crawled and indexed fast.
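
As a very simplified illustration of the idea (not JC Chouinard's exact code), you can build a basic sitemap.xml from a list of URLs with nothing but the standard library:

from datetime import date

#Hypothetical list of URLs to include in the sitemap
list_urls = ["https://www.example.com/", "https://www.example.com/test2/"]
today = date.today().isoformat()

#We build one <url> entry per URL with today's date as lastmod
xml_entries = ""
for url in list_urls:
    xml_entries += "  <url>\n    <loc>" + url + "</loc>\n    <lastmod>" + today + "</lastmod>\n  </url>\n"

sitemap = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
           + xml_entries + '</urlset>')

#We write the sitemap to a file that can then be uploaded and pinged
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)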

Recently, Greg Bernhardt also created an app with Streamlit and Python to generate sitemaps.

2.3. Uploading and deleting sitemaps from Google Search Console

Google Search Console has an API that can be used in two main ways: to extract data about search performance and to handle sitemaps. In this post, we will focus on uploading and deleting sitemaps.

First, it is important to create a new project (or use an existing one) in Google Cloud Console to obtain an OAuth credential and enable the Google Search Console service. JC Chouinard explains very well the steps that you need to follow to access the Google Search Console API with Python and how to make your first request in this article. Basically, we can make use of his code with just one change: in the scopes we will add “https://www.googleapis.com/auth/webmasters” instead of “https://www.googleapis.com/auth/webmasters.readonly”, as we will use the API not only to read but also to upload and delete sitemaps.
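
For reference, the connection can look roughly like this with google-auth-oauthlib and google-api-python-client (a sketch under the assumption that you have downloaded your OAuth client secrets as client_secrets.json; JC Chouinard's article covers the full flow in detail and may use a different auth library):

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

#We use the full (read and write) Search Console scope so that we can manage sitemaps
SCOPES = ["https://www.googleapis.com/auth/webmasters"]

#We run the OAuth flow in the browser with the credentials downloaded from Google Cloud Console
flow = InstalledAppFlow.from_client_secrets_file("client_secrets.json", scopes=SCOPES)
credentials = flow.run_local_server(port=0)

#We build the Search Console (webmasters) service client
webmasters_service = build("webmasters", "v3", credentials=credentials)

#We list the properties that this account can access
site_list = webmasters_service.sites().list().execute()
verified_sites_urls = [s["siteUrl"] for s in site_list.get("siteEntry", [])
                       if s.get("permissionLevel") != "siteUnverifiedUser"]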

Once we connect to the API, we can start playing with it and list all the sitemaps from our Google Search Console properties with the next piece of code:

for site_url in verified_sites_urls:
  print (site_url)
  # Retrieve list of sitemaps submitted
  sitemaps = webmasters_service.sitemaps().list(siteUrl=site_url).execute()
  if 'sitemap' in sitemaps:
    sitemap_urls = [s['path'] for s in sitemaps['sitemap']]
    print ("  " + "\n  ".join(sitemap_urls))

 

When it comes to specific sitemaps, there are three tasks that we will elaborate on in the next sections: uploading, deleting and requesting information.

 

2.3.1. Uploading a sitemap

To upload a sitemap with Python we only need to specify the site URL and the sitemap path and run this piece of code:

WEBSITE = 'yourGSCproperty'
SITEMAP_PATH = 'https://www.example.com/page-sitemap.xml'

webmasters_service.sitemaps().submit(siteUrl=WEBSITE, feedpath=SITEMAP_PATH).execute()

 

2.3.2. Deleting a sitemap

The other side of the coin is when we would like to delete a sitemap. We can also delete sitemaps from Google Search Console with Python using the “delete” method instead of “submit”.

WEBSITE = 'yourGSCproperty'
SITEMAP_PATH = 'https://www.example.com/page-sitemap.xml'

webmasters_service.sitemaps().delete(siteUrl=WEBSITE, feedpath=SITEMAP_PATH).execute()

 

2.3.3. Requesting information from the sitemaps

Finally, we can also request information from the sitemap by using the method “get”.

WEBSITE = 'yourGSCproperty'
SITEMAP_PATH = 'https://www.example.com/page-sitemap.xml'

webmasters_service.sitemaps().get(siteUrl=WEBSITE, feedpath=SITEMAP_PATH).execute()

 

This will return a response in JSON format with information about the sitemap, such as its path, the last submission and download dates, whether it is a sitemap index, and the number of warnings and errors it contains.

3. Internal linking analysis and opportunities

Having a proper internal linking structure makes it much easier for search engine bots to crawl your website. Some of the main issues that I have come across while auditing a number of websites with very sophisticated technical setups are:

  1. Links introduced with on-click events: in short, Googlebot does not click on buttons, so if your links are inserted with an on-click event, Googlebot will not be able to follow them.
  2. Client-side rendered links: although Googlebot and other search engines are getting much better at executing JavaScript, it is still quite challenging for them, so it is much better to server-side render these links and serve them in the raw HTML to search engine bots rather than expecting them to execute JavaScript.
  3. Login and/or age gate pop-ups: login pop-ups and age gates can prevent search engine bots from crawling the content that is behind these “obstacles”.
  4. Nofollow attribute overuse: using many nofollow attributes pointing to valuable internal pages will prevent search engine bots from crawling them.
  5. Noindex and follow: technically the combination of noindex and follow directives should let search engine bots crawl the links that are on that page. However, it seems that Googlebot stops crawling those pages with noindex directives after a while.

With Python we can analyse our internal linking structure and find new internal linking opportunities in bulk mode.

3.1. Internal Linking Analysis with Python

Some months ago I wrote an article about how to use Python and the NetworkX library to create graphs that display the internal linking structure in a very visual way.

This is very similar to what you can obtain from Screaming Frog, but the advantage of using Python for this sort of analysis is that you can choose the data that you would like to include in these graphs and control most of the graph elements, such as colors, node sizes or even which pages you would like to add.
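
As an illustration of the idea (not the exact code from that article), here is a short sketch that assumes a hypothetical internal_links.csv export with Source and Destination columns, for instance from a crawler export:

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

#Hypothetical export of internal links with "Source" and "Destination" columns
df = pd.read_csv("internal_links.csv")

#We build a directed graph where each node is a URL and each edge an internal link
G = nx.from_pandas_edgelist(df, source="Source", target="Destination", create_using=nx.DiGraph())

#We size each node according to how many internal links point to it
node_sizes = [50 + 20 * G.in_degree(node) for node in G.nodes()]

nx.draw(G, node_size=node_sizes, with_labels=False, arrows=True)
plt.show()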

3.2. Finding new internal linking opportunities with Python

Apart from analyzing site structures, you can also make use of Python to find new internal linking opportunities by providing a number of keywords and URLs, and iterating over those URLs searching for the provided terms in their content.

This can work very well with Semrush or Ahrefs exports to find powerful contextual internal links from pages that are already ranking for keywords and hence already have some authority.

You can read more about this method over here.
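
To give you an idea of how that logic can look, here is a minimal sketch that assumes a hypothetical dictionary mapping keywords to the pages you would like to link to, and a list of candidate pages to scan (for instance, URLs already ranking according to a Semrush or Ahrefs export):

import requests
from bs4 import BeautifulSoup

#Hypothetical inputs: keyword -> page we would like to link to, and candidate pages to scan
opportunities = {"python seo": "https://www.example.com/python-seo/"}
candidate_pages = ["https://www.example.com/blog/post-1/", "https://www.example.com/blog/post-2/"]

for page in candidate_pages:
    html = requests.get(page).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ").lower()
    existing_links = [a.get("href") for a in soup.find_all("a", href=True)]

    for keyword, target_url in opportunities.items():
        #We flag pages that mention the keyword but do not link to the target URL yet
        if keyword in text and target_url not in existing_links:
            print(page + " mentions '" + keyword + "' but does not link to " + target_url)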

4. Website Speed, 5xx and soft error pages

As stated by Google on this page about what crawl budget means for Googlebot, making your site faster improves the user experience and increases the crawl rate. On the other hand, there are also other factors that can negatively affect crawl budget, such as soft error pages, low-quality content and on-site duplicate content.

4.1. Page speed and Python

4.1.1. Analyzing your website speed with Python

The PageSpeed Insights API is super useful for analyzing how your website is performing in terms of page speed and for getting lots of data about many different metrics (almost 50), plus the Core Web Vitals.

Working with PageSpeed Insights from Python is very straightforward: only an API key and a request to the endpoint are needed. For example:

import urllib.request, json
url = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=your_URL&strategy=mobile&locale=en&key=yourAPIKey"
#Insert your page in the url parameter and change the strategy parameter to "desktop" if you would like desktop data.
response = urllib.request.urlopen(url)
data = json.loads(response.read()) 
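
From the returned JSON you can then extract the metrics you are interested in. A short sketch using the data object from above (note that these are the keys of the current version of the API response and may evolve over time):

#Overall Lighthouse performance score (between 0 and 1) from the lab data
performance_score = data["lighthouseResult"]["categories"]["performance"]["score"]

#A couple of lab metrics taken from the Lighthouse audits
lcp = data["lighthouseResult"]["audits"]["largest-contentful-paint"]["displayValue"]
cls = data["lighthouseResult"]["audits"]["cumulative-layout-shift"]["displayValue"]

print("Performance score: " + str(performance_score))
print("LCP: " + str(lcp))
print("CLS: " + str(cls))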

 

In addition, using Python and the Lighthouse Scoring Calculator, you can also forecast how much your overall performance score would improve if you made the suggested changes to enhance your page speed, as explained in this article.

 

4.1.2. Image Optimization and Resizing with Python

Related to website speed, Python can also be used to optimize, compress and resize images, as explained in these articles written by Koray Tuğberk and Greg Bernhardt:

  1. Automate Image Compression with Python over FTP.
  2. Resize Images with Python in Bulk.
  3. Optimize images via Python for SEO and UX.

4.2. 5xx and other response code errors extraction with Python

5xx response code errors can be an indication that your server is not fast enough to cope with all the requests it is receiving. This can have a very negative impact on your crawl rate and it can also damage the user experience.

To make sure that your website is performing as expected, you can automate the crawl stats report download with Python and Selenium, and you can keep a close eye on your log files.

4.3. Soft error pages extraction with Python

Recently, Jose Luis Hernando published an article in honour of Hamlet Batista about how you can automate the coverage report extraction with Node.js. This can be an amazing solution to extract the soft error pages and even the 5xx response errors that might negatively impact your crawl rate.

We can also replicate this same process with Python in order to compile, in a single file, all the URLs that Google Search Console reports as erroneous, valid with warnings, valid and excluded.

First, we need to log into Google Search Console with Python and Selenium, as explained previously in this article. After that, we will select all the URL status boxes, display up to 100 rows per page, and iterate over all the types of URLs reported by GSC, downloading every Excel file.

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://accounts.google.com/o/oauth2/v2/auth/oauthchooseaccount?redirect_uri=https%3A%2F%2Fdevelopers.google.com%2Foauthplayground&prompt=consent&response_type=code&client_id=407408718192.apps.googleusercontent.com&scope=email&access_type=offline&flowName=GeneralOAuthFlow')

time.sleep(5)
searchBox=driver.find_element_by_xpath('//*[@id="identifierId"]')
searchBox.send_keys("<youremailaddress>")
searchBox.send_keys(Keys.ENTER)
time.sleep(5)
searchBox=driver.find_element_by_xpath('//*[@id="password"]/div[1]/div/div[1]/input')
searchBox.send_keys("<yourpassword>")
searchBox.send_keys(Keys.ENTER)
time.sleep(5)

#We ask for the Search Console property; domain properties must be prefixed with "sc-domain:"
yourdomain = str(input("Insert here your http property or domain. If it is a domain, include 'sc-domain:': "))

#We open the Index Coverage report for the property and tick the additional URL status boxes
driver.get('https://search.google.com/search-console/index?resource_id=' + yourdomain)
elem3 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[4]/div[1]/span/span/div/div[1]/span[2]")
elem3.click()
elem6 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[3]/div[1]/span/span/div/div[1]/span[2]")
elem6.click()
elem7 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[2]/div[1]/span/span/div/div[1]/span[2]")
elem7.click()
elem4 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[2]/div/div/span[1]/span/div[2]/div[1]/div[1]/div[2]/span")
elem4.click()
time.sleep(2)
elem5 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[2]/div/div/span[1]/span/div[2]/div[2]/div[5]/span")
elem5.click()
time.sleep(5)


problems = driver.find_elements_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[1]/div[2]/table/tbody/*/td[2]/span/span")


#We store the names of all the issue types listed in the coverage table
list_problems = []
for x in problems:
    list_problems.append(x.text)

#We iterate over every issue type, open its report and download it as an Excel file
counter = 1
for x in list_problems:
    
    print("Extracting the report for " + x)
    elem2 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[1]/div[2]/table/tbody/tr[" + str(counter) + "]/td[2]/span/span")
    elem2.click()
    time.sleep(5)
    #We click on the export button and download the report as an Excel file
    downloading1 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz[2]/c-wiz/div/div[1]/div[1]/div[2]/div")
    downloading1.click()
    time.sleep(2)
    downloading2 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz[2]/div/div/div/span[2]")
    downloading2.click()
    time.sleep(5)
    
    #Unless this was the last issue type, we go back to the coverage report and reapply the filters for the next iteration
    if counter != len(list_problems):
        
        driver.get('https://search.google.com/search-console/index?resource_id=' + yourdomain)
        time.sleep(3)
        elem3 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[4]/div[1]/span/span/div/div[1]/span[2]")
        elem3.click()
        elem6 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[3]/div[1]/span/span/div/div[1]/span[2]")
        elem6.click()
        elem7 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/div/span/c-wiz/div/div[1]/div/div[2]/div[1]/span/span/div/div[1]/span[2]")
        elem7.click()
        elem4 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[2]/div/div/span[1]/span/div[2]/div[1]/div[1]/div[2]/span")
        elem4.click()
        time.sleep(2)
        elem5 = driver.find_element_by_xpath("/html/body/div[*]/c-wiz/span/c-wiz/c-wiz/div/div[3]/div[2]/div/div/c-wiz/div/span/div/div[2]/div/div/span[1]/span/div[2]/div[2]/div[5]/span")
        elem5.click()
        time.sleep(2)
        counter = counter + 1
        
    else:
        driver.close()

 

You can also watch an explanatory video of this process.

Once we have downloaded all the Excel files, we can put all of them together with Pandas:

import pandas as pd
from datetime import datetime

today = str(datetime.date(datetime.now()))

#We loop over every downloaded Excel file, read it and tag each URL with its issue type
for x in range(len(list_problems)):
    print(x)
    if x == 0:
        df1 = pd.read_excel(yourdomain.replace("sc-domain:","").replace("/","_").replace(":","_") + "-Coverage-Drilldown-" + today + ".xlsx", 'Tabla')
        listvalues = [list_problems[x] for i in range(len(df1["URL"]))]
        df1['Type'] = listvalues
        list_results = df1.values.tolist()
        
    else:
        df2 = pd.read_excel(yourdomain.replace("sc-domain:","").replace("/","_").replace(":","_") + "-Coverage-Drilldown-" + today + " (" + str(x) + ").xlsx", 'Tabla')
        listvalues = [list_problems[x] for i in range(len(df2["URL"]))]
        df2['Type'] = listvalues
        list_results = list_results + df2.values.tolist()
        
        
#We merge everything into a single DataFrame and export it as a CSV file
df = pd.DataFrame(list_results, columns= ["URL","TimeStamp", "Type"])
df.to_csv('<filename>.csv', header=True, index=False, encoding = "utf-8")

 

The final output is a single CSV file listing every URL together with its timestamp and the type of issue or status reported by Google Search Console.

4.4. Log file analysis with Python

As well as the data available in the crawl stats report from Google Search Console, you can also analyze your own log files using Python to get much more information about how search engine bots are crawling your website. If you aren’t already using a log analyzer for SEO, you can read this article from SEO Garden, where log analysis with Python is explained.
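
As a very rough illustration (a sketch that assumes a hypothetical access.log file in the common combined Apache/Nginx format), you can load a log file with pandas and isolate Googlebot requests to see which URLs are crawled most often and which status codes they return:

import re
import pandas as pd

#Regular expression for a combined-format log line: IP, timestamp, request, status, size, referrer, user agent
log_pattern = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

rows = []
with open("access.log", "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        match = log_pattern.search(line)
        if match:
            rows.append(match.groups())

df = pd.DataFrame(rows, columns=["ip", "timestamp", "method", "url", "status", "user_agent"])

#We keep only the requests that identify themselves as Googlebot (ideally you would also verify the IPs)
googlebot = df[df["user_agent"].str.contains("Googlebot", case=False)]

#Most crawled URLs and the distribution of response codes returned to Googlebot
print(googlebot["url"].value_counts().head(20))
print(googlebot["status"].value_counts())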


5. Final conclusions

We have seen that Python can be a great asset to analyze and improve the crawling and indexing of our websites in many different ways. We’ve also seen how to make life much easier by automating most of the tedious and manual tasks that would require thousands of hours of your time.

I must say that unfortunately I am not fully convinced by the solutions that are offered at the moment by Google to request indexing for a large number of URLs, although I can understand to some extent its fear of offering a better solution: many SEOs might tend to overuse it.

In contrast to that, there is Bing, which offers exceptional and convenient solutions to request URL indexing via API and even through the normal interface on Bing Webmaster Tools.

Because the Google Indexing API has room for improvement, other elements, like having an accessible and up-to-date sitemap in place, your internal linking, your page speed, your soft error pages and your duplicate and low-quality content, become even more important to ensure that your website is properly crawled and your most important pages are indexed.

Daniel Heredia
Daniel has a very analytical mindset and his work methodology is based on action, evaluation and decision. He is currently based in Malta and works as an SEO marketer for Casumo, a large iGaming company.