The webinar Rankbrain, AI, machine learning and the future of search is a part of the SEO in Orbit series, and aired on June 19th, 2019. In this episode, Bill Slawski leverages his knowledge of Google patents and the workings of search to break down probable search algorithms used today and postulate what it might look like under the hood of a future version of Google. Join us while we explore the future of technical SEO.
SEO in Orbit is the first webinar series sending SEO into space. Throughout the series, we discussed the present and the future of technical SEO with some of the finest SEO specialists and sent their top tips into space on June 27th, 2019.
Presenting Bill Slawski
A self-taught search engine patent expert, Bill Slawski is the Director of SEO Research at Go Fish Digital and a blogger at SEO by the Sea. In Bill’s own words: “I am not a computer scientist, and I am not a mathematician. I have an undergraduate degree in English and a Juris Doctor degree in Law. I have been reading patents from the search engines since around 2005, to learn about what they have to say about search, searchers, and the Web. Many of these patents cover algorithms that aim to address particular problems, and I have found many helpful when it comes to performing SEO.”
This episode was hosted by François Goube, serial entrepreneur, and the Co-Founder and CEO of Oncrawl. He has founded several companies and is actively involved in the startup ecosystem. Passionate about semantic analysis and search engines, he loves to analyze scientific Google publications and is a regular speaker at SEO conferences.
What are AI and machine learning?
There are a lot of definitions of AI.
A lot of Google’s work focuses on neural networks, which leads towards how machine learning works. It uses a set of data that represents the ideal dataset, marked up to stress certain features about it, that is used to train classifiers. These are then turned loose on other sets of data to analyze and classify the new information based on what they learned from the sample set. That’s machine learning.
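To make the train-then-classify idea concrete, here is a minimal sketch of a classifier that learns from a small marked-up sample and is then turned loose on new text. It is a toy naive Bayes model, not anything Google is known to use, and the training texts and labels are invented for illustration.

```python
from collections import Counter, defaultdict
import math

def train(labeled_docs):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled sample set (the "marked up" training data).
TRAINING = [
    ("saddle bridle stable riding", "equestrian"),
    ("gallop rider saddle horse", "equestrian"),
    ("sawhorse lumber carpentry saw", "woodworking"),
    ("workbench saw lumber clamp", "woodworking"),
]

model = train(TRAINING)
print(classify("a saddle for riding", model))
print(classify("cut lumber with a saw", model))
```

The classifier only knows what the sample set taught it, which is why the quality and labeling of the training data matter so much.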
Areas covered by AI
– Natural language
AI can cover different areas, like better understanding natural language. There are a number of techniques involved, and many of the things coming up from Google illustrate what’s involved in natural language analysis.
– Question answering
A recent patent tries to fill in the blanks in question-answering schemas.
It explains how Google might use a knowledge graph to understand what the answer to a question might be. For example, if there is information missing or incorrect data for entities, Google might try to estimate the answer based on information associated with related facts.
What’s interesting about this patent is not that Google is using estimation to answer questions, but that they’re providing the explanations for their estimates.
– Mimicking human thought (neural networks)
Machine learning is based upon AI, on mimicking the way human thought might work. Machine learning networks are called neural networks because they’re built to attempt to replicate the way neurons in a brain work.
– Relation to Hummingbird and word context
Both Rankbrain and Hummingbird are query rewriting approaches. Hummingbird tried to better understand the context of a query by looking at all of the words in it together. Previously, Google would only look at words next to one another to understand context; Hummingbird looks beyond the words immediately next to each other. It might even take into account full sentences in conversational queries.
– Query-rewriting in Rankbrain using word embedding approach
Unlike Hummingbird, Rankbrain uses a word embedding approach. It examines a short textual passage and is able to determine if there are words that are missing. It does this by training on large sets of data (200 billion words).
– Finding missing words in query
For example, the query “New York Times puzzle” can be correctly interpreted as missing the word “crossword”. Rankbrain adds the missing word to the query and returns results for New York Times crossword puzzle to the searcher, since that is probably what they want.
– Can you optimize for Rankbrain?
It’s important to note that you can’t optimize pages for Rankbrain. Some SEOs have written articles saying that you can. However, everything Bill has seen about the algorithm suggests that it is a query rewriting process, not something that affects the evaluation of a page.
Additional Google algorithms using machine learning
Google does not have a single “algorithm” that drives the search engine. It has a lot of different algorithms that contribute to how it works. Rankbrain is one of many.
– Using quality scores within categories
This might mean, for example, that when Google determines that there are a lot of informational type results for a given query, instead of ranking pages based on information retrieval score or authority ratings like PageRank, they might consider categories. From there, they might give Quality Scores within website categories. This provides a more diverse set of results and ensures that higher quality results can move more quickly to the top of the results.
– Page popularity for navigational results
This type of ranking algorithm also favors pages that are more popular (pages that people tend to go to), particularly for navigational type results. When searchers already know that the page is something that they want to see, the page will tend to rank highly in category Quality Score paradigms.
– Influence of SERP CTR
Category Quality Scores also suggest that pages that are often selected in the search results are high quality pages and would rank highly under this category quality approach.
However, though a category quality score approach is definitely machine learning, it’s not Rankbrain.
Rankbrain for meeting situational needs of searchers
Rankbrain is trying to understand what may be missing in a query. The most important aspect of Rankbrain is that it attempts to meet the situational needs of searchers: what did this person really mean when they typed the query into the box?
Past keyword queries vs current spoken and conversational queries
If we’re moving towards spoken and conversation-type queries, there will be more words involved than in the keyword approach that was used in the past.
As a searcher, you’re trying to guess what words you need to use to find the information you need. And you shouldn’t need to make this sort of guess. If you ask for what you want, Google should be able to analyze it and determine what you probably meant. This is the role of Rankbrain.
Natural language processing approaches
One of the things we’re seeing is Google paying a lot more attention to natural language processing, with several natural language processing approaches appearing.
– Neural Matching
Danny Sullivan tweeted a bit about something he called neural matching.
Last few months, Google has been using neural matching, –AI method to better connect words to concepts. Super synonyms, in a way, and impacting 30% of queries. Don’t know what “soap opera effect” is to search for it? We can better figure it out. pic.twitter.com/Qrwp5hKFNz
— Danny Sullivan (@dannysullivan) September 24, 2018
He said this is a means of better understanding words on pages and the meaning of those words in context. He provided some examples of how one word might mean three or four different things depending on how it’s positioned within a sentence.
– Word Embedding
Google has been releasing patents about using a word embedding type approach (like they used in Rankbrain in order to understand those short textual queries) for longer amounts of text, like web pages.
– Semantic Frames
A semantic frame is when you use language ideal for a certain situation. In each situation, there’s certain language that is used. For example, points in the context of mortgage or real estate purchase don’t have the same meaning as points in dice or board games.
If you understand the framework, you can better understand the context of words on a page.
This can also help differentiate between words where the meaning itself differs from situation to situation. “Horse”, for example, does not mean the same thing to an equestrian and to a carpenter. Other patents have also explored additional methods of understanding contextual differences in meaning.
Using machine learning to identify authors based on writing styles
It’s quite easy for a machine to identify the writing style of an individual. There is a parallel between this and thematic classifications of content due to standardized styles in industries such as real estate, sports, etc.
As an English student, Bill analyzed literature and looked at the different ways authors expressed themselves, and why.
– Author scores patent using citation frequency
Google does have a patent concerning author scores. To score authors, one of the factors taken into consideration is how often they are cited by other writers.
– Google Books N-Gram viewer
Google does a lot of work with language models. They’ve scanned a large number of books. The N-Gram viewer allows you to see how the popularity of a phrase evolves over the years.
– Quality Score patent by N. Panda using language models
The Quality Score patent by N. Panda talks about using N-grams and building language models to understand the quality of web pages based on how they compare to other language models.
This is a great example of machine learning in search engine technology. We have a dataset of previously scored pages, and we’re comparing new pages to those based on the data from the original sample set. Since this is used to determine quality, pages that contain characteristics of well-written pages from the original set will get a higher score.
This type of language model can also be used to understand the writing style of different authors.
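As a rough illustration of comparing new text against a language model built from a reference set, the sketch below builds a bigram model and scores new passages by their average smoothed log-probability. This is an assumption-laden toy, not the method described in the patent: the reference sentences are invented, and add-one smoothing is simply a convenient choice.

```python
from collections import Counter
import math

def build_model(reference_texts):
    """Count bigrams and their leading words across the reference set."""
    pair_counts, first_counts, vocab = Counter(), Counter(), set()
    for text in reference_texts:
        words = text.lower().split()
        vocab.update(words)
        for a, b in zip(words, words[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts, vocab

def average_log_prob(text, model):
    """Average add-one-smoothed bigram log-probability of the text."""
    pair_counts, first_counts, vocab = model
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        total += math.log(
            (pair_counts[(a, b)] + 1) / (first_counts[a] + len(vocab))
        )
    return total / len(pairs)

# Hypothetical reference set standing in for previously scored pages.
REFERENCE = [
    "the patent describes a method for ranking pages",
    "the method scores pages by comparing language models",
]

model = build_model(REFERENCE)
fluent = average_log_prob("the patent describes a method", model)
scrambled = average_log_prob("method a describes patent the", model)
print(fluent > scrambled)
```

Text whose word sequences resemble the reference set scores higher than the same words in an unfamiliar order, which is the basic intuition behind comparing pages to language models.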
Future machine learning with structured data
Machine learning is also evident in how Google manages entities, in translation, and in the appearance of what Cindy Krum has named Fraggles.
– Answer passages and reinforcing textual content
There’s another patent that talks about answer passages, in which Google proposes a mechanism to use textual passages found on webpages to provide answers to questions. This has recently been updated to look not only at textual passages but also at structured data that reinforces the text.
– Fact checking and consistency
Using Schema provides redundancy in information. This gives Google a means of checking the consistency of informational facts on a webpage by comparing the textual information with the information provided in the structured mark-up.
This is the same thing that happens on Google Maps, where Google looks at the consistency of name, address and phone number.
Consistency provides a level of confidence that the answer may be more likely to be correct.
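The consistency idea can be sketched as a simple comparison between structured markup and visible text. In the example below, the business, its phone numbers, and the helper name check_consistency are all hypothetical; how Google actually compares the two is not documented here.

```python
import json

def check_consistency(json_ld, page_text, fields):
    """For each field, report whether the marked-up value also appears in the visible text."""
    data = json.loads(json_ld)
    text = page_text.lower()
    report = {}
    for field in fields:
        value = data.get(field)
        report[field] = bool(value) and str(value).lower() in text
    return report

# Hypothetical JSON-LD markup and page copy that disagree on the phone number.
MARKUP = """{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Example Bakery",
  "telephone": "555-0100"
}"""
PAGE_TEXT = "Visit Example Bakery today, or call us at 555-0199."

report = check_consistency(MARKUP, PAGE_TEXT, ["name", "telephone"])
print(report)
```

A mismatch like the phone number above is the kind of inconsistency that would lower confidence in the fact.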
– FAQ pages and how-to pages
As Google introduces FAQ-page and How-to Schema support, we see them moving towards means of getting site owners to build in Schema that reflects what they might put in the text on a web page.
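For readers who have not seen FAQ markup, here is a minimal sketch that generates FAQPage JSON-LD from question and answer pairs. The helper name faq_json_ld and the sample question are illustrative; the markup should mirror questions and answers that actually appear in the page text.

```python
import json

def faq_json_ld(qa_pairs):
    """Build a Schema.org FAQPage structure from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }

# Hypothetical question taken from the visible text of a page.
faq = faq_json_ld([
    ("Can you optimize for Rankbrain?",
     "No. It rewrites queries rather than evaluating pages."),
])
print(json.dumps(faq, indent=2))
```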
Strategies to understand context on web page
Google has taken other steps to try to understand the content better within web pages. Here are a few:
– Use of knowledge bases and context terms
Google patents have indicated that they might look at knowledge bases and might collect definitions of context terms from those knowledge bases. They might then look for the presence of these context terms on a web page to help determine which context-dependent meaning of a word is most probable.
So a page about a horse (an animal) might contain words like “saddle”, whereas pages about other types of horses might contain words like “carpentry.”
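The context-term idea can be sketched as counting how many terms from each candidate meaning show up on a page. The term lists below are invented stand-ins for definitions collected from a knowledge base, and simple overlap counting is an assumption rather than the patented method.

```python
import re

# Hypothetical context terms per meaning, standing in for a knowledge base.
CONTEXT_TERMS = {
    "horse (animal)": {"saddle", "stable", "rider", "gallop"},
    "horse (sawhorse)": {"carpentry", "lumber", "saw", "workbench"},
}

def likely_sense(page_text, context_terms):
    """Pick the meaning whose context terms overlap most with the page's words."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    scores = {sense: len(terms & words) for sense, terms in context_terms.items()}
    # On a tie, max() returns the first sense listed.
    return max(scores, key=scores.get), scores

sense, scores = likely_sense(
    "The horse needs a new saddle before the rider reaches the stable.",
    CONTEXT_TERMS,
)
print(sense, scores)
```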
– Phrase-based indexing
Another approach to semantic learning for understanding topics on pages dates from 2004 or so. Phrase-based indexing is not only old, but also the subject of at least 20 patents and has been updated and amended several times. All of this indicates to Bill that phrase-based indexing is something that carries a lot of importance in Google’s algorithms.
– Building inverted index of topic predictive phrases
One of the patents associated with phrase-based indexing describes building an inverted index of phrases that appear on pages and that are predictive of topics. An example would be phrases like “President of the United States”, “Secretary of State” or “Rose Garden interview” that are predictive of a semantic topic of “White House”.
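A minimal sketch of such an inverted index follows: each phrase maps to the set of pages that contain it, and a page's topic score is the number of topic-predictive phrases it holds. The page texts are invented, and counting phrases is a simplification of whatever scoring the patent describes.

```python
from collections import Counter, defaultdict

# Phrases predictive of a topic, using the example from the text above.
TOPIC_PHRASES = {
    "White House": [
        "president of the united states",
        "secretary of state",
        "rose garden interview",
    ],
}

def build_inverted_index(pages, phrases):
    """Map each phrase to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        lowered = text.lower()
        for phrase in phrases:
            if phrase in lowered:
                index[phrase].add(page_id)
    return index

def topic_scores(index, topic_phrases):
    """Count, per page, how many phrases predictive of each topic it contains."""
    scores = {}
    for topic, phrases in topic_phrases.items():
        counts = Counter()
        for phrase in phrases:
            counts.update(index.get(phrase, set()))
        scores[topic] = counts
    return scores

# Hypothetical pages.
PAGES = {
    "a": "The President of the United States gave a Rose Garden interview "
         "alongside the Secretary of State.",
    "b": "Our rose garden tips for spring.",
    "c": "The Secretary of State spoke today.",
}

all_phrases = [p for phrases in TOPIC_PHRASES.values() for p in phrases]
index = build_inverted_index(PAGES, all_phrases)
scores = topic_scores(index, TOPIC_PHRASES)
print(scores["White House"])
```

The lookup runs from phrase to pages rather than page to phrases, which is what makes the index "inverted".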
Webmaster subject knowledge in Schema
Google is developing use of things like Schema, but the definition of the type of things that are described by Schema is provided by webmasters. In this way, webmasters are able to contribute to building the knowledge graphs along with the search engines.
For example, Google has added “knowsAbout” as a Schema property. However, webmasters are the ones who indicate that lawyers can know about admiralty law or patent law, which in turn helps fill out the knowledge graph.
The machine-based representation of knowledge is a collaborative effort.
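As an illustration of how a webmaster might declare subject knowledge, the sketch below generates Person markup with the Schema.org knowsAbout property. The person, job title, and helper name person_json_ld are hypothetical.

```python
import json

def person_json_ld(name, job_title, topics):
    """Build Schema.org Person markup listing the topics the person knows about."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "jobTitle": job_title,
        "knowsAbout": list(topics),
    }

# Hypothetical lawyer profile.
person = person_json_ld("Jane Example", "Attorney", ["Admiralty law", "Patent law"])
print(json.dumps(person, indent=2))
```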
Evolving search and outdated SEO practices
– Repeated words in alt text
Telling Google that a photograph of a person needs to be named twice doesn’t help Google understand it twice as well. It’s even possible that it could decrease the search engine’s estimation of the page’s value.
– LSI intended for small static databases
Toolmakers keep suggesting that SEOs use old techniques. One example is latent semantic indexing (LSI), which was developed in 1989. It was intended for small, static databases that aren’t the size of the web and don’t grow at the rate the web does.
Every time you want to use LSI, you need to have the latest version of the database. If you keep on adding information to the corpus, it needs to be run again. This means it’s not very useful for the web.
– TF-IDF works with access to full corpus only
TF-IDF (term frequency-inverse document frequency) is another example. It works best if you have access to the full corpus of the information being indexed, in this case the World Wide Web. You use TF-IDF when you want to know which words are common and which are rare across the entire corpus. But if you only use the top ten ranking pages for certain terms as your corpus instead of the whole web, you can’t establish how common or rare a term actually is.
This can seriously affect the accuracy of your analysis.
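A small worked example shows why the choice of corpus matters. The sketch uses the common formulation tf × log(N / df), and both corpora are invented: FULL stands in for the whole web and TOP for a handful of top-ranking pages on the same topic.

```python
import math

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    """Term frequency in the document, weighted by rarity across the corpus."""
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    return tf * idf(term, corpus)

# Hypothetical corpora.
FULL = [
    "mortgage points lower your rate",
    "compare mortgage lenders today",
    "board game points and dice",
    "horse saddle care",
    "carpentry saw guide",
    "crossword puzzle answers",
]
TOP = FULL[:2]  # only pages that already rank for "mortgage"

print(idf("mortgage", FULL))  # log(6 / 2): the term is fairly rare overall
print(idf("mortgage", TOP))   # log(2 / 2) = 0: every page contains it
print(tf_idf("mortgage", FULL[0], FULL), tf_idf("mortgage", FULL[0], TOP))
```

Because every top-ranking page contains the term, its inverse document frequency collapses to zero in the small corpus, even though the term is distinctive across the larger one.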
Webmaster expectation and Google capabilities: need for communication from Google
Despite recent announcements, we don’t actually know that pagination markup isn’t useful to search engines.
Although pagination markup is no longer used to manage duplicate content on paginated pages, we have certain expectations of Google. They should be able to understand when pages are in a series. Announcements like this reveal the difficulty of knowing how good or how bad Google is at what they do.
Using frequently co-occurring words
Bill’s favorite technical trick is looking at frequently co-occurring words on pages that rank highly for certain terms and making sure he uses those in content, both in the body and in anchor text pointing from his page to related pages. This takes advantage of “anchor hits”, which are supposedly treated by search engines as “expert links.”
This strategy is drawn from phrase-based indexing.
– Statistical probability of phrase co-occurrence
The phrase-based indexing patent was updated about two years ago. This approach now uses how many related terms appear on pages to rank the pages.
However, if more than a statistically probable number of related terms appear on a page, it can be marked as spam. For example, if you scraped a lot of pages on a topic and put them all on one page, you’d have too many related terms for it to have happened naturally.
This fits well with the way Bill does keyword research. He looks at similar pages and creates a list of similar phrases or words that frequently occur. He may try to use some of them on his own page, even if he’s not trying to rank for them. This builds content relevant to the keywords he does want to rank for.
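This kind of keyword research can be sketched as counting which words appear on several similar pages. The page texts, stopword list, and min_pages threshold below are all illustrative assumptions, not Bill's actual tooling.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "on", "a", "your", "at", "are", "to"}

def cooccurring_terms(pages, keyword, min_pages=2):
    """Return words (other than the keyword) that appear on at least min_pages pages."""
    keyword_words = set(keyword.lower().split())
    doc_freq = Counter()
    for text in pages:
        words = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(words - STOPWORDS - keyword_words)
    return sorted(word for word, n in doc_freq.items() if n >= min_pages)

# Hypothetical snippets from pages that rank for the keyword.
PAGES = [
    "Mortgage points lower the interest rate on a loan.",
    "Buying mortgage points reduces your interest rate at closing.",
    "Mortgage points are paid at closing to the lender.",
]

terms = cooccurring_terms(PAGES, "mortgage points")
print(terms)
```

The resulting list is a starting point for words worth working into content naturally, not a quota to hit.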
LSI vs using synonyms or semantically related content
The hype around LSI is one of Bill’s least favorite topics, in part because the term is misleading. What many people are suggesting when they talk about LSI has nothing to do with latent semantic indexing. Instead, they’re just suggesting adding synonyms or semantically-related content to pages.
Phrase-based indexing’s inverted index and the knowledge bases that can provide context terms indicate that there are sources you can go to for helpful words if you’re strictly looking for terms that co-occur on high-ranking pages for your keyword.
Words that seem like they’re synonyms sometimes aren’t, in Google’s estimation.
Quick indexing with the URL submission tool
The URL Submission Tool in the new version of the Google Search Console is a really quick way to get pages indexed. Bill has seen updates propagated to the SERPs within a minute or two.
Bill’s hope for future markup: more information for patents
Audience question: What Schema markup would you like to see added in the future?
Because he writes a lot about patents, Bill would like to see a better way to capture the unique features of patents. Some of these features include:
- Classes (what the patent is intended to address)
- Patent name, though “main entity of page” could cover this feature
Since Google already allows you to search based on Schema features, the goal would be to improve patent look-up, so that people could ask to see patents that cover certain categories.
Is Answer Engine Optimization the future of search?
Audience question: Do you think SEO will become AEO in the future?
Bill believes that, in a way, SEO has always been AEO.
– Older indications of Google as an answer engine
We’re not necessarily going through an evolution. There are 15-year-old indications that Google was headed in this direction, for example:
- 2004: Dictionary feature allowing users to search for the meaning of words
- 2005: “Just the facts” blog post showing the first featured snippet or direct answer, where a query was answered with a textual response rather than ten blue links.
– Sergey Brin: patent for algorithm to understand facts and relationships between facts
Another indication that Google as an answer engine is nothing new is a patent by Sergey Brin on an algorithm to understand facts and relationships between facts. This patent included five books, their titles, their publishers, their authors, and so on.
The theory is that a bot would crawl the web searching for these books and–
[Interruption by OK Google]
– Audio watermarks
There’s also the concept of audio watermarks that take advantage of ultra-high frequency. They would fall outside of the range of human hearing, but dogs and computers would be able to identify them. This might allow different providers to track the fact that you’ve heard a watermarked commercial and might potentially be interested in the product.
This has been around for at least five years, and isn’t something that’s been discussed in SEO.
“There is a lot of misinformation about topics such as RankBrain, Neural Matching, and Machine Learning on the Web. Some of it includes carefully researched facts mixed with misinformation, so be careful about what you rely upon.”
SEO in Orbit went to space
If you missed our voyage to space on June 27th, catch it here and discover all of the tips we sent into space.