DataLakes & DataWarehouses: How they are used in SEO

February 16, 2021 - 17  min reading time - by Xavier Stevens
Accueil > Technical SEO > DataLakes & DataWarehouses

Although the concepts of DataWarehouses and DataLakes became part of the everyday language of Data Analysts and Data Scientists a long time ago, we have only been hearing about them in other industries for the past few years.
For example, Web Analysts and SEO Experts are beginning to take a serious look at these concepts, due to the nature of their jobs and the strong connection that exists between what they do and data manipulation. Many recent articles talk about the interest of implementing an SEO DataLake or an SEO DataWarehouse, treating the two terms as interchangeable and without making a distinction between the two.

In this article, we will guide you in determining the differences between DataLakes and DataWarehouses in order to understand their purposes and their use cases in SEO and web analytics.

DataWarehouse: structured storehouse for data

The first use of the term “DataWarehouse” dates back to 1988 in a paper by Paul Murphy and Barry Delvin, An architecture for a business and information systems. This article gives us a first definition of the concept as an easy-to-access, relational database environment, bringing together all the business data useful for strategic decision-making.

What does a DataWarehouse contain?

The DataWarehouse is used to gather the business data useful for strategic decision-making for the company in a single place. We are talking about business data that can cover anything from customer data, to inventory information, to conversions on a commercial website or organic visits (from a search engine such as Google for example).

It is commonly accepted that the data sent to a DataWarehouse are structured, pre-processed data used to offload operational databases, which ultimately allows these operational databases to be solicited as little as possible for query purposes.
The main goal of a DataWarehouse and those who administer it is to compile data from various, heterogeneous sources (both internal and external) in order to standardize it so that the various sources can communicate with each other. The end goal is to use this data to carry out analyses, reporting, decision-making support, etc.

Who are the daily users of a DataWarehouse?

Because of the nature of the DataWarehouse and the format and type of data it contains, it is an ideal playground for Data and Web Analysts.
Data Analysts works alongside the DataWarehouse administrator (or administration team). They define the business needs and use cases. They identify the data sources and the actions needed to process the data upstream. These data will then be used by the Data Analysts at the end of the chain.

How do users communicate with a DataWarehouse?

Once the data sources have been identified, and the data processed, ingested and linked in the DataWarehouse, the Data Analyst can use this data in analyses and to create new data combinations. This process can be used to maintain reporting dashboards, alerting dashboards, etc.

The most commonly used programming language for querying in a DataWarehouse is SQL (or SQL-like languages). SQL allows Data Analysts to manipulate and process data in order to meet business needs: monitoring, strategic decision-making, etc.

What use cases and project types do DataWarehouses serve?

It is impossible to draw up an exhaustive list of use cases involving the use of a DataWarehouse. However, here are a few examples of projects on which a Data Analyst is likely to work:

Improvement of a DataWarehouse:
This type of project is often encountered when setting up a DataWarehouse, but also when a new need or business use case is identified.
It is a question here of adding new data to a DWH (again, this can be internal or external data).
In this case we often speak of an ETL (Extraction-Transformation-Loading) process:

  • Extraction:
    A first step consisting in identifying and collecting data from the various sources needed for further operations.
  • Transformation:
    This second step is very important, because without adjustment, without standardization, it is generally impossible to use new data and to make them communicate with those already existing in the DWH.
    It is therefore a phase of necessary standardization which can sometimes be complicated by the rigidity imposed by the DWH in terms of formatting and table schema.
  • Loading:
    Phase of ingestion of the data processed (and thus structured) in the DWH.

Realization of statistical analyses:
This is a very frequent use of DWHs. The goal may be to prove X or Y through the data, to produce statistics based on the available historical data, or to establish causal links to explain a finding, etc.
Reporting & alerting:
This is, once again, a very frequent use case. In fact, as the data in a DWH are highly structured and formatted (sharing a fixed and predefined schema), it is all suitable for pushing data to reporting or alerting dashboards.

This is a recurring request from top management, who need to be able to monitor operational teams and the health of results, sales, etc. in the simplest and quickest way possible.

If we summarize all of these, we have more or less 2 types of projects: data acquisition and integration projects (which can also be compared to a form of data storage and historization) and data analysis and evaluation projects (through monitoring/dashboarding and alerting).

The concept of a DWH has been present in the everyday language of those who work with data for a long time. How it works and its numerous use cases have long since been confirmed, and DWHs can be found in many companies of varying maturity where questions of data management are concerned.

This is less the case for the concept of DataLakes, which is much younger and much less widespread.

Oncrawl Data³

Expand your analysis with seamless connections to additional datasets. Analyze your SEO strategy based on data on backlinks, SEO traffic, rankings, and custom datasets from your CRM, monitoring solution, or any other source.

DataLake: lake of megadata (BigData)

The origin of this concept is attributed to James Dixon, CTO of Penthao, who defines it as a solution for storing and exploiting large volumes of data, without pre-processing and without necessarily a specific use case… Unlike DWHs, which are very much oriented towards immediate activation.
The DL tries to fill the gap, which is more and more important with the emergence of BigData, of what to do with all this mass of data that we are able to collect today and how to take advantage of it.

What does a DataLake contain?

I’ll start by quoting James Dixon who uses a very evocative comparison, serving both as an explanation for the “lake” name of his concept and as a differentiation with the DWH:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

This quote perfectly illustrates the difference between the type of data contained in a DWH, which is structured and organized in tables with precise, fixed patterns, and the type of data contained in a DataLake, which is raw, without prior processing, available to take samples from as needed, whether exploratory or not.

Where a DWH is restricted to accommodate structured data, the DataLake is made to store all kinds of raw data (structured or not). A debate between Tamara Dull (Amazon Web Service) and Anne Buff (Microsoft SAS) gives us a slightly more concrete vision of the content of a DataLake:

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

Who are the daily users of DataLakes?

Where a Data Analyst was perfectly suited to work with the structured data contained in a DHW, raw data is instead the specialty of Data Scientists, who are often better equipped to manipulate this type of data.
This change in data profile and main user also results in different programming languages and use cases.

What use cases and project types do DataLakes serve?

Due to its unstructured nature and the considerable volume of data that a DataLake can contain, the use cases can be very different from those previously found in the DWH framework, for example:

  • The implementation of machine learning algorithms to create added value for BigData:
    We often talk here about predictive analysis, based on machine learning algorithms exploiting all kinds of data.
    To take a more concrete example, let’s imagine that a company in the financial sector (banking & insurance) wants to determine the probability that a financial transaction X is fraudulent. This may call upon Data Scientists, capable of creating machine learning algorithms that will train on the astronomical amount of data contained in the DataLake (amount, date, frequency, usual profile of transactions carried out by the account owner, etc.). The goal is to carry out a predictive study that will be used to identify potentially fraudulent transactions and thus allow the company to reduce its reaction time in detecting them and ultimately avoid big losses for them and their customers.
    This is a simple example that is regularly used to illustrate the interest and added value of machine learning, but there are as many others, as you might imagine.
  • DataLakes as a data source for a DataWarehouse:
    Very simply, a DataLake can act as a transit zone between your various internal and external data sources and your DWH. The very principle of a DataLake is to centralize all kinds of data, structured or unstructured, in order to carry out predictive studies via ML, or for extraction as samples for analysis. The DWH therefore seems very suitable for this second category of project and benefits from a DataLake as a potential source (provided that the DataLake data is imported in a structured way via pre-processing, if necessary).
  • From DataLake to BI (Business Intelligence) software:
    We can see this as a use similar to the one we saw with DataWarehouses, thought there are certain specificities to using a DataLake for this purpose. A DataLake will allow you to make slightly more exotic visualizations (due to the variety of data it contains), via tools such as Tableau, Qlikview, Google Data Studio, Microstrategy, etc.

How do users communicate with a DataLake?

Given the use cases and users (Data Scientists), we will very often find programming languages such as Python, Java, R, Scala, etc…
For the most part, these languages have been present in the field of data science for a long time.

A DataLake is therefore a tool for managing BigData. It relies on the massive storage of raw data for advanced analysis and visualization purposes, thus allowing the enhancement of data that had not previously been put to much use.

To summarize, here is a table of the differentiating elements established since the beginning of this article:

DataWarehouseDataLake
Type of dataStructured, pre-processed data, organized in tables with defined schemasRaw data, stored in a structured or unstructured manner
UsersData Analysts, Web AnalystsData Scientists
(sometimes Data Analysts)
Data volumeSmall – Large
(Depending on need and use case)
Potentially very large
(Big Data)
Programming language usedSQL or SQL-likePython, R, Java, Scala, among others
Type of projectAnalytical & statistical projects, Reporting, Alerting, ELT (export, transform, load) type projects, some predictive and data-driven analysisPredictive analysis, machine learning, transit zone between data sources and DWH, advanced visualization – BI, data-driven analysis

Predictive analysis, machine learning, transit zone between data sources and DWH, advanced visualization – BI, data-driven analysis

It is these differences that make these two concepts complementary tools. In many cases, depending on the maturity of a company’s governance and data management, they may rely on a combination of these two tools.
A DWH is mainly used for traditional reporting and analysis, whereas a DataLake serves as a data source before reaching its full potential as the company approaches maturity on data subjects.

In my opinion, DataLakes are more a response to the new data issues of the 21st century, especially with the emergence of BigData and the increasing capacity of companies to collect data, than a replacement for DWHs, as some might think.
Both have their advantages, disadvantages, strengths and weaknesses. The best way to make the most of both is still to use both together to be able to deal with any eventuality and to address a wider variety of needs.

Now that we’ve clearly defined the concepts, we will finally focus on the use of DataWarehouses and DataLakes for marketing and more specifically for SEO (even if in many cases, what is true for the former will be true for the latter, and vice versa).

DataWarehouse and DataLake SEO

We will talk here of a DataWarehouse or a DataLake (or both) where at least part of the data present can be used for SEO use cases.

Why associate DataLakes and DataWarehouses with Marketing and SEO?

SEO (and, more generally, marketing) has already taken a very marked turn towards data in recent years. More and more tasks require the use of various data sources:

  • Analytical data (Google Analytics, AT internet, etc.)
  • Performance data (Google Search Console, Analytics)
  • Log data, a very large data “source” for some sites, which requires a high update frequency and a large storage capacity.
  • Netlinking data (Majestic, Ahrefs, Babbar)
  • Positioning data (SEMRush, Monitorank, etc.)
  • Crawl data (OnCrawl, etc.)
  • Sometimes business/industry data as well

To this list we should also add the use of APIs of tools such as Search Console, Majestic, Google Analytics for example, which naturally pushes us towards the kind of solutions described earlier in this article.
It is this strong connection between SEO and Data that pushes more and more Web Analysts and SEO Experts to learn about new ways to organize their data pipeline.

However, the drivers for this transition are not only about the potential and interconnectedness of SEO and Data. Many everyday use cases resonate with the project types listed above for DWHs and DLs.

The use cases of an SEO DataWarehouse or an SEO DataLake.

I will first start from pain points commonly encountered by SEO Experts before explaining how the use of a DataLake or a DataWarehouse is an answer to be considered when addressing them.
Among the main pain points, the following stand out:

  • The multiplication of Excel files (the loose-leaf paper of our decade) and the associated copy-and-pasting:
    For many SEO this is still the norm, but let’s be honest, it is both time-consuming, constraining and very conducive to human error.For this, a DataWarehouse is a perfect solution. DataWarehouses not only allow all the KPIs required to perform this or that audits/analyses to be gathered from the various available data sources, but also allow the processing that is required to achieve the expected result to be automated.
    As a DataWarehouse is built, more and more use cases are identified and more and more problems are solved, leading to increasingly significant time savings over time.
  • Capacity limits (as a reminder, Excel can only open an entire file if it does not exceed 1,048,576 lines. This seems like a lot, but is not actually that much in today’s volumes):There’s not really any particular use case here, because in general, both DataLakes and DataWarehouses don’t suffer from this kind of limit. They both offer the means to request large volumes of data for any type of need. For this specific case, it is important to keep in mind that, depending on the need, one or the other will allow you to free yourself from capacity limits and, ultimately, to address these situations more easily.
  • Respond to a need for data historization
    Spoiler: one of the use cases can be, for example, to save a history of the data from Google Search Console in an SEO DataWarehouse, rather than copy and page its data in a Google Sheets every week to maintain a Data Studio dashboard.In my opinion, we have here one of the most common use cases among SEO Experts, whether in agencies or in-house: data historization. Indeed, many SEO Analysts look at historical data and draw conclusions from it.
    The example that may have come to your mind directly is the case of Google Search Console. It only provides access to 16 months of history today (even via API). And if a manual backlog remains possible via exports to be pasted into Google Sheets every week (or other obscure methods), it is a considerable waste of time in addition to being painful and boring.
    That’s a good thing because it’s a relatively simple problem to address with a DataWarehouse. All you have to do is set up an automatic connection to the Google Search Console API, define the various possible pre-processing and data combinations needed to obtain data with real added value, and finally, automate the API calls.
  • The desire to take analyses further, to merge or “cross-analyze” crawl data, audience data, logs, etc. in an industrialized way.
    Because a small competitive advantage never hurts.The descriptions we have given of a DataWarehouse and a DataLake speak for themselves here. One of the primary aims of both tools is to open up new possibilities for analysis, through data collection and cross-analysis and/or machine learning.
    To cite only one very representative example; the use of machine learning algorithms such as Random Forest or XG-Boost to make ranking predictions on Google.
    Very simply, the idea is to train an algorithm on a large number of Google SERPs (result pages) and all the SEO metrics harvestable for these SERPs in order to determine, based on those same metrics, the ranking potential of a given URL (and therefore, even more particularly, to determine the most important metrics to rank in a particular sector/theme).
    → You will find the complete methodology in the article by Vincent Terrasi, Product Director at Oncrawl, “Successfully predicting Google rankings at the cutting edge of data science”, 2018.
  • The desire to automate reporting as much as possible, in order to focus on high value-added tasks.Again, this falls literally within the classic use cases of a DataWarehouse. It offers the possibility of automating the entire recovery and processing of the various data sources, and it perfectly addresses this pain point. Once set up, a table will be automatically fed into the DWH and can be used as a connection to BI software for dashboarding, whether for monitoring, alerting, etc.Of course, automation does not stop at reporting projects alone. Both a DWH and a DL can be used for many automated SEO optimizations. For example, dynamic updates to internal link blocks on rank, on crawl budget, on SEO audience, etc. (all data contained in the DWH).
  • The desire to put an end once and for all to security concerns (we know who did what and where to find them) and avoid spending time on maintenance.We end here on a more process-oriented aspect than a use case, strictly speaking.
    Both DataLakes and DataWarehouses imply the implementation of particular processes which can be presented in the following simplified way:

    • The starting point is an observation that is broken down into a statement of needs (business team / SEO – Data Analyst).
    • Then, this is transformed into a more technical specification that will allow the team administering the tool to understand what needs to be done and how it needs to be done.
    • This same administration team carries out the request.
    • The business team and Data Analysts produce a procedural use-case for the work carried out.
    • There is an on-going process in which the two ends of the chain (business team and administration team of the DataWarehouse or DataLake) make sure that nothing changes in terms of input and output.
      This is particularly the case for a DWH, which will reject any data that is not part of the structure (the pre-defined schema).

Again, this is a non-exhaustive list of pain points and possible use cases for DataWarehouse – DataLake SEO. Limits are encountered more through the lack of imagination of those who use them than in the tools themselves.

Choosing a DataWarehouse or DataLake for your SEO uses

To conclude, contrary to what you may often hear or read, DataWarehouses and DataLakes are separate structures for data storage and collection, and are not incompatible. There is no need to choose one over the other, quite the opposite. Both have different use cases and there are even some adhesions.

The case of SEO is a telling example, and reinforces the need for DataWarehouses and DataLakes in general. Data is omnipresent in SEO: we have to manipulate huge amounts of data from different sources. So it’s not surprising that we talk about DataWarehouses and DataLakes in this context. We can imagine a lot of use cases of DataWarehouses or DataLakes in SEO, whether it is for automation purposes, to perform “augmented” analysis through data, or simply to solve recurring issues (pain points).

Xavier Stevens is a consultant at Clustaar agency, specialized in data processing languages and tools to analyze and better understand the new trends in SEO.
Related subjects: