Data Lake
"Data Lake"
This glossary explains various keywords that will help you understand the mindset necessary for data utilization and successful DX.
This time, we will explain "data lakes," which serve as a foundation for data utilization.
What is a Data Lake?
A data lake is a data infrastructure (data repository) that can receive and store a wide variety of data generated from various systems, just as a lake stores water.
As the use of IT advances, a much wider variety of data must be handled than before: not only structured data that fits neatly into rows and columns, but also semi-structured data such as JSON and XML, unstructured data such as the text of emails, and, more recently, binary data such as images and videos used in machine learning (AI). The need to utilize all of these kinds of data continues to grow.
A data lake is a data infrastructure that allows data that is difficult to store directly in traditional databases or DWHs to be stored and used in its original form, just like a lake accepts water.
Why it's necessary: Differences from DWH (Data Warehouse)
As a "data storage platform" for data analysis, "DWH," a type of analytical database, has been popular and widely used before data lakes. First, we will explain why data lakes came into use when DWHs were already in use, and what the differences are.
"DWH" stores data in a predetermined format
When storing data in what is commonly called a "database" (RDB or DWH), before storing the data in the database, you need to "predefine" the format of the data to be stored.
For example, if you want to store data for a list of employees, you need to define in advance what format the data will be stored in, such as "Name (string)," "Employee number (integer)," and "Department (Department ID)," and then secure a "place to store the data," and then prepare the data to fit that format before storing it.
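This "define first, store later" flow can be sketched with SQLite standing in for an RDB/DWH (the table and column names below are illustrative, not from any particular system):

```python
import sqlite3

# Schema-on-write: the table's format must be declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        name        TEXT    NOT NULL,   -- Name (string)
        employee_no INTEGER NOT NULL,   -- Employee number (integer)
        dept_id     INTEGER             -- Department (Department ID)
    )
""")

# Data must be shaped to fit the predefined columns before it can be inserted.
conn.execute("INSERT INTO employees VALUES (?, ?, ?)", ("Tanaka", 1001, 20))

# A record that violates the predefined schema is rejected outright.
try:
    conn.execute("INSERT INTO employees (name, employee_no) VALUES (NULL, 1002)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT name, employee_no FROM employees").fetchall()
```

The point is not SQLite itself but the workflow: the storage location and format exist before the data does, and anything that does not fit is turned away.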
Advantages and disadvantages of "DWH"
There are advantages to deciding the data format in advance. Data is always stored in a fixed format, and you can expect that "all data is organized," making it easier to utilize stored data.
However, in an age where a wide variety of data is generated every day, the fact that data cannot even be stored unless it is defined and organized in advance is a noticeable disadvantage. Even if you think, "I want to use this data right now," you first need to define the data schema (what format the data will be stored in), secure a place to store the data, and pre-process the data to comply with the rules before you can finally store the data. The data can only be used after all of this is complete.
In particular, in the cloud era, where large amounts of data are generated, data utilization stagnates when new data is constantly being produced and simply storing it is time-consuming. As a result, "data lakes," data infrastructures that can store data without having to organize it first, were developed and are now widely used.
Characteristics of a data lake and the situations in which it is useful
Next, let's look at how these characteristics of a data lake can change how you use data.
As a storage location for "raw data"
Compared to DWH, data lakes have the advantage of being able to easily store data as it arrives from its source, or in other words, "raw data."
"Preprocessing data" means modifying the original data. It also means discarding some of the information contained in the original data. You can always preprocess data and discard some of the information depending on your purpose, but you cannot restore the information that has been discarded.
If you treat data as an asset and value it, you should be prepared for unexpected data needs later. To fully utilize the potential of data, it is safer to "leave" the initial "raw data" with all information intact. A data lake can be used as a means to store raw data as is.
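As a toy illustration of this point (the log format and field names are invented): once a raw record is reduced to only the fields one analysis needs, the discarded parts cannot be recovered from the processed result.

```python
raw_log = '2024-05-01T09:30:00Z user=alice action=login src=10.0.0.5'

def preprocess(line: str) -> dict:
    """Preprocessing for one specific purpose: keep only user and action."""
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    return {"user": fields["user"], "action": fields["action"]}

processed = preprocess(raw_log)

# The timestamp and source IP are gone: nothing can rebuild raw_log from
# `processed`. Keeping raw_log as-is in a lake preserves that option.
```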
Lowering the barriers to data utilization
For example, suppose your company is trying to create a data infrastructure that collects internal data. Getting such an initiative started is difficult, and establishing it across the company is not easy. If people on the front lines are willing to try something, even something as simple as collecting data about their own work, that willingness should be welcomed.
In such a situation, what would happen if you said, "You must format the data in a specified way before putting it into the DWH," or, "Data that is not formatted according to the rules cannot be stored," or, "It cannot be used unless it is stored"? Wouldn't that mean fewer people willing to cooperate? Problems could arise, such as a lot of data being discarded, or pre-processing taking too long and the data losing its freshness.
If we consider organizing data to be a means, not an end, then a data infrastructure that can simply store data can be beneficial for promoting data utilization.
"Schema on Read" and "Schema on Write"
Let's say you want to use data for analysis. Whether you use a DWH or a data lake, when you actually start analyzing, the data needs to be prepared in a form suitable for the work (or to a degree that does not cause any problems). In other words, data preparation work will eventually be required.
From this perspective, a data lake postpones the task of aligning data with a schema, while a DWH makes that alignment mandatory from the start and retains only clean data. A data lake can thus be thought of as increasing your options: you can choose later how to approach and use the data.
DWH requires that you understand in advance how the data will be used, define the data schema, and then convert the data into the desired data format according to its intended use at the time of writing (storing). This feature of forcing the data schema at the time of writing is sometimes called "schema on write."
On the other hand, data lakes do not enforce the application of data schemas at the time of writing. This means that data can be stored without the need to understand in advance how it will be used, and without the hassle of preparatory work. The application of data schemas is done "after reading from the data lake" in order to use the data. This feature is sometimes called "schema on read."
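The contrast can be shown in a few lines of Python (the record layouts are invented for illustration): raw JSON strings go into the "lake" untouched, and a schema is applied only when a particular analysis reads them back.

```python
import json

# No schema is enforced at write time: heterogeneous records are stored verbatim.
lake = [
    '{"name": "Tanaka", "employee_no": 1001, "dept_id": 20}',
    '{"name": "Suzuki", "employee_no": "1002"}',   # missing dept_id, number as text
    '{"sensor": "temp-1", "value": 21.5}',         # a completely different shape
]

def read_employees(raw_records):
    """Schema-on-read: interpret records as employees at query time,
    skipping whatever does not fit the shape this analysis needs."""
    for raw in raw_records:
        rec = json.loads(raw)
        if "name" in rec and "employee_no" in rec:
            yield {
                "name": rec["name"],
                "employee_no": int(rec["employee_no"]),  # coerce on read
                "dept_id": rec.get("dept_id"),           # tolerate missing fields
            }

employees = list(read_employees(lake))
```

Note that the sensor record was stored without complaint; it is simply ignored by this particular read, and remains available for some other analysis later.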
How to achieve a data lake
So, what specific technologies are used to create a data lake? For a regular database (RDB), you can probably picture widely known database engines and cloud services such as PostgreSQL, but the term "data lake" may not bring any product names to mind. In fact, a data lake can be built using well-known technologies.
Hadoop as a foundation for data lakes
First, you can build a data lake based on Hadoop, which enjoyed a huge boom around 2010. Hearing "Hadoop," many people think of "big data," but today it is also used as a foundation for data lakes.
Hadoop was born in the cloud era, when "unprecedented amounts of diverse data" were being generated. It was difficult (at the time) to process such data using traditional technologies such as RDBs and DWHs, and so it was developed in response to the need for a new era of data infrastructure.
In a situation where new data is constantly being generated, a data infrastructure that assumes that data must be prepared in advance is not sufficient. Therefore, a data infrastructure that can store a wide variety of data as is and can store and process even huge amounts of data was developed. Currently, Hadoop-based products are sometimes used to build data lakes.
Object storage such as Amazon S3
Additionally, various cloud services offer services suitable as the foundation for a data lake, such as object storage services like Amazon S3.
Compared to services that let you run databases in the cloud (such as Amazon RDS), it may be harder to picture what Amazon S3 (object storage) is for. However, because it can store virtually any kind of data, store it reliably, and make it conveniently available to most other cloud services once saved, it serves as an important data foundation across cloud services.
It can safely store any type of data, even in very large volumes, and is designed to be readable and writable from most AWS services, so "just store it in Amazon S3" has become the go-to pattern for handling data.
Object storage was not created with data lakes in mind, but it provides the properties a data lake needs, so data lakes are often built on services such as Amazon S3.
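One common (but by no means mandated) convention is to organize a lake on object storage purely through key naming, with zones such as "raw" and "processed" and date partitions. The bucket, prefix, and file names below are hypothetical:

```python
from datetime import date

def lake_key(zone: str, source: str, dt: date, filename: str) -> str:
    """Build an object key like 'raw/source=crm/dt=2024-05-01/events.json'.
    Object stores have no real directories; the '/' hierarchy is only a
    naming convention, which is exactly why any kind of data can be
    dropped in without defining anything in advance."""
    return f"{zone}/source={source}/dt={dt.isoformat()}/{filename}"

key = lake_key("raw", "crm", date(2024, 5, 1), "events.json")
# With boto3, this key would be used unchanged when writing the object, e.g.:
#   s3.put_object(Bucket="my-data-lake", Key=key, Body=payload)
```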
Data Swamp
So far, I have written about the significance and advantages of data lakes, but data lakes are sometimes criticized as a concept that leads to a "data swamp."
The criticism goes like this: as a result of freely accepting data, the data ends up stored in a disorganized way, and it becomes difficult to know what data is where. Such a state is a "swamp, not a lake," hence the jab that "the reality of data lakes is a data swamp."
Because such situations are less likely with a DWH, where data is prepared in advance, this phrase is sometimes used as a position statement ("that's why data lakes are no good, and you should use a DWH"), but it is also used simply to warn that data lakes must be managed with care.
Data catalog development
To prevent the data lake from becoming a "data swamp," care must be taken to manage and store the data. Furthermore, it is desirable to establish a system for "visualizing" the data in the data lake, such as making it possible to see what data is stored there (data catalog) and how the data arrived (data lineage).
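Even without a dedicated catalog product, the idea can be sketched as metadata kept alongside each dataset; all field names and values here are illustrative:

```python
# A minimal catalog: for each dataset in the lake, record what it is
# (catalog) and how it arrived (lineage), so the lake stays searchable
# instead of turning into a swamp.
catalog = {
    "raw/source=crm/dt=2024-05-01/events.json": {
        "description": "Raw CRM event export",
        "owner": "sales-ops",
        "format": "json-lines",
        "lineage": ["crm-system nightly export"],
    },
}

def find_datasets(catalog: dict, keyword: str) -> list:
    """Answer 'what data is stored where?' by searching descriptions."""
    return [
        key for key, meta in catalog.items()
        if keyword.lower() in meta["description"].lower()
    ]

hits = find_datasets(catalog, "crm")
```

Real data catalog services (and data lineage tools) automate exactly this kind of bookkeeping, but the underlying record-per-dataset structure is the same.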
Often lacks the ability to analyze and aggregate data
Data lakes also tend to lack the ability to search and aggregate the stored data. Many implementations lack full SQL support, so flexible queries are unavailable and search capabilities are limited. Even when searches are possible, they can be slow and require long processing times.
Combined use of DWH and data lake
In other words, data lakes are not necessarily superior to DWHs.
Use in combination
As explained above, DWH and data lakes each have their advantages and disadvantages. It is advisable to use them separately or in combination depending on the purpose. For example,
- Data is first stored in a "data lake" as a place to store raw data as is.
- Preprocess the raw data stored in the data lake and feed it into the DWH
- Data analysis is performed on "data stored in DWH" using BI tools.
This allows you to take advantage of the "data lake's ability to receive data" and the "DWH's ability to handle data efficiently." You can further refine it depending on your company's circumstances and needs, for example, to create a system like the one below.
- Data Lake: A place to store the original raw data as is
- Data Lake: A place to organize and store data that would otherwise be unusable
- DWH: A data repository suitable for analysis
- DWH: A place to store data for analysis
- BI tools and machine learning: Refer to DWH and data lakes as needed
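This division of labor can be sketched end to end, with a plain list standing in for the lake and SQLite standing in for the DWH (all record shapes and names are illustrative):

```python
import json
import sqlite3

# 1. Lake: raw records land as-is, whatever their shape.
lake = [
    '{"item": "apple", "qty": "3", "price": 120}',
    '{"item": "pear", "qty": 2, "price": 180}',
    '{"note": "not a sale"}',                    # accepted into the lake anyway
]

# 2. Preprocess the raw lake data and feed it into the "DWH".
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (item TEXT, qty INTEGER, price INTEGER)")
for raw in lake:
    rec = json.loads(raw)
    if {"item", "qty", "price"} <= rec.keys():   # only records that fit the schema
        dwh.execute(
            "INSERT INTO sales VALUES (?, ?, ?)",
            (rec["item"], int(rec["qty"]), int(rec["price"])),
        )

# 3. Analysis (what a BI tool would do) runs against the organized DWH side.
total = dwh.execute("SELECT SUM(qty * price) FROM sales").fetchone()[0]
```

The lake keeps everything, including the record that did not fit; the DWH holds only clean rows, which is what makes fast SQL aggregation possible.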
To effectively utilize a data lake, "data integration" is necessary
To begin with, data lakes didn't exist a generation ago, and the state of data-related technology is sure to continue to change. The state of your company's IT systems and the data you hold will also change. Furthermore, as your company's business situation changes, your data situation and needs will also change.
If that's the case, then the "ideal form" of your company's data infrastructure will likely continue to change. If there is no "correct answer" that will last into the future, then being able to freely combine and use DWHs and data lakes as needed to suit the situation and continuously respond to future changes in the technological landscape and needs will be a better way to prepare for the future.
DWH and data lakes require "free data integration" with external data and systems
Furthermore, just as the practical difficulties of using a DWH revealed the challenges and needs around data integration through ETL, using a data lake itself creates a need for data integration methods that remove the hassle of connecting to external data and systems.
So, in order to make effective use of data lakes, what is required of data integration?
- Ways to bring data into the data lake: data exists in a variety of formats across various systems, both internally and in the cloud.
- Ways to process data on the data lake: data is often not preprocessed, so it may need to be processed, such as by formatting it, before it can be used.
- Ways to extract accumulated data and use it externally: data must be made available to external systems, for example by extracting it from the data lake and feeding it into a DWH. Multiple data lakes may also be used in combination.
Therefore, you need a data integration tool with the following characteristics.
Support for a wide variety of data formats
If a data lake tool could only handle organized data with rows and columns, what would distinguish it from a DWH? It would defeat the purpose of introducing a data lake. The tool needs to be able to handle a far wider variety of data formats.
Can be connected to various systems and data
There are many different types of data in many different places, and even different products and services for data lakes themselves. You should be able to use them however you need.
Sufficiently high processing performance
Data lakes were born out of the big data boom. They need to be able to quickly connect and process large amounts of data. Simple, convenient tools for connecting data can sometimes fail to provide practical performance.
High data processing capabilities
Data in a data lake is often not prepared in advance. It is desirable to have a way to process the data as needed depending on how it is used. If the only function available is to simply transfer data from right to left, it may not be possible to do what is needed.
No-code/low-code (business sites can use it themselves)
Manually integrating and processing data every time something comes up takes too much time and effort, and documenting requests and commissioning system development each time is also too slow. Moreover, you often don't know what data utilization requires until you actually try it, so fully analyzing the requirements for an integration system in advance is not realistic.
If this is the case, it is necessary to be able to quickly change and implement ways of using data, with the data being used on-site at the forefront.
With no-code or low-code tools that allow data integration to be developed freely using only a GUI, these needs can be resolved quickly on the ground, and data utilization can be promoted efficiently.
Related keywords (for further understanding)
- DWH
- A database for storing data to be analyzed. It has specialized performance for analysis, and is often suited to storing large amounts of data and executing analytical processing.
- ETL
- ETL (Extract, Transform, Load) refers to extracting data, transforming it, and loading it into a destination. In the recent push toward data utilization, the majority of the work is not the data analysis itself, but rather this collection and preprocessing of data scattered in various places, from on-premise to cloud.
- Object Storage
- iPaaS
- A cloud service that "connects" various clouds with external systems and data simply by operating on a GUI is called iPaaS.
- No-code/Low-code
Related Products
When introducing a data lake, it can be difficult to make progress in utilizing data because you don't know where and in what form the diverse data is stored. A data catalog is a way to "see" what data is stored where. If you're interested, please see below.
DataSpider trial version and free online seminar
"DataSpider," data integration tool developed and sold by our company, also has ETL functions and is data integration tool with a proven track record as a means of supporting the utilization of DWH.
Unlike regular programming, development can be done using only the GUI (no-code), without writing any code, and it offers "high development productivity," "full-fledged performance that can serve as the foundation for business (professional use)," and "ease of use that can be used by those in the field (even non-programmers can use it)."
It can smoothly solve the problem of "connecting disparate systems and data," which is hindering not only data utilization but also the successful utilization of various IT technologies such as cloud computing.
We offer a free trial version and hold online seminars where you can try out the software for free, so we hope you will give it a try.