Data Lake
"Data Lake"
This glossary explains various keywords that will help you understand the mindset necessary for data utilization and successful DX.
This time, we will explain "data lakes," which serve as a foundation for data utilization.
What is a Data Lake?
A data lake is a data infrastructure (data repository) that can receive and store a wide variety of data generated from various systems, just as a lake stores water.
As the use of IT advances, a much wider variety of data must be handled than before: not only structured data that fits neatly into rows and columns, but also semi-structured data such as JSON and XML, unstructured data such as the text of emails, and, more recently, binary data such as images and videos used in machine learning (AI). The need to utilize all of these kinds of data continues to grow.
A data lake is a data infrastructure that allows data that is difficult to store directly in traditional databases or DWHs to be stored and used in its original form, just like a lake accepts water.
Why it's necessary: Differences from DWH (Data Warehouse)
As a "data storage platform" for data analysis, "DWH," a type of analytical database, has been popular and widely used before data lakes. First, we will explain why data lakes came into use when DWHs were already in use, and what the differences are.
"DWH" stores data in a predetermined format
When storing data in what is commonly called a "database" (RDB or DWH), before storing the data in the database, you need to "predefine" the format of the data to be stored.
For example, if you want to store data for a list of employees, you need to define in advance what format the data will be stored in, such as "Name (string)," "Employee number (integer)," and "Department (Department ID)," and then secure a "place to store the data," and then prepare the data to fit that format before storing it.
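This "define first, store later" flow can be sketched with SQLite standing in for an RDB/DWH (the table and column names below are illustrative, not from any particular system):

```python
import sqlite3

# Schema-on-write: the table's format must be declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        name        TEXT    NOT NULL,   -- Name (string)
        employee_no INTEGER NOT NULL,   -- Employee number (integer)
        dept_id     INTEGER             -- Department (Department ID)
    )
""")

# Data must be shaped to fit the predefined columns before it can be inserted.
conn.execute("INSERT INTO employees VALUES (?, ?, ?)", ("Tanaka", 1001, 20))

# A record that violates the predefined schema is rejected outright.
try:
    conn.execute("INSERT INTO employees (name, employee_no) VALUES (NULL, 1002)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT name, employee_no FROM employees").fetchall()
```

The point is not SQLite itself but the workflow: the storage location and format exist before the data does, and anything that does not fit is turned away.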
Advantages and disadvantages of "DWH"
There are advantages to deciding the data format in advance. Data is always stored in a fixed format, and you can expect that "all data is organized," making it easier to utilize stored data.
However, in an age where a wide variety of data is generated every day, the fact that data cannot even be stored unless it is defined and organized in advance is a noticeable disadvantage. Even if you think, "I want to use this data right now," you first need to define the data schema (what format the data will be stored in), secure a place to store the data, and pre-process the data to comply with the rules before you can finally store the data. The data can only be used after all of this is complete.
In particular, in the cloud era, where large amounts of data are generated, data utilization stagnates when new data is constantly being produced and simply storing it is time-consuming. As a result, "data lakes," data infrastructures that can store data without having to organize it first, were developed and are now widely used.
Characteristics of a data lake and the situations in which it is useful
Next, let's look at how these characteristics of a data lake can change how you use data.
As a storage location for "raw data"
Compared to DWH, data lakes have the advantage of being able to easily store data as it arrives from its source, or in other words, "raw data."
"Preprocessing data" means modifying the original data. It also means discarding some of the information contained in the original data. You can always preprocess data and discard some of the information depending on your purpose, but you cannot restore the information that has been discarded.
If you treat data as an asset and value it, you should be prepared for unexpected data needs later. To fully utilize the potential of data, it is safer to "leave" the initial "raw data" with all information intact. A data lake can be used as a means to store raw data as is.
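As a toy illustration of this point (the log format and field names are invented): once a raw record is reduced to only the fields one analysis needs, the discarded parts cannot be recovered from the processed result.

```python
raw_log = '2024-05-01T09:30:00Z user=alice action=login src=10.0.0.5'

def preprocess(line: str) -> dict:
    """Preprocessing for one specific purpose: keep only user and action."""
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    return {"user": fields["user"], "action": fields["action"]}

processed = preprocess(raw_log)

# The timestamp and source IP are gone: nothing can rebuild raw_log from
# `processed`. Keeping raw_log as-is in a lake preserves that option.
```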
Lowering the barriers to data utilization
For example, suppose your company is trying to create a data infrastructure that collects internal data. Getting such an initiative started is difficult, and establishing it across the company is not easy. If people on the front lines are willing to try something, even something as simple as collecting data about their own work, that willingness should be welcomed.
In such a situation, what would happen if you said, "You must format the data in a specified way before putting it into the DWH," or, "Data that is not formatted according to the rules cannot be stored," or, "It cannot be used unless it is stored"? Wouldn't that mean fewer people willing to cooperate? Problems could arise, such as a lot of data being discarded, or pre-processing taking too long and the data losing its freshness.
If we consider organizing data to be a means, not an end, then a data infrastructure that can simply store data can be beneficial for promoting data utilization.
"Schema on Read" and "Schema on Write"
Let's say you want to use data for analysis. Whether you use a DWH or a data lake, when you actually start analyzing, the data needs to be prepared in a form suitable for the work (or to a degree that does not cause any problems). In other words, data preparation work will eventually be required.
From this perspective, a data lake postpones the task of aligning data with a schema, while a DWH makes that alignment mandatory from the start and retains only clean data. A data lake can thus be thought of as increasing your options: you can choose later how to approach and use the data.
DWH requires that you understand in advance how the data will be used, define the data schema, and then convert the data into the desired data format according to its intended use at the time of writing (storing). This feature of forcing the data schema at the time of writing is sometimes called "schema on write."
On the other hand, data lakes do not enforce the application of data schemas at the time of writing. This means that data can be stored without the need to understand in advance how it will be used, and without the hassle of preparatory work. The application of data schemas is done "after reading from the data lake" in order to use the data. This feature is sometimes called "schema on read."
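The contrast can be shown in a few lines of Python (the record layouts are invented for illustration): raw JSON strings go into the "lake" untouched, and a schema is applied only when a particular analysis reads them back.

```python
import json

# No schema is enforced at write time: heterogeneous records are stored verbatim.
lake = [
    '{"name": "Tanaka", "employee_no": 1001, "dept_id": 20}',
    '{"name": "Suzuki", "employee_no": "1002"}',   # missing dept_id, number as text
    '{"sensor": "temp-1", "value": 21.5}',         # a completely different shape
]

def read_employees(raw_records):
    """Schema-on-read: interpret records as employees at query time,
    skipping whatever does not fit the shape this analysis needs."""
    for raw in raw_records:
        rec = json.loads(raw)
        if "name" in rec and "employee_no" in rec:
            yield {
                "name": rec["name"],
                "employee_no": int(rec["employee_no"]),  # coerce on read
                "dept_id": rec.get("dept_id"),           # tolerate missing fields
            }

employees = list(read_employees(lake))
```

Note that the sensor record was stored without complaint; it is simply ignored by this particular read, and remains available for some other analysis later.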
How to achieve a data lake
So, what specific technologies are used to create a data lake? For a regular database (RDB), you can probably picture widely known database engines and cloud services such as PostgreSQL, but the term "data lake" may not bring any product names to mind. In fact, a data lake can be built using well-known technologies.
Hadoop as a foundation for data lakes
First, you can build a data lake based on Hadoop, which enjoyed a huge boom around 2010. Hearing "Hadoop," many people think of "big data," but today it is also used as a foundation for data lakes.
Hadoop was born in the cloud era, when "unprecedented amounts of diverse data" were being generated. It was difficult (at the time) to process such data using traditional technologies such as RDBs and DWHs, and so it was developed in response to the need for a new era of data infrastructure.
In a situation where new data is constantly being generated, a data infrastructure that assumes that data must be prepared in advance is not sufficient. Therefore, a data infrastructure that can store a wide variety of data as is and can store and process even huge amounts of data was developed. Currently, Hadoop-based products are sometimes used to build data lakes.
Object storage such as Amazon S3
Additionally, various cloud services offer services suitable as the foundation for a data lake, such as object storage services like Amazon S3.
Compared to services that let you run databases in the cloud (such as Amazon RDS), it may be harder to picture what Amazon S3 (object storage) is for. However, because it can store virtually any kind of data, store it reliably, and make it conveniently available to most other cloud services once saved, it serves as an important data foundation across cloud services.
It can safely store any type of data, even in very large volumes, and is designed to be readable and writable from most AWS services, so "just store it in Amazon S3" has become the go-to pattern for handling data.
Object storage was not created with data lakes in mind, but it provides the properties a data lake needs, so data lakes are often built on services such as Amazon S3.
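One common (but by no means mandated) convention is to organize a lake on object storage purely through key naming, with zones such as "raw" and "processed" and date partitions. The bucket, prefix, and file names below are hypothetical:

```python
from datetime import date

def lake_key(zone: str, source: str, dt: date, filename: str) -> str:
    """Build an object key like 'raw/source=crm/dt=2024-05-01/events.json'.
    Object stores have no real directories; the '/' hierarchy is only a
    naming convention, which is exactly why any kind of data can be
    dropped in without defining anything in advance."""
    return f"{zone}/source={source}/dt={dt.isoformat()}/{filename}"

key = lake_key("raw", "crm", date(2024, 5, 1), "events.json")
# With boto3, this key would be used unchanged when writing the object, e.g.:
#   s3.put_object(Bucket="my-data-lake", Key=key, Body=payload)
```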
Data Swamp
So far, I have written about the significance and advantages of data lakes, but data lakes are sometimes criticized as a concept that leads to a "data swamp."
The criticism goes like this: as a result of freely accepting data, the data ends up stored in a disorganized way, and it becomes difficult to know what data is where. Such a state is a "swamp, not a lake," hence the jab that "the reality of data lakes is a data swamp."
Because such situations are less likely with a DWH, where data is prepared in advance, this phrase is sometimes used as a position statement ("that's why data lakes are no good, and you should use a DWH"), but it is also used simply to warn that data lakes must be managed with care.
Data catalog development
To prevent the data lake from becoming a "data swamp," care must be taken to manage and store the data. Furthermore, it is desirable to establish a system for "visualizing" the data in the data lake, such as making it possible to see what data is stored there (data catalog) and how the data arrived (data lineage).
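Even without a dedicated catalog product, the idea can be sketched as metadata kept alongside each dataset; all field names and values here are illustrative:

```python
# A minimal catalog: for each dataset in the lake, record what it is
# (catalog) and how it arrived (lineage), so the lake stays searchable
# instead of turning into a swamp.
catalog = {
    "raw/source=crm/dt=2024-05-01/events.json": {
        "description": "Raw CRM event export",
        "owner": "sales-ops",
        "format": "json-lines",
        "lineage": ["crm-system nightly export"],
    },
}

def find_datasets(catalog: dict, keyword: str) -> list:
    """Answer 'what data is stored where?' by searching descriptions."""
    return [
        key for key, meta in catalog.items()
        if keyword.lower() in meta["description"].lower()
    ]

hits = find_datasets(catalog, "crm")
```

Real data catalog services (and data lineage tools) automate exactly this kind of bookkeeping, but the underlying record-per-dataset structure is the same.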
Often lacks the ability to analyze and aggregate data
Data lakes also tend to lack the ability to search and aggregate the stored data. Many implementations lack full SQL support, so flexible queries are unavailable and search capabilities are limited. Even when searches are possible, they can be slow and require long processing times.
Combined use of DWH and data lake
In other words, data lakes are not necessarily superior to DWHs.
Use in combination
As explained above, DWH and data lakes each have their advantages and disadvantages. It is advisable to use them separately or in combination depending on the purpose. For example,
- Data is first stored in a "data lake" as a place to store raw data as is.
- Preprocess the raw data stored in the data lake and feed it into the DWH
- Data analysis is performed on "data stored in DWH" using BI tools.
This allows you to take advantage of the "data lake's ability to receive data" and the "DWH's ability to handle data efficiently." You can further refine it depending on your company's circumstances and needs, for example, to create a system like the one below.
- Data Lake: A place to store the original raw data as is
- Data Lake: A place to organize and store data that would otherwise be unusable
- DWH: A data repository suitable for analysis
- DWH: A place to store data for analysis
- BI tools and machine learning: Refer to DWH and data lakes as needed
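This division of labor can be sketched end to end, with a plain list standing in for the lake and SQLite standing in for the DWH (all record shapes and names are illustrative):

```python
import json
import sqlite3

# 1. Lake: raw records land as-is, whatever their shape.
lake = [
    '{"item": "apple", "qty": "3", "price": 120}',
    '{"item": "pear", "qty": 2, "price": 180}',
    '{"note": "not a sale"}',                    # accepted into the lake anyway
]

# 2. Preprocess the raw lake data and feed it into the "DWH".
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (item TEXT, qty INTEGER, price INTEGER)")
for raw in lake:
    rec = json.loads(raw)
    if {"item", "qty", "price"} <= rec.keys():   # only records that fit the schema
        dwh.execute(
            "INSERT INTO sales VALUES (?, ?, ?)",
            (rec["item"], int(rec["qty"]), int(rec["price"])),
        )

# 3. Analysis (what a BI tool would do) runs against the organized DWH side.
total = dwh.execute("SELECT SUM(qty * price) FROM sales").fetchone()[0]
```

The lake keeps everything, including the record that did not fit; the DWH holds only clean rows, which is what makes fast SQL aggregation possible.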
To effectively utilize a data lake, "data integration" is necessary
To begin with, data lakes didn't exist a generation ago, and the state of data-related technology is sure to continue to change. The state of your company's IT systems and the data you hold will also change. Furthermore, as your company's business situation changes, your data situation and needs will also change.
If that's the case, then the "ideal form" of your company's data infrastructure will likely continue to change. If there is no "correct answer" that will last into the future, then being able to freely combine and use DWHs and data lakes as needed to suit the situation and continuously respond to future changes in the technological landscape and needs will be a better way to prepare for the future.
DWH and data lakes require "free data integration" with external data and systems
Furthermore, just as the practical difficulties of using a DWH revealed the challenges and needs around data integration through ETL, using a data lake itself creates a need for data integration methods that remove the hassle of connecting to external data and systems.
So, in order to make effective use of data lakes, what is required of data integration?
- Ways to bring data into the data lake: data exists in a variety of formats across various systems, both internally and in the cloud.
- Ways to process data on the data lake: data is often not preprocessed, so it may need to be processed, such as by formatting it, before it can be used.
- Ways to extract accumulated data and use it externally: data must be made available to external systems, for example by extracting it from the data lake and feeding it into a DWH. Multiple data lakes may also be used in combination.
Therefore, you need a data integration tool with the following characteristics.
Support for a wide variety of data formats
If a data lake tool could only handle organized data with rows and columns, what would distinguish it from a DWH? It would defeat the purpose of introducing a data lake. The tool needs to be able to handle a far wider variety of data formats.
Can be connected to various systems and data
There are many different types of data in many different places, and even different products and services for data lakes themselves. You should be able to use them however you need.
Sufficiently high processing performance
Data lakes were born out of the big data boom. They need to be able to quickly connect and process large amounts of data. Simple, convenient tools for connecting data can sometimes fail to provide practical performance.
High data processing capabilities
Data in a data lake is often not prepared in advance. It is desirable to have a way to process the data as needed depending on how it is used. If the only function available is to simply transfer data from right to left, it may not be possible to do what is needed.
No-code/low-code (business sites can use it themselves)
Manually integrating and processing data every time something comes up takes too much time and effort, and documenting requests and commissioning system development each time is also too slow. Moreover, you often don't know what data utilization requires until you actually try it, so fully analyzing the requirements for an integration system in advance is not realistic.
If this is the case, it is necessary to be able to quickly change and implement ways of using data, with the data being used on-site at the forefront.
With no-code or low-code tools that allow data integration to be developed freely using only a GUI, these needs can be resolved quickly on the ground, and data utilization can be promoted efficiently.
Related keywords (for further understanding)
- DWH
- A database for storing data to be analyzed. It has specialized performance for analysis, and is often suited to storing large amounts of data and executing analytical processing.
- ETL
- ETL (Extract, Transform, Load) refers to extracting data, transforming it, and loading it into a destination. In the recent push toward data utilization, the majority of the work is not the data analysis itself, but rather this collection and preprocessing of data scattered in various places, from on-premise to cloud.
- Object Storage
- iPaaS
- A cloud service that "connects" various clouds with external systems and data simply by operating on a GUI is called iPaaS.
- No-code/Low-code
Related Products
When introducing a data lake, it can be difficult to make progress in utilizing data because you don't know where and in what form the diverse data is stored. A data catalog is a way to "see" what data is stored where. If you're interested, please see below.
DataSpider trial version and free online seminar
"DataSpider," data integration tool developed and sold by our company, also has ETL functions and is data integration tool with a proven track record as a means of supporting the utilization of DWH.
Unlike regular programming, development can be done using only the GUI (no-code), without writing any code, and it offers "high development productivity," "full-fledged performance that can serve as the foundation for business (professional use)," and "ease of use that can be used by those in the field (even non-programmers can use it)."
It can smoothly solve the problem of "connecting disparate systems and data," which is hindering not only data utilization but also the successful utilization of various IT technologies such as cloud computing.
We offer a free trial version and hold online seminars where you can try out the software for free, so we hope you will give it a try.