Developer Blog Series Vol.8 - Data Visualization and Analysis
Vol.08 Data Layer and In-Housing
This is the eighth article on data analysis and visualization.
Continuing from last time, we will talk about "data."
Although it continues the structure of the previous article, this is the piece I most wanted to write, so it is considerably longer.
As mentioned before, I will discuss data layers from a data engineering perspective, and I will also examine why in-house data utilization is not making progress, so I hope you will read on.
Defining the Data Layer
Data changes form several times between the time it is generated and the time it is processed and used for BI, ML, and so on.
It changes form because the original data is either not easy to handle as-is, or only becomes meaningful when combined with other data, so it is processed in stages.
The idea behind data layers is to define these stages as layers and think about each layer separately.
First, here are some general layer definitions:
While these definitions are broadly shared, the names of the layers vary depending on who is describing them.
The reasons for layering also vary, from verbalizing concepts to organizing data storage.
The resulting layers, however, are similar in all cases, so I will list them together.
| Layer | Alias | Content |
|---|---|---|
| Raw data layer | Ingestion data layer, Landing data layer | Data as received from the data source |
| Curated data layer | Conformed data layer, Cleansed data layer, Staging data layer | Data that has been prepared and cleaned and is ready for use |
| Application data layer | User data layer, Production data layer, Analytics data layer | Data optimized for consumers such as BI, ML, and business systems |
This is the general definition, but in practice the Raw data layer seems a bit coarse, so I think it is better to further subdivide Raw as follows.
| Layer breakdown | Content |
|---|---|
| Aggregated | Data collected from mission-critical systems, core systems, SaaS, etc. |
| Atomic | Data collected from IoT devices, system logs, etc. |
Although data collected from mission-critical and core systems is treated as raw data from the perspective of the data analysis platform, it is not truly raw, since it has already undergone some kind of aggregation or processing.
Data collected from IoT devices and similar sources, on the other hand, is raw in the true sense.
Separation from the architecture
When I talk about data layers, some people respond, "You're talking about Datalake / DWH / Datamart."
However, that is an architecture question, and I think it is best to consider the two separately.
For example, I have seen few companies that store aggregated data solely in the Datalake.
Atomic data, on the other hand, is often stored only in the Datalake (and queried directly through a query service).
If we were to express the relationship between architecture and data layer in a matrix, I think the following pattern would be most common.
| Architecture | Services used, etc. | Raw data layer (Atomic) | Raw data layer (Aggregated) | Curated data layer | Application data layer |
|---|---|---|---|---|---|
| Datalake | S3, ADLS2, etc. | 〇 | 〇 *1 | - | - |
| Data-warehouse | Redshift, Synapse Analytics, Snowflake, etc. | - | 〇 *1 | 〇 *2 | - |
| Data-mart | Cloud DWH services or RDB | - | - | 〇 *2 | 〇 |

*1 Duplicate data is stored in both locations.
*2 Which location holds the data depends on the situation.
The curated data layer is often placed in a data warehouse, but as the system is developed, it may also be placed in a data mart on a departmental basis.
Although it's a bit off topic, please also think about the architecture and the services used separately.
Data-mart can also be realized using a DWH service, so it is important to note that using a DWH service does not necessarily mean the architecture is a Data-warehouse.
There is no clear best practice for what to use as a Data-mart. Both of the following are very common:
- Using an RDB, because older-generation DWH services struggle with concurrent access
- Using a DWH service, to address throughput problems when BI tools retrieve data in bulk
New-generation DWH services such as Snowflake solve this problem by design, separating compute into independent warehouses.
Reasons for the lack of progress in in-house development and the data layer
Even as the world moves toward in-housing, many of our customers are considering bringing development in-house, particularly in the area of BI.
We hear that not only IT departments but also business departments are moving toward developing their own systems in-house.
The first thing to do in this situation is to take training on BI tools.
BI tools are well-designed, so the basic operations themselves are not difficult.
However, when you actually try to handle the data in the field, you run into a wall.
This is because the data handled in actual work is not as clean as training data.
Data handled in actual work
For example, imagine aggregated data.
The data stored in a system is often kept in wide format (held horizontally) to make it easier for users to view.
However, data that is easy to handle with BI tools is long (held vertically), so a conversion (unpivot) may be necessary.
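The unpivot described above can be sketched in a few lines of pandas. This is a minimal illustration with made-up sample data (the column names and values are hypothetical, not from the article):

```python
import pandas as pd

# Wide-format data, as a user-facing system might store it:
# one row per product, one column per month.
wide = pd.DataFrame({
    "product": ["A", "B"],
    "2024-01": [100, 80],
    "2024-02": [120, 90],
})

# Unpivot (melt) into long format, which BI tools handle more easily:
# one row per (product, month) pair.
long = wide.melt(id_vars="product", var_name="month", value_name="sales")
print(long)
```

Each month column becomes a set of rows, so two products over two months yield four long-format rows.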
This is the simplest example, but data transformations like this happen all the time.
Eliminating duplicates and uniquely identifying records also comes up frequently.
As with SQL window functions such as row_number(), the pattern is to sort by some column and keep only the most recent record per key.
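In pandas, this keep-only-the-latest pattern can be mimicked by sorting and dropping duplicates, as in this minimal sketch (the customer records are invented for illustration):

```python
import pandas as pd

# Each customer may appear multiple times; keep only the latest record,
# analogous to SQL's row_number() over (partition by customer
# order by updated_at desc) filtered to row number 1.
df = pd.DataFrame({
    "customer":   ["C1", "C1", "C2"],
    "updated_at": ["2024-03-01", "2024-06-01", "2024-05-01"],
    "status":     ["trial", "paid", "trial"],
})

latest = (
    df.sort_values("updated_at", ascending=False)  # newest first
      .drop_duplicates(subset="customer", keep="first")  # one row per customer
      .sort_values("customer")
      .reset_index(drop=True)
)
print(latest)
```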
On the other hand, there are cases that cannot be expressed without intentionally multiplying the number of records with what SQL calls a CROSS JOIN.
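A typical case is building every combination of two dimensions so that missing combinations show up as zeros rather than disappearing from a chart. A minimal pandas sketch, with hypothetical store and sales data:

```python
import pandas as pd

# Build all (store, month) combinations, even months with no sales.
stores = pd.DataFrame({"store": ["Tokyo", "Osaka"]})
months = pd.DataFrame({"month": ["2024-01", "2024-02", "2024-03"]})

# merge(how="cross") is pandas' equivalent of SQL's CROSS JOIN:
# 2 stores x 3 months = 6 rows.
grid = stores.merge(months, how="cross")

# Left-join the actual (sparse) sales onto the full grid, so that
# combinations with no sales appear as 0 instead of being absent.
sales = pd.DataFrame({"store": ["Tokyo"], "month": ["2024-01"], "sales": [100]})
full = grid.merge(sales, how="left").fillna({"sales": 0})
print(full)
```

Without the cross join, the five store-month combinations with no sales would simply not exist in the result.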
It would be extremely difficult for a business division to overcome these challenges on its own.
Therefore, to move forward with in-housing, this groundwork must be laid in advance.
In other words, data preparation is required in the Curated data layer or the Application data layer.
Preparing the Curated Data Layer/Application Data Layer
So, does that mean that it is enough to first prepare the Curated data layer/Application data layer? No, that is not the case.
A curated data layer/application data layer cannot be established until the needs are understood to a certain extent.
This is because the basic concept of the data analysis platform is schema-on-read.
Schema on read is the principle that "the user of the data decides the shape of the data."
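As a toy illustration of this principle (the event records and field names are hypothetical, not from the article): under schema-on-read, raw records land with no imposed structure, and each consumer projects them into whatever shape it needs.

```python
import json

# Raw events land as-is; no schema is imposed at write time.
raw = [
    '{"event": "click", "user": "u1", "props": {"page": "/home"}}',
    '{"event": "purchase", "user": "u2", "props": {"amount": 120}}',
]
events = [json.loads(line) for line in raw]

# Consumer A: a BI view that only needs event counts per type.
counts = {}
for e in events:
    counts[e["event"]] = counts.get(e["event"], 0) + 1

# Consumer B: a revenue view that projects a completely different
# shape from the very same raw records.
revenue = sum(e["props"].get("amount", 0) for e in events
              if e["event"] == "purchase")
print(counts, revenue)
```

The two consumers never agree on a shared schema up front; each decides the shape of the data at read time.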
Therefore, the starting point for developing a Curated data layer or Application data layer is a bottom-up approach.
I think the key to speeding things up is to switch to a top-down approach once a certain number of samples have accumulated.
If this timing is misjudged, unused curated data will be created and in-housing will stagnate.
Taking steps in line with the maturity of DX
As I have mentioned several times in this series, I believe that DX has stages of maturity, and that it is necessary to take steps according to the level of maturity.
Where do you switch between bottom-up and top-down? Or how far do you go with freedom and where do you take control?
Our proposals will likely vary depending on the stage of DX maturity our clients are at.
For example, in data utilization, the concept of the modern data stack is rapidly emerging.
The current modern data stack is very effective in helping companies with low DX maturity get off to a good start, but it is unclear whether it can meet all the demands of more mature companies.
It's not that there's a lack of functionality, but rather that there are some gaps that fall through the cracks.
What approaches can be considered to fill this gap? It could be an iPaaS or an ETL tool.
Of course, as you know, there are many cases where iPaaS and ETL are effective in accelerating initial momentum when DX maturity is low.
The issues faced by companies at low maturity and by more mature companies are different, so I wonder whether the approaches can be categorized accordingly.
Over the previous article and this one, we have organized our thinking about data.
I study every day to understand each customer's situation from multiple angles and make the best possible proposals, covering topics such as data quality, which I have written about in the past, and the organization of data security requirements.
Thank you for reading to the end.
