Developer Blog Series Vol. 11 - Data Visualization and Analysis -


Vol.11 Terms and Concepts Surrounding Data - Modern Data Stack -

This is the 11th article on data analysis and visualization.

This article is a continuation of the previous one, explaining terms and concepts surrounding data.
The message common to both articles is to separate concepts from implementations and to understand the concepts properly.

Many advanced and sophisticated ideas are emerging in the data industry, but do we really need dedicated tools to implement these concepts? That is the question I want to raise.
It is also important to keep this in mind when promoting ETL and iPaaS.

Last time, I wrote about semantic layers and data virtualization using DataFabric as an example.
This time I will be focusing on the Modern Data Stack.

It would take a huge amount of writing to comprehensively explain the Modern Data Stack, so I hope this article will help you understand the following concepts first.

  • Reverse ETL Tools
  • Change Data Capture

What is the Modern Data Stack?

For an explanation of the Modern Data Stack, I will quote an article from IT Media.

As a fundamentally new approach to data integration, it features a modern data stack hosted in the cloud, with little or no technical configuration required by the user. Its key features are that the technical barriers to implementation and use are low, and it can be used without spending time on data connection and processing.

What is the Modern Data Stack? A new trend in data integration: Editorial column - ITmedia Enterprise

In other words, the trend is to easily create a data analysis infrastructure using the latest services on the cloud.

I had been watching it, thinking it was just a buzzword, but it seems to be gaining more traction than I expected, with newer service providers jumping on the bandwagon.

If you start looking into what exactly it is, you'll be swamped by the providers' self-serving explanations, so first take a look at the following page to see what categories of services are calling themselves Modern Data Stack.

Reference: Categories - Modern Data Stack | Modern Data Stack

The following categories are listed here:

  • Data Workspace/ Collaboration
  • Data modelling & transformation
  • Data Warehouses
  • Feature Store
  • Event Tracking
  • Metrics Store
  • Business Intelligence
  • No code automation
  • Augmented Analytics
  • Operational Analytics
  • Data Cataloging
  • Synthetic Data
  • Data Privacy & Governance
  • Spreadsheet based BI
  • Reverse ETL Tools
  • Data Lakes
  • Workflow Orchestration
  • Data Discovery
  • Business Reliability/Observability
  • Data Quality Monitoring
  • Data Streaming
  • PLG CRM
  • Change Data Capture
  • Data Mesh
  • Managed Data Stack
  • Product Analytics
  • Customer Data Platforms(CDP)
  • DataOps
  • Data Apps

As stated at the beginning, this article will focus on the following two categories (selected not for their importance, but for their relevance to ETL):

  • Reverse ETL Tools
  • Change Data Capture

Reverse ETL Tools

Related articles

  • AWS re:Invent 2022 from an Analytics Perspective
In a section titled "Scope of Analytics" in my article about attending re:Invent 2022, I wrote the following:

    This means that the process of feeding the results obtained in SoI back to SoR and SoE is automated.
    This is called Operational Data Analytics, and this is what Analytics is really about.

Furthermore, in fiscal year 2023 as well, the point remains the same: analysis is not just collecting data on a data analysis platform and analyzing it, but also includes feeding the results directly back into business systems and the like.

Reverse ETL Tools enable this feedback data flow.
In contrast to ETL, which brings data into the analysis platform, Reverse ETL is the process of returning data from the analysis platform to each system.

The term "Reverse ETL" is more of an architectural term, so it is sometimes called "Data activation" based on use cases.

It is said that the keys to realizing Reverse ETL are execution control (triggers and frequency) and differential updates.
Since traditional ETL supposedly cannot handle these, there are calls to use dedicated Reverse ETL tools.
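To make "differential updates" concrete, here is a minimal sketch of a Reverse ETL sync in Python. The table name, columns, and the CRM call are all hypothetical; a SQLite in-memory database stands in for the warehouse, and a real job would call the target system's API where `push_to_crm` prints.

```python
import sqlite3

# Stand-in "warehouse": a table of customer scores produced by analytics.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_scores (
    customer_id TEXT PRIMARY KEY,
    churn_score REAL,
    updated_at TEXT)""")
conn.executemany(
    "INSERT INTO customer_scores VALUES (?, ?, ?)",
    [("c1", 0.12, "2024-01-01T00:00:00"),
     ("c2", 0.87, "2024-01-02T00:00:00"),
     ("c3", 0.45, "2024-01-03T00:00:00")])

def push_to_crm(record):
    # Hypothetical target call; a real job would hit the CRM's REST API here.
    print(f"update {record['customer_id']} churn_score={record['churn_score']}")

def reverse_etl_sync(conn, last_watermark):
    """Differential update: send only rows changed since the last run."""
    rows = conn.execute(
        "SELECT customer_id, churn_score, updated_at FROM customer_scores "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,)).fetchall()
    new_watermark = last_watermark
    for customer_id, churn_score, updated_at in rows:
        push_to_crm({"customer_id": customer_id, "churn_score": churn_score})
        new_watermark = max(new_watermark, updated_at)
    return new_watermark  # persist this for the next scheduled run

# A scheduled run picks up only rows changed after the stored watermark.
wm = reverse_etl_sync(conn, "2024-01-01T12:00:00")
```

The watermark pattern (store the newest `updated_at` seen, query for anything newer) is exactly the kind of execution control and differential update a generic ETL tool can also implement.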

Is that really true?

Going back to the question at the beginning: do we really need dedicated tools to implement the concept?
The concept of Reverse ETL (Data activation) is essential.
However, nothing in the explanation above is impossible to implement with ordinary ETL.

When small, individually optimized components are introduced into a data analysis platform, problems inevitably arise over how to orchestrate them and how to maintain an overall view of the system.

Not just in analytical platforms, but even when we look at the history of IT, we see a shift in trends between decentralization and centralization, and between best of breed and monolithic.
Ultimately, it comes down to finding the right balance between orchestration and lock-in.

In the case of analytical platforms in particular, two personas, data engineers and data scientists, are involved, which makes it easy to drift toward individual optimization.
Looked at from the perspective of overall orchestration and where to place autonomy, I think ETL and iPaaS could play new roles.

Change Data Capture

In the same AWS article, I also touched on "a Zero-ETL future."
It was quite a shock when AWS announced Zero ETL in their keynote.

The main part of this Zero ETL is a service called Amazon Aurora zero-ETL integration with Amazon Redshift.
The technology underlying this service is Change Data Capture (CDC).

CDC is a mechanism that tracks changes that occur in source data (mainly changes to the database) and uses them as triggers to perform some kind of processing.

The open-source Debezium is often used here, but it covers only the "tracking changes that occur in the source system" part, so Kafka is often used for the "running some kind of processing" part.
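To show what "tracking changes" actually produces, here is a sketch of the general shape of Debezium-style change events and a consumer that replays them into a local copy of the table. An in-memory list stands in for a Kafka topic, and the row schema is hypothetical; the `op` codes (`c` create, `u` update, `d` delete, `r` snapshot read) follow Debezium's event format.

```python
import json

# Simplified change events; a real consumer would read these from a
# Kafka topic rather than an in-memory list.
events = [
    json.dumps({"op": "c", "before": None,
                "after": {"id": 1, "email": "a@example.com"}}),
    json.dumps({"op": "u", "before": {"id": 1, "email": "a@example.com"},
                "after": {"id": 1, "email": "b@example.com"}}),
    json.dumps({"op": "d", "before": {"id": 1, "email": "b@example.com"},
                "after": None}),
]

def apply_change(state, event_json):
    """Replay one change event against a downstream copy of the table."""
    event = json.loads(event_json)
    if event["op"] in ("c", "u", "r"):   # create / update / snapshot read
        row = event["after"]
        state[row["id"]] = row
    elif event["op"] == "d":             # delete: drop the old row
        state.pop(event["before"]["id"], None)
    return state

replica = {}
for e in events:
    apply_change(replica, e)
```

After replaying all three events, the downstream copy is empty again, because the delete event removed the row that the create and update events had built up.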

Another example is Delta Live Tables, a key feature of Databricks, a fully managed analytics platform.

This is a feature that automates the gradual refinement of the data layer (Medallion Architecture in Databricks).
Data updates in downstream layers are achieved solely through CDC and DDL (Create Table).

Databricks explains that this eliminates ETL, and I think it's actually the most sophisticated approach to do so (as I wrote in the AWS article, this is only a localized discussion).

In the previous article, I wrote that there are people in the world loudly proclaiming, "We don't need ETL!"
Among them are service providers who have built CDC in as a processing trigger.

If you've read this far, you probably understand to some extent, but the important part of CDC is "tracking changes that occur in the source system." In fact, that's the only thing CDC originally covers.

If we can get the ETL pipeline started using this CDC trigger, there won't be much difference between non-ETL and ETL processing.
In fact, CDC trigger functionality is gradually being implemented in cloud ETL.

With some ingenuity, it is possible to give existing ETL the same functionality.
However, in practice, there are many cases where a native mechanism for subscribing to events/messages and triggering their execution is desired.
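As one example of the "ingenuity" mentioned above, the glue between a CDC event and an existing ETL tool can be a very small dispatcher. Everything here is hypothetical: the table-to-pipeline mapping and `start_pipeline` stand in for whatever job-start API your ETL or iPaaS product exposes.

```python
# Hypothetical mapping from source table to the ETL pipeline it should trigger.
PIPELINE_FOR_TABLE = {
    "orders": "load_orders_to_warehouse",
    "customers": "refresh_customer_dimension",
}

triggered = []

def start_pipeline(name, event):
    # In a real setup this would call the ETL/iPaaS tool's job-start API.
    triggered.append((name, event["table"]))
    print(f"starting pipeline {name} for table {event['table']}")

def on_change_event(event):
    """CDC covers only 'tracking changes'; this glue decides what to run."""
    pipeline = PIPELINE_FOR_TABLE.get(event["table"])
    if pipeline:
        start_pipeline(pipeline, event)

on_change_event({"table": "orders", "op": "u"})
on_change_event({"table": "audit_log", "op": "c"})  # no pipeline mapped; ignored
```

The point is that once the "tracking" half is handled by CDC, the "processing" half is an ordinary pipeline start, which existing ETL tools already do well.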

*As an aside, what's great about Amazon Aurora zero-ETL integration with Amazon Redshift is not so much the CDC mechanism of "tracking changes that occur in the source system," but rather the part that follows, "running some kind of processing." The fact that this is optimized at the storage layer gives it an advantage over other mechanisms.

Conclusion

This time we looked at Reverse ETL Tools and Change Data Capture from the Modern Data Stack.
I wrote this article because I want people to be aware that these may come to compete with ETL and iPaaS in the future, in ways that may not look like competition at first.

Again, while the concepts are important, including the Modern Data Stack categories not covered here, the implementation does not have to be a dedicated tool.
Please keep in mind that you will have a significant advantage if you promote ETL and iPaaS with the orchestration perspective in mind as well.

I would love for many people to participate in overseas events, experience things locally, and build on that experience in the future.

I hope this article helps someone.
Thank you for reading to the end.

The person who wrote the article

Affiliation: Deputy General Manager of DP Development Department 1, DP Management Division, DI Headquarters, and Manager of DP Development Section 1 (affiliation is current at the time of publication)

Ryota Takasaka

Hobbies: Outdoor activities such as mountain climbing and camping
~ I work daily as an architect for the BI platform (analysis platform) ~
