Is this the era of decentralized data management? What is the next-generation data infrastructure that realizes data mesh?

As companies make greater use of IT, the amount of data they hold is increasing and the formats are becoming more diverse and complex. Along with these changes, the idea that data should be managed in a decentralized manner has emerged, as opposed to the traditional mainstream approach of managing data in a centralized location.
In this article, we will use the concept of "data mesh" as an example to explain what the next-generation data infrastructure should look like.

Distributed Data Management: What is a Data Mesh?

Data mesh is a new data management method that has been advocated in recent years, in which data is managed in a distributed manner rather than centralized in one place.

Data mesh is based on the idea of managing and utilizing data by domain, or by business unit. Each department is responsible for its own data, and it is expected to enable quick data utilization and decision-making.

Let's look at the four principles that characterize a data mesh.

Principle 1: Local ownership

The department that owns the data understands best how it was generated and what business value it holds, which makes it easier to respond to user inquiries and maintenance requests quickly and accurately. Ideally, each field department organizes its own data, sets access controls and disclosure rules, and builds a system that allows the data to be shared company-wide. Data can then circulate without going through a central department, and data utilization can be promoted without sacrificing the speed of business.

However, if a department hoards its data, other departments cannot make use of it. A data mesh therefore requires a mechanism for sharing data with other domains as needed while the owner continues to fulfill its responsibilities. Furthermore, if rules are not standardized, data quickly becomes optimized only for individual departments and loses reusability across the company, so a cross-departmental governance system is essential.

Principle 2: Data as a product

Data mesh treats data not as a mere by-product of business, but as a valuable deliverable for users, i.e., a product. As a product, its quality, ease of use, and delivery format must be clearly defined, and it must be documented so that users can easily use it.

Specifically, data owners will need to think from the perspective of data users and provide documentation, sample queries, best practices for use, etc. Also, by making data accessible through APIs, anyone can handle the data with a consistent level of quality.

This will make it easier for data users to make optimal decisions and is expected to help them gain insights that will lead to business development. At the same time, productizing data will also serve as a driving force for establishing a culture of continuous data quality improvement.
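To make this concrete, the sketch below shows one way a data product definition could be expressed in code, keeping the owner, schema documentation, quality promise, and a sample query next to the data itself. It is only an illustration: the DataProduct class, the sales.customers product, and the example.com endpoint are assumed names, and a data mesh does not prescribe any particular format.

    from dataclasses import dataclass

    @dataclass
    class DataProduct:
        """A minimal, hypothetical descriptor for a domain-owned data product."""
        name: str                 # the name consumers search for
        owner: str                # owning domain / department
        endpoint: str             # API endpoint where the data is served
        schema: dict              # column name -> meaning and type
        freshness_sla_hours: int  # how often the owner promises to refresh the data
        sample_query: str         # a ready-to-run example for consumers

    # Example: the sales domain publishes its customer data as a product.
    customer_product = DataProduct(
        name="sales.customers",
        owner="Sales Department",
        endpoint="https://api.example.com/v1/sales/customers",  # illustrative URL
        schema={
            "customer_id": "string, unique ID issued by the CRM",
            "segment": "string, one of SMB / Mid-market / Enterprise",
            "created_at": "timestamp, UTC",
        },
        freshness_sla_hours=24,
        sample_query="SELECT segment, COUNT(*) FROM sales.customers GROUP BY segment",
    )

Publishing this kind of descriptor alongside the data is what turns a mere table into a product that others can evaluate and use with confidence.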

▼I want to know more about the API
API | Glossary

Principle 3: Self-service infrastructure

Business personnel and data scientists who want to utilize data tend to prefer a system that allows them to quickly retrieve data as needed, rather than having to understand in detail the departments and systems to which the data belongs. If users can freely search and access the data they need, they can carry out analyses and plan measures without slowing down the pace of business operations.

To make self-service a reality, it is important for data providers to publish APIs and create an environment where users can smoothly obtain data. Ideally, published APIs would be constantly monitored, with quality evaluation and operational status visualized. Furthermore, a data catalog that centrally manages data-related information would make it possible to see which APIs provide what data, making collaboration between departments even easier. Establishing such a system will lay the foundation for data utilization throughout the entire company.
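As a rough sketch of that flow, the following Python example searches a hypothetical catalog API for a data product and then pulls the data directly from the endpoint the catalog returns. The catalog.example.com URL, the response fields, and the token handling are all assumptions made for illustration, not features of any specific product.

    import requests

    CATALOG_URL = "https://catalog.example.com/api/search"  # hypothetical catalog endpoint

    def find_and_fetch(keyword: str, token: str) -> list[dict]:
        """Search the data catalog, then pull the data from the owning domain's API."""
        # 1. Ask the catalog which published API serves data matching the keyword.
        hits = requests.get(CATALOG_URL, params={"q": keyword}, timeout=10).json()
        if not hits:
            raise LookupError(f"No data product found for '{keyword}'")

        # 2. Fetch the data directly from the endpoint registered in the catalog.
        endpoint = hits[0]["endpoint"]
        response = requests.get(endpoint,
                                headers={"Authorization": f"Bearer {token}"},
                                timeout=30)
        response.raise_for_status()
        return response.json()

    # A business user can now pull order data without raising a request ticket:
    # orders = find_and_fetch("orders", token="...")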

Principle 4: Cross-departmental governance

One of the major risks of decentralized management is that data becomes siloed again. If each department starts using its own naming conventions and quality standards, it will become difficult to collaborate across departments. Even if each department owns the data, establishing a unified format and security rules will make it easier to reuse data and ensure its quality.

By having each department follow these standards, data integration and mutual use will become smoother, and duplicate development and processing errors can be minimized. This standardization does not mean centralized management, but rather should be thought of as the establishment of a common language to support decentralized management. Ultimately, the benefit is that each department can autonomously publish data while still being able to use it consistently at a company-wide level.
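As a small illustration of standards acting as a common language, the check below applies a hypothetical company-wide naming convention and completeness threshold to a data product before publication; each domain keeps ownership, but everyone is measured against the same rules. The specific rule values are made up for the example.

    import re

    # Hypothetical company-wide standards agreed across domains.
    NAMING_RULE = re.compile(r"^[a-z]+\.[a-z_]+$")  # e.g. "sales.customers"
    MIN_COMPLETENESS = 0.95                          # at most 5% missing values

    def governance_check(product_name: str, completeness: float) -> list[str]:
        """Return the list of company-wide standards this data product violates."""
        violations = []
        if not NAMING_RULE.match(product_name):
            violations.append(f"'{product_name}' does not follow the domain.dataset naming convention")
        if completeness < MIN_COMPLETENESS:
            violations.append(f"completeness {completeness:.0%} is below the required {MIN_COMPLETENESS:.0%}")
        return violations

    print(governance_check("Sales-Customers", completeness=0.91))
    # -> naming convention violation and a completeness warning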

The Rise of Data Mesh: The Limits of Centralized Control

The emergence of data mesh stems from a variety of challenges with traditional data management. The amount of data handled by companies continues to grow every day, and the types of data are becoming more diverse. In this situation, problems have become apparent that cannot be fully addressed through centralized data management.

Data silos and scattered data

Within a company, data tends to become siloed along organizational and system boundaries. It is common for different departments to manage different types of data, such as the sales department for customer data and the production management department for product data. It is also not uncommon for the same type of data to be managed separately by multiple departments.

In addition, data is scattered across various system environments, including on-premise, cloud, and SaaS. Sharing data between departments or systems therefore requires coordination each time, along with considerable work such as converting data formats and meeting security requirements. As a result, it becomes difficult to quickly utilize the data you need.

The limitations of a centralized approach

To address these challenges, centralized approaches such as data lakes and data warehouses have been adopted to consolidate data in one place.

  • Data lake: A repository that stores large amounts of data in its raw form, regardless of format, so that it can be used flexibly for analysis and machine learning.
  • Data warehouse: A store that organizes and integrates data collected from various systems into a format optimized for analysis, enabling rapid reporting with BI tools and the like.

However, simply collecting data in one place can lead to new problems. When a huge amount of data accumulates without being properly managed, it can become a data swamp, making it difficult to find the data you need. If the data is of low quality and there are duplications and inconsistent updates, its usefulness will be significantly reduced.

▼I want to know more about data lakes
Data Lake | Glossary

Difficulty in dealing with accelerating data growth

In today's business environment, the types and amounts of data handled by companies are increasing exponentially due to AI, IoT, external collaboration, and other factors. This has made it extremely difficult to constantly maintain the huge data infrastructure and maintain a unified data model across the entire company using traditional centralized management approaches. With limited expertise and operational resources, companies are finding it difficult to keep up with the speed and flexibility required for data utilization.

These limitations of centralized management are a major reason why a new approach called data mesh is gaining attention.

What is the data integration platform needed to realize a data mesh?

When a company introduces a data mesh, it is essential to design an operational structure and governance in addition to simply introducing tools and cloud platforms. Particularly significant challenges include how to delegate ownership between departments and unify data quality standards. Here, we will consider key technologies that support distributed management, such as iPaaS, no-code tools, and data catalogs.

▼I want to know more about iPaaS
iPaaS | Glossary

Integrating Data with iPaaS

iPaaS (Integration Platform as a Service) is a cloud service that connects multiple systems and applications and centrally manages data integration. By combining the APIs and connectors of each system, it becomes easier to automate the extraction, transformation, and loading of data. This simplifies the ETL and ELT processes that were previously built separately for each system and makes governance easier to maintain. In a data mesh environment, iPaaS plays a major role as a foundation for bridging departments.
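For readers who want to see what this automation replaces, here is a deliberately simplified hand-coded version of the same extract-transform-load flow; an iPaaS lets you assemble the equivalent from connectors and prebuilt steps instead of writing it. The source API and the local SQLite table are assumptions for the sketch.

    import sqlite3
    import requests

    SOURCE_API = "https://crm.example.com/api/customers"  # hypothetical source system

    def extract() -> list[dict]:
        """Pull raw records from the source system's API (the 'E' step)."""
        return requests.get(SOURCE_API, timeout=30).json()

    def transform(records: list[dict]) -> list[tuple]:
        """Normalize fields and drop incomplete rows (the 'T' step)."""
        return [(r["id"], r["name"].strip(), r["email"].lower())
                for r in records if r.get("email")]

    def load(rows: list[tuple]) -> None:
        """Write the cleaned rows into an analytics store (the 'L' step)."""
        with sqlite3.connect("warehouse.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT, email TEXT)")
            conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

    # One pipeline run: load(transform(extract()))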

▼I want to know more about ETL
ETL | Glossary

Standardized common interface environment

Introducing different ETL tools in different departments fragments data transformation processes and job management, making unified operational management difficult. It also increases the risk that different transformation logic is applied to the same data set, producing inconsistent results.

By standardizing interfaces and components in a common environment, data acquisition and update procedures can be unified across systems, making it easier to maintain uniform security levels and data quality standards across the entire company. In addition, the use of common components allows new systems to be integrated smoothly, enabling a rapid response to business changes.

As data utilization advances, the number and types of interfaces between systems increase, and management costs rise with them. Establishing a standardized interface keeps the effort required for data integration to a minimum, even when different departments introduce different tools.
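One way to picture such a standardized interface is a shared contract that every department's connector implements, so that any two connectors can be combined without bespoke glue code. The Protocol below is a hypothetical sketch, not a feature of any particular product.

    import csv
    from typing import Iterable, Protocol

    class Connector(Protocol):
        """Hypothetical company-standard contract for all data connectors."""
        def extract(self) -> Iterable[dict]: ...
        def load(self, records: Iterable[dict]) -> None: ...

    class CsvConnector:
        """One concrete implementation; database or SaaS connectors follow the same shape."""
        def __init__(self, path: str):
            self.path = path

        def extract(self) -> Iterable[dict]:
            with open(self.path, newline="") as f:
                yield from csv.DictReader(f)

        def load(self, records: Iterable[dict]) -> None:
            rows = list(records)
            if not rows:
                return
            with open(self.path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)

    def sync(source: Connector, target: Connector) -> None:
        """Because both sides follow the same contract, no bespoke glue code is needed."""
        target.load(source.extract())

With such a contract in place, connecting a new system only requires one conforming connector rather than a new point-to-point integration for every existing system.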

No-code, on-site-led interface development

By utilizing no-code tools, even business personnel with little programming knowledge can easily set up data integration and APIs. If field personnel can set up data integration themselves, the lead time from requirements definition to implementation will be significantly reduced. Even with limited IT resources, it will be easier to quickly reflect department-specific tools and data in the usage environment.

Furthermore, if each department can publish their data in API format without coding, collaboration with other departments and external partners will become smoother. Users can access data via API regardless of programming language or BI tool, so there is no need to be concerned about differences in technology stacks. This will greatly expand the scope of data utilization, allowing business users to quickly create analytical dashboards and automatically connect with external services.

At the same time, providing standardized templates and guidelines minimizes unregulated development and security risks, and allows you to practice the data mesh principle of "local ownership" while maintaining company-wide control.

Centralize data management with a data catalog

To efficiently handle distributed data assets, a system is needed to visualize which department owns what data. By introducing a data catalog, metadata, ownership information, access methods, and more can be consolidated in one place, creating an environment where users can easily search and reference data. In data mesh, instead of centralizing the actual data itself, this catalog is used as an "information hub for distributed management."

Exploring Data in a Data Mesh

As a company grows in scale, the number of departments that hold data and the systems they use increases, making the management structure more complex. A data mesh assumes this situation and adopts a strategy of effective governance without physical integration.

When data is owned by each business site, it can seem like it would be a lot of work to find the information you need. Therefore, it is important to position the data catalog as a company-wide dictionary so that anyone can easily search for the data they need.

This will create a system that does not impede company-wide data utilization without compromising the speed and flexibility that decentralized management brings.

What is a Data Catalog?

A data catalog is a solution that aggregates and manages metadata for data scattered inside and outside a company, making it easy for those who need it to access it when they need it. By registering table structures, column meanings, data owners, use cases, and more, it becomes possible to get a bird's-eye view of the data assets of the entire organization.
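A single catalog entry might carry metadata along the following lines; the fields and values here are an illustrative assumption, and actual catalog products each define their own schema.

    # Illustrative metadata record; real catalog products define their own fields.
    catalog_entry = {
        "name": "sales.customers",
        "owner": "Sales Department",
        "contact": "sales-data@example.com",          # hypothetical contact address
        "description": "Active customers managed in the CRM, refreshed nightly",
        "columns": {
            "customer_id": "Unique ID issued by the CRM",
            "segment": "SMB / Mid-market / Enterprise",
        },
        "access": "REST API (https://api.example.com/v1/sales/customers)",
        "use_cases": ["monthly sales reporting", "churn analysis"],
    }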

By using a data catalog, data can be logically managed in a centralized manner, even if the data is physically stored in different locations or has different owners. This is not a system for collecting data in a central location, but rather a system that allows users to easily understand the existence and content of the data. While the field departments retain ownership, it makes it easier for anyone across the company to access the data they need, significantly improving the efficiency and quality of data utilization.

Users who find the data they need through the data catalog can freely access it using APIs and ETL tools. This significantly reduces bottlenecks in approval processes and inquiries, allowing them to spend more time on strategic data utilization rather than on routine tasks such as reconciling figures.

Summary

As business environments become more sophisticated and diverse, distributed management such as data mesh is gaining attention. By combining field-led ownership, product orientation, and a self-service platform, it is possible to build a new data utilization system that combines speed and flexibility.

No-code iPaaS and data catalogs are useful for establishing a data mesh-like architecture that makes it easy to maintain governance even in a distributed environment. With the explosive growth of data and the rapid expansion of usage scenarios expected to continue, this distributed management approach will likely become an indispensable option for many companies.

About the author

Affiliation: Data Integration Consulting Department, Data & AI Evangelist

Shinnosuke Yamamoto

After joining the company, he worked as a data engineer, designing and developing data infrastructure, primarily for major manufacturing clients. He then became involved in business planning for the standardization of data integration and the introduction of generative AI environments. Since April 2023, he has been working as a pre-sales representative, proposing and planning services related to data infrastructure, while also giving lectures at seminars and acting as an evangelist in the "data x generative AI" field. His hobbies are traveling to remote islands and visiting open-air baths.
(Affiliations are as of the time of publication)
