Developer Blog Series Vol. 7 - Data Visualization and Analysis
Vol.07 Classifying Data
This is the seventh article on data analysis and visualization.
This time, and next time, I would like to write an article focusing on "data" itself.
The term "data" can mean a variety of things.
For example, when we look at the structure of data, there is a division into structured data and unstructured data.
In terms of data formats, CSV and JSON are typical, and Parquet and Avro are also commonly used in data engineering.
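To make the format distinction concrete, here is a minimal sketch (using only the Python standard library, with made-up sample records) that serializes the same data as CSV and as JSON; Parquet and Avro require third-party libraries, so they are omitted here.

```python
import csv
import io
import json

# Sample records: the same data can be serialized in multiple formats.
records = [
    {"id": 1, "name": "alice", "amount": 1200},
    {"id": 2, "name": "bob", "amount": 800},
]

# CSV: flat, row-oriented text with a single header line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing field names on every record, nesting is possible.
json_text = json.dumps(records)

print(csv_text)
print(json_text)
```

Columnar formats like Parquet make a different trade-off: they store values column by column, which compresses better and speeds up analytical scans.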
What about the content of the data?
Leaving aside the business implications of data for the moment, I thought I would summarize how data is organized in data engineering and data management.
This time, I will write about data classification, mainly from the perspective of data management, with reference to Chapter 10 of the DMBOK (Data Management Body Of Knowledge).
Next time, I plan to write about data layering from a data engineering perspective.
Master Data and Transactional Data
Probably the most familiar and traditional classification for most people is master data and transactional data.
It is also sometimes referred to as dimension data and fact data (though I have only ever seen those terms in documentation).
There are many definitions available if you search, so I won't explain them here.
In enterprise system development, I think that data (or in most cases, RDB tables) is often classified as either master or transactional.
In the fields of data engineering and data management, we need to make more detailed classifications.
Six-layer data taxonomy
Malcolm Chisholm defined a six-tiered data taxonomy.
That is what is shown in the table below.
| # | Category | Description |
| --- | --- | --- |
| 1 | Reference Data | Code tables that map codes to their meanings, etc. |
| 2 | Enterprise Structure Data | Used for reporting on business activity, such as charts of accounts |
| 3 | Transaction Structure Data | Customers, products, vendors, etc. that must exist before a transaction can occur |
| 4 | Transactional Business Data | Records the details of transactions |
| 5 | Transaction Audit Data | Describes the state of a transaction |
| 6 | Metadata | Data that describes other data |
In Chisholm's definition, items 1 to 3 are master data, and items 4 and 5 are transactional data.
First, a bit more about transaction data.
Transactional business data (4) is what we usually picture as transactional data.
Transaction audit data (5) tracks the lifecycle of a transaction; server logs are one example.
Next, let's talk about master data.
First, regarding (2) enterprise structure data and (3) transaction structure data: I personally see little value in drawing a sharp line between them, and the DMBOK does not dwell on the distinction either.
(I can see why it might be a useful classification, though.)
On the other hand, there is clear value in distinguishing reference data (1) from (2) and (3).
- Chisholm's definition treats (1) as master data, but the DMBOK calls only (2) and (3) master data.
- In this article and the next, I will refer to (1) as reference data and to (2) and (3) as master data.
Reference and Master Data
When trying to achieve consistency of master data assets through MDM (Master Data Management), the focus of data management differs between master data and reference data.
The DMBOK states the following as the focus of data management:
- Reference data provides organizations with access to an entire list of accurate and up-to-date values.
- Master data ensures that the most accurate and up-to-date data about critical business entities is consistently available across systems.
At first glance, it is hard to tell how these two differ.
Reference data is typically managed in tables such as category-value masters or settings masters: mappings like "category 1 means xx, category 2 means xx."
These values do not change frequently, so the goal is simply to share an accurate and up-to-date list of values.
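A reference data table can be sketched as nothing more than a code-to-meaning mapping. The status codes below are invented for illustration:

```python
# A minimal sketch of reference data: a code table mapping codes to meanings.
# The codes and meanings here are made up for illustration.
order_status = {
    "01": "received",
    "02": "shipped",
    "03": "delivered",
}

def describe(code: str) -> str:
    """Resolve a status code to its human-readable meaning."""
    return order_status.get(code, "unknown")

print(describe("02"))
```

Because such a table changes rarely, the management problem reduces to distributing one accurate, current copy of it.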
Master data (2 and 3), on the other hand, is expected to change, and in practice the notion of effective dates also comes into play.
Organizational changes make it clear why master data must be consistent.
To know which department I belong to as of April 1st, and what its parent organization is, you have to combine multiple data sets in a consistent way. Consistency is the key point.
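The "as of April 1st" question can be sketched as a point-in-time lookup over an effective-dated master table. The employee, department names, and validity periods below are all hypothetical:

```python
from datetime import date
from typing import Optional

# Hypothetical effective-dated organization master: each row is valid
# from valid_from (inclusive) up to valid_to (exclusive).
org_assignments = [
    {"employee": "yamada", "org": "Sales Dept A",
     "valid_from": date(2023, 4, 1), "valid_to": date(2024, 4, 1)},
    {"employee": "yamada", "org": "Sales Dept B",
     "valid_from": date(2024, 4, 1), "valid_to": date(9999, 12, 31)},
]

def org_as_of(employee: str, as_of: date) -> Optional[str]:
    """Return the organization the employee belonged to on the given date."""
    for row in org_assignments:
        if row["employee"] == employee and row["valid_from"] <= as_of < row["valid_to"]:
            return row["org"]
    return None
```

Every dataset joined against this master must use the same as-of date, which is exactly the consistency requirement described above.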
This is expressed in the DMBOK as "entity resolution."
So from an MDM perspective, the main focus for master data is making consistent, entity-resolved data available across systems.
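As a toy illustration of entity resolution (not a real MDM tool), suppose two systems hold the "same" customer under slightly different spellings; a normalized match key lets us pair them up. All record contents here are invented:

```python
# Two systems' views of the same customer, with differing spellings and case.
crm_customers = [{"id": "C-001", "name": "Acme Corp.", "email": "INFO@ACME.EXAMPLE"}]
billing_customers = [{"id": "B-42", "name": "ACME CORP", "email": "info@acme.example"}]

def normalize(record):
    """Build a match key: here, the lower-cased email stands in for identity."""
    return record["email"].strip().lower()

def resolve(left, right):
    """Pair up records from two systems that share the same match key."""
    index = {normalize(r): r for r in right}
    return [(l, index[normalize(l)]) for l in left if normalize(l) in index]

matches = resolve(crm_customers, billing_customers)
```

Real entity resolution is far messier (fuzzy name matching, survivorship rules, human review), but the core idea is the same: decide which records refer to the same real-world entity.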
On the other hand, it is worth remembering that when it comes to managing metadata, the main target is reference data.
Beyond recording what each code means, we also want to manage information such as which standard the data conforms to.
For example, for reference data such as prefecture codes and prefecture names, we would also manage metadata such as "conforms to JIS X 0401."
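One simple way to picture this is a reference data set bundled with its own metadata. The structure below is a sketch of my own, not a DMBOK-prescribed format; the codes shown (01 Hokkaido, 13 Tokyo, 27 Osaka) follow JIS X 0401:

```python
# Reference data bundled with its metadata: the code list itself plus a
# record of which external standard it conforms to.
prefecture_codes = {
    "data": {"01": "Hokkaido", "13": "Tokyo", "27": "Osaka"},
    "metadata": {
        "standard": "JIS X 0401",
        "description": "Prefecture identification codes",
    },
}

print(prefecture_codes["data"]["13"], prefecture_codes["metadata"]["standard"])
```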
Data-driven and data
The drive towards becoming a data-driven organization has focused heavily on transactional data.
On the other hand, it is said that how much transaction data can be utilized depends largely on the quality of the reference data and master data.
The emergence of MDM as a framework, and the fact that the DMBOK devotes an entire chapter to reference and master data, both point to how important this area is.
When we talk about a data-driven organization, one keyword that comes up is a state in which everyone is comfortable working with data, in other words the democratization of data.
In the next article, I will discuss data layering and also consider the reasons why data democratization is not progressing as expected.
Thank you for reading to the end.
