New Developer Blog Series Vol. 4 - Data Visualization and Analysis
Vol.04 Thinking about data quality
This is the fourth installment of this series.
- Vol.1: Data visualization is XXX
- Vol.2: Is XX included in the analysis? - Major Edition -
- Vol.3: Is XX included in the analysis? ② - Dimensions -
This episode is self-contained and covers important points for understanding the current and future issues surrounding data.
I believe this will also provide clues as to the current position of data-related services on the market and what types of services will emerge in the future.
The intended audience is anyone who works with data.
I hope you will take a look at it.
Dashboards that are created but never used
Many dashboards out there were created but are no longer used.
Some have simply outlived their usefulness because "the business environment changed or the organization was restructured," but in reality, dashboards are created every day that were released only recently and yet are not used at all.
The reason such dashboards come into being is clear.
It is not a matter of "the font is round gothic for some reason" or "it's as cluttered as a bureaucratic PowerPoint deck."
The data quality is low
If you think it's obvious that a buggy dashboard is useless, wait a moment.
This is not about bugs.
What is data quality?
A document called the Data Quality Management Guidebook has been published on the Government CIO Portal.
I would particularly like you to look at Section 2.3, Data Evaluation.
This section describes how to evaluate the quality of the data itself from 15 perspectives, in line with ISO/IEC 25012.
- 1. Accuracy
- 2. Completeness
- 3. Consistency
- 4. Credibility
- 5. Currentness
- 6. Accessibility
- 7. Standards Compliance
- 8. Confidentiality
- 9. Efficiency
- 10. Precision
- 11. Traceability
- 12. Understandability
- 13. Availability
- 14. Portability
- 15. Recoverability
For example, "4. Credibility" and "11. Traceability" include an evaluation item such as "Is the source of the data clearly stated?"
Let's say you have a dashboard that lets you check the income and expenditures of a project, including the selling price, cost, and gross profit margin.
If you are someone who regularly tracks income and expenses, you will probably want to check the definition of "cost" first.
Even a word like "cost" has no single definition; the correct answer depends on the situation.
This is not unique to our company.
Do you want to see preliminary figures or finalized ones, actual progress or projected outcomes?
And if you want the likely outcome, should you look at a simple proration or at the recent swings? The correct answer differs by person and by situation.
Data quality cannot be guaranteed simply by accuracy, such as "logically correct."
- *Here, we used "4. Credibility" and "11. Traceability" as examples, but this does not mean they are more important than the other aspects.
All perspectives are equally important.
Data Catalog Approach
Products in the Data Catalog category come equipped with a function called Data Lineage (including, of course, HULFT DataCatalog).
This function allows you to track the path that data has taken from its creation to the present, and any changes made along the way.
This feature is one solution to the problem with the "cost" example above.
By tracing and showing the path and changes from the source of data generation, it is possible to satisfy the evaluation items "4. Credibility" and "11. Traceability," which state whether the origin of the data is clearly stated.
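As a minimal sketch of the idea, lineage can be modeled as a graph of derivation edges that is walked backwards from a dashboard metric to its source systems. All dataset names and transformation notes below are hypothetical illustrations, not the HULFT DataCatalog API:

```python
# Minimal data-lineage sketch: each entry maps a derived dataset to the
# inputs it was built from plus a note describing the transformation.
# All names are hypothetical, for illustration only.
LINEAGE = {
    "dashboard.gross_profit": (["dwh.sales", "dwh.cost"], "gross profit = sales - cost"),
    "dwh.cost": (["erp.purchase_orders", "erp.labor_hours"], "cost = materials + labor"),
    "dwh.sales": (["crm.closed_deals"], "sum of closed deal amounts"),
}

def trace(dataset, depth=0):
    """Recursively list how a dataset was derived, back to its sources."""
    inputs, note = LINEAGE.get(dataset, ([], "source system"))
    lines = ["  " * depth + f"{dataset}: {note}"]
    for parent in inputs:
        lines.extend(trace(parent, depth + 1))
    return lines

for line in trace("dashboard.gross_profit"):
    print(line)
```

Walking the graph like this is what lets a user answer "where does this 'cost' figure actually come from?" without leaving the tool.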
The general perception of Data Catalog is that it is a tool for searching for data.
If you read the Data Quality Management Guidebook and then look at the features, you may see things differently.
For example, if a customer has an issue with "1. Accuracy," we can approach that issue through Data Catalog.
However, keep the flip side in mind as well.
The key point is that the piece that satisfies the "4. Credibility" aspect does not have to be a Data Catalog.
What matters is that each quality aspect is guaranteed somewhere across the entire internal data infrastructure.
There is a Data Catalog service called Amazon DataZone, announced at the recent AWS re:Invent.
This service includes a component called "Publish/subscribe workflow with access management."
I was surprised to see a Data Catalog approach the quality perspective of "8. Confidentiality."
So does this mean Data Catalogs will need to address "8. Confidentiality" in the future? Not necessarily.
"8. Confidentiality" is generally controlled through DWH role design or dynamic data masking.
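Conceptually, that kind of role-based control can be sketched in a few lines. The roles and the masking rule below are illustrative assumptions, not any specific DWH's policy syntax:

```python
# Dynamic masking sketch: the stored value never changes, but what a
# query returns depends on the caller's role. Role names and the
# masking rule are illustrative assumptions.
def mask_cost(value, role):
    if role in {"finance", "admin"}:
        return value      # privileged roles see the real figure
    return "****"         # everyone else gets a masked value

print(mask_cost(1200, "finance"))   # real figure
print(mask_cost(1200, "analyst"))   # masked
```

Real DWHs attach policies like this declaratively to columns, so the masking follows the data wherever it is queried.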
And in many cases, that's fine.
To reiterate, it is important to ensure quality across the entire internal data infrastructure.
When a customer assembles the pieces of their internal data infrastructure on a Best-of-Breed basis, some quality aspects end up unguaranteed when you look at the overall picture.
When we think about who will fill those gaps, it seems there is a lot for our company to do.
Ensuring data quality
From here, I would like to talk about ways of thinking about ensuring data quality and some specific examples of measures to take.
It's someone else's fault so there's no problem!
For example, due to an issue with the interface source system, an error occurred in the figures reflected on the dashboard.
Alternatively, a malfunction in the company's internal data infrastructure could be delaying the latest figures from being reflected on the dashboard.
In cases like this, it's a relief to investigate and find out that it's someone else's fault.
But that's not the case in data engineering.
Deterioration in data quality becomes apparent to users through user interfaces such as dashboards.
So, no matter who is to blame, it is the dashboard that will lose credibility and fall into disuse.
The same is true for machine learning: no matter how good the model, it will not be valued if the results look strange, even if the cause is a bug in a source system developed by another vendor.
Data engineering is judged by the results it produces.
Data engineers should be conscious not only of the quality of the outputs of the company's data infrastructure, such as BI and machine learning results, but also keep an eye on the process and the path the data takes.
No matter how loudly you insist "it's someone else's fault," it just rings hollow.
(I'm writing this calmly, but it's incredibly infuriating when you experience it.)
Resilience and Observability
One approach to failure preparedness prevalent in the cloud world is the pursuit of resilience.
The idea is not to try to prevent failures entirely, but to assume failures will occur and design systems for resilience and adaptability.
The key to this is observability: knowing what is happening, when, and where.
This idea can also be applied to data.
In other words, rather than preventing degradation of data quality, the idea is to keep in mind that degradation will occur and to make the data resilient and adaptable.
It may seem counterintuitive, but a system that can honestly say "don't look at this dashboard right now" or "don't use this data for analysis right now" is what earns users' trust in the dashboard and the data.
Taking these points into consideration, the approach to ensuring data quality comes down to this:
"Monitor the current state of the data and notify users as necessary."
This is the focus.
To do this, it is important to consider what observability means in data.
Observability in Data
Regarding observability in data, the article 5 Pillars of Data Observability is close to my thinking.
The five pillars are summarized below:
- Freshness
- Distribution
- Volume
- Schema
- Lineage
Please understand that these five pillars are what we monitor, while data quality is evaluated from the 15 perspectives mentioned above.
For example, a common incident in internal data infrastructure is a BI refresh starting before data loading into the DWH has finished.
This can be detected by monitoring Freshness at BI refresh time and evaluating "5. Currentness" from the Data Quality Management Guidebook.
Tableau and Power BI have data-driven alert functionality, so this can be implemented with the BI tool alone.
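The core of such a freshness check is simple: compare the table's last-load timestamp against a maximum allowed age before the BI refresh proceeds. The timestamps and threshold below are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Freshness check sketch: is the DWH table's last load recent enough
# for the BI refresh to proceed? Values are illustrative assumptions.
def is_fresh(last_loaded_at, max_age, now=None):
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age

now = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 15, 6, 30, tzinfo=timezone.utc)
print(is_fresh(loaded, timedelta(hours=3), now))   # True: loaded 2.5h ago
print(is_fresh(loaded, timedelta(hours=2), now))   # False: older than 2h
```

In practice the same comparison would be wired to the BI tool's alert feature, firing a notification instead of printing.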
If data is lost due to a failure in a source system, this can be detected by monitoring Distribution and Volume and evaluating "2. Completeness."
In one implementation we did for a customer, we built a dedicated monitoring pipeline with ETL to observe the DWH.
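A Volume check of this kind can be as simple as comparing today's row count against a recent baseline and flagging large deviations. The counts and threshold below are illustrative assumptions, not figures from the customer's system:

```python
from statistics import mean, pstdev

# Volume check sketch: flag today's row count if it deviates from the
# recent daily counts by more than k standard deviations.
# All numbers are illustrative assumptions.
def volume_anomaly(history, today, k=3.0):
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

history = [10_120, 9_980, 10_050, 10_200, 9_910]   # recent daily row counts
print(volume_anomaly(history, 10_100))  # within the normal range
print(volume_anomaly(history, 4_200))   # likely data loss upstream
```

A dedicated monitoring pipeline would run a check like this after each load and post the result to the notification channel users actually watch.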
The question is how to feed the results of this monitoring and evaluation back to users.
One good approach is to place a bulletin board on the dashboard that displays a message whenever there is a problem with data quality.
Some setups also automatically post such information to a dedicated portal or similar.
Even on a smaller scale, measures such as:
- Displaying the data update date and time
- Specify the source of the data/prepare a page explaining the calculation logic
can be implemented within the dashboard itself, so it is a good idea to do them whenever you create a dashboard.
Data Validation Tool (DVT) product family
In this field of observation and evaluation, products and services categorized as Data Validation Tools (DVTs) are beginning to emerge.
I have not yet heard of anyone around me using them, but I expect to hear the name more and more in the future.
In particular, the following evaluation criteria can largely be judged by rules, so I think tool adoption will grow here:
- 1. Accuracy
- 2. Completeness
- 3. Consistency
- 7. Standards Compliance
- 10. Precision
Lastly
So far we have discussed data quality challenges and how to address them.
The fact that these issues have become apparent suggests that the maturity of data utilization has reached a certain level.
I also suspect that there are many cases where people are aware of the issues but don't know how to deal with them.
I hope this article will provide some hints in such cases.
The Data Quality Management Guidebook states the following:
The evaluation of the data itself is only a snapshot at the time of evaluation, and in order to maintain high-quality data, the data must be continually updated. Even if high-quality data is provided at one time, it is meaningless if the quality of the data subsequently declines. For this reason, it is important to repeatedly check the quality of the data.
Continuous monitoring of data quality is important.
I wrote this article with that in mind, of course, but I have placed this point at the end because it deserves emphasis.
The activity of data engineering will continue as long as corporate activities continue.
As a company continues its activities, the business environment changes both internally and externally, so the content of the data also changes, and the way it is interpreted also needs to change.
Data engineering, along with SaaS, is probably the field that most requires a DevOps-like system and mindset.
In this environment, a system that can continuously monitor and evaluate data quality is likely to become a fundamental system similar to CI/CD in DevOps.
If you have read this far, you may now have a better idea of the current situation and future of data.
Thank you for reading this long article to the end.
Next time I'll try to write about something a little different.
