New Developer Blog Series Vol. 2 - Data Visualization and Analysis
Vol. 02: Is XX included in the analysis? - Measures Edition -
This is the second installment on visualization and analysis.
Did you have a look at the previous article?
Dimensions and Measures
Before getting into the main topic, I would like to provide some background information.
In the field of data engineering, there are terms called dimensions and measures.
Dimensions are qualitative information that serve as analytical angles.
Measures are quantitative information that serve as indicators in analysis.
For engineers, the items specified in the GROUP BY clause of a SQL query are dimensions.
Measures may be easier to understand if you think of them as the items handled by aggregate functions such as SUM() in the SELECT clause.
Imagine filling out a survey.
At the beginning or end of the survey, you will be asked to fill out a face sheet in addition to the actual questions.
Face-sheet items are your attribute information such as age, gender, place of residence, company name, and job title.
These face-sheet items correspond to dimensions.
Apart from those, you will be asked the questions that the surveyor actually wants answered.
For example, if a question asks about satisfaction on a 5-point scale, that answer would be a measure.
When analyzing, rather than just vaguely looking at the overall picture, we slice measures by dimensions.
In the example of a survey, you might derive a result such as "Respondents overall are generally satisfied, but men in their 30s alone are dissatisfied," which is what it means to slice a measure by a dimension.
In business figures, organizations and the like are dimensions, and sales and the like are measures.
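As a minimal sketch of slicing a measure by dimensions, the snippet below averages a satisfaction score for each combination of age group and gender, much like GROUP BY with an aggregate function. The survey rows, the field names, and the helper average_by are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses (illustrative data only).
# Dimensions: age_group, gender. Measure: satisfaction on a 5-point scale.
responses = [
    {"age_group": "20s", "gender": "F", "satisfaction": 4},
    {"age_group": "20s", "gender": "M", "satisfaction": 5},
    {"age_group": "30s", "gender": "F", "satisfaction": 4},
    {"age_group": "30s", "gender": "M", "satisfaction": 2},
    {"age_group": "30s", "gender": "M", "satisfaction": 1},
    {"age_group": "40s", "gender": "F", "satisfaction": 5},
    {"age_group": "40s", "gender": "M", "satisfaction": 4},
]


def average_by(rows, dimensions, measure):
    """Average `measure` per combination of `dimensions` (like GROUP BY + AVG)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row[measure])
    return {key: mean(values) for key, values in groups.items()}


# The overall picture hides what the sliced view reveals.
overall = mean(r["satisfaction"] for r in responses)
by_segment = average_by(responses, ["age_group", "gender"], "satisfaction")

print(f"overall: {overall:.2f}")
for segment, score in sorted(by_segment.items()):
    print(segment, score)
```

With this made-up data the overall average looks healthy, while the ("30s", "M") segment stands out as clearly lower.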
Points that are often overlooked
In the early stages of analysis, there are some points that are often overlooked for both measures and dimensions.
I originally intended to cover both in this article, but once I started writing it ended up being quite long.
I will be splitting this into two parts, and this time I would like to focus on measures.
Now, let's get to the main topic.
Does the analysis involve division?
When we interpret numbers that we see every day, we implicitly perform division to "convert" them.
Let's assume that, without any prior knowledge, you look at the number of coronavirus infections by prefecture on a certain day.
The raw counts make it appear as though the virus is especially prevalent in Tokyo, and they probably look that way no matter which day you check.
The reason you don't misread them like this is that you already know Tokyo has an extremely large population.
How many people know which prefecture has a larger population, Oita Prefecture or Yamagata Prefecture?
If you look at the daily number of infected people without knowing the underlying populations, you can't make any inferences.
In such cases, we use division to convert the numbers so that they can be interpreted even without prior knowledge of the population by prefecture.
For example, "number of infected people per 100,000 people."
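The conversion can be sketched in a few lines. The prefecture names, case counts, and populations below are made up for illustration and are not real figures; per_100k is a hypothetical helper name.

```python
def per_100k(cases, population):
    """Convert a raw case count into a rate per 100,000 residents."""
    return cases * 100_000 / population


# Hypothetical figures for two fictional prefectures (not real data).
daily = {
    "Prefecture A": {"cases": 300, "population": 14_000_000},
    "Prefecture B": {"cases": 40, "population": 1_000_000},
}

rates = {
    name: per_100k(d["cases"], d["population"]) for name, d in daily.items()
}

for name, rate in rates.items():
    print(f"{name}: {daily[name]['cases']} cases, {rate:.2f} per 100,000")
```

In this made-up data, Prefecture A has far more cases in absolute terms, but Prefecture B has the higher rate once the population is divided out.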
This is something we do naturally when looking at business figures.
When analyzing profitability, you would not ignore profit margin and decide based solely on the absolute amount of profit, and when it comes to labor costs, you would first look at indicators such as the labor-cost-to-sales ratio and labor share.
In our business, we have traditionally used indicators such as productivity per labor hour and sales per employee.
Analysis is performed using not only raw numbers but also measures converted by division.
When I write like this with examples, I realize that I am making a very obvious statement.
As I wrote repeatedly last time, the first step is to put into words even things that are taken for granted and to have a common understanding.
This obvious fact is easily forgotten when we are faced with unfamiliar numbers and try to begin an analysis.
The trap of forgetting the obvious
For example, if you were suddenly tasked with conducting customer analysis, the first thing you would do is search for information on how to do it.
The search takes you to a page that introduces decile analysis and RFM analysis.
The content looks useful, so you begin your analysis based on it.
Through trial and error, you arrive at the result that "customers with higher ranks spend more in total."
As some of you may have noticed, this is not an analysis.
Generally, customers are given a higher rank because they spend a lot of money, so this result simply reverses the arrow between cause and effect.
This may seem like a funny story at first glance, but this kind of thing really does happen in the field of analysis.
No one is at fault here, as the sites returned in the search results only explain the first step.
Because it is the first step, they explain the analysis using simple measures, but that is just the starting point.
The next step is to try replacing the measure you are currently analyzing with another measure.
At that time, try mixing in measures created by division.
For example, if you were looking at purchase amounts by customer rank, try converting that to average customer spending.
Average customer spending here is an indicator obtained by dividing the total amount of a customer's purchases by the number of times they visit the store.
Suppose we find that average customer spending does not differ between high-ranking customers and others.
Since high-ranking customers spend more in total, the difference between them and the others must lie in how often they visit the store.
We can then hypothesize that initiatives such as rewards based on the number of visits may be more effective than marketing initiatives such as bulk-buying campaigns.
I think this example is a little too simple, but I hope it gives you a sense of what I'm trying to convey.
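The customer-rank example above can be sketched as follows. The rank labels, totals, and customer counts are hypothetical numbers chosen so that spending per visit is equal across ranks, matching the scenario described.

```python
# Hypothetical per-rank totals (illustrative numbers only).
rank_totals = {
    "high": {"customers": 20, "total_spend": 1_200_000, "visits": 400},
    "others": {"customers": 20, "total_spend": 300_000, "visits": 100},
}

# Raw measure: total spend per customer. High rank wins by construction.
spend_per_customer = {
    rank: t["total_spend"] / t["customers"] for rank, t in rank_totals.items()
}

# Measure made by division: average spend per store visit.
spend_per_visit = {
    rank: t["total_spend"] / t["visits"] for rank, t in rank_totals.items()
}

# Another measure made by division: visits per customer.
visits_per_customer = {
    rank: t["visits"] / t["customers"] for rank, t in rank_totals.items()
}

for rank in rank_totals:
    print(
        rank,
        spend_per_customer[rank],
        spend_per_visit[rank],
        visits_per_customer[rank],
    )
```

Here spend per visit is identical for both groups, so the gap in total spend comes entirely from visit frequency, which is the kind of observation that leads to the hypothesis about visit-based rewards.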
Answer to the previous question
In the previous article, I left a question at the end: "Why are height and weight listed but not BMI?"
If you've read this far, you already know.
For example, if you are trying to analyze obesity and use weight as an indicator, you cannot eliminate the influence of height.
To give an extreme example, is there any point in comparing the weight of a man in the Netherlands, where the average height is 184cm, with that of a man in Japan, where the average height is 171cm? I'm sure everyone would question that.
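BMI is itself a measure made by division: weight in kilograms divided by the square of height in meters. As a small sketch, the snippet below applies it to two hypothetical men who weigh the same; the heights are the averages quoted above, while the 75 kg weight is an assumption for illustration only.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m**2


# Same hypothetical weight, heights taken from the averages quoted in the text.
weight_kg = 75.0
bmi_at_184cm = bmi(weight_kg, 1.84)
bmi_at_171cm = bmi(weight_kg, 1.71)

print(f"75 kg at 184 cm: BMI {bmi_at_184cm:.1f}")
print(f"75 kg at 171 cm: BMI {bmi_at_171cm:.1f}")
```

The raw weights are identical, yet the two BMI values differ by several points, which is the influence of height that weight alone cannot remove.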
I asked this question so that you would first gain the perspective of checking whether an analysis contains a measure made by division.
I hope you will keep this in mind when dealing with analysis.
Application to domain knowledge
By gaining this basic understanding, you can also deepen your understanding of domain knowledge (or business knowledge).
For example, in the retail industry there is a concept called PI.
PI stands for Purchase Index and is the number of items sold per 1,000 customers (a measure made by division).
When ordering food, we consider how many items we can sell before the products arrive, so we make a sales forecast.
Since we order hundreds or thousands of items, it's not realistic to forecast sales for every item individually.
So what we do is predict the number of customers.
The number of customers is calculated on a store-by-store basis, so the store manager makes predictions taking into account weather, events, etc.
Next, you can predict sales for each product by multiplying the predicted number of customers by the PI calculated from past performance.
PI is also useful in analysis. For example, by using PI instead of sales volume as the measure in a time series analysis, you can look purely at changes in product appeal, with the influence of customer numbers eliminated.
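The forecasting steps above can be sketched as follows. The product names, past sales, and customer counts are hypothetical, and purchase_index and forecast_units are illustrative helper names rather than an established API.

```python
def purchase_index(units_sold, customers):
    """PI: number of items sold per 1,000 customers."""
    return units_sold * 1000 / customers


def forecast_units(predicted_customers, pi):
    """Forecast unit sales from a predicted customer count and a PI value."""
    return predicted_customers * pi / 1000


# Hypothetical past performance for one store (illustrative numbers only).
past_customers = 3000
past_units = {"milk": 450, "tofu": 120}

# Step 1: compute PI per product from past performance.
pi_by_product = {
    product: purchase_index(units, past_customers)
    for product, units in past_units.items()
}

# Step 2: the store manager predicts only the customer count.
predicted_customers = 3600

# Step 3: multiply to get a per-product forecast.
forecast = {
    product: forecast_units(predicted_customers, pi)
    for product, pi in pi_by_product.items()
}

print(pi_by_product)
print(forecast)
```

With these numbers the PI values are 150 and 40, and a predicted 3,600 customers gives forecasts of 540 and 144 units. Only one number per store has to be predicted by hand; the per-product forecasts follow from it.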
In the transportation industry, there is also the concept of cost per unit.
It is simply shipping cost divided by cargo quantity (I know I'm repeating myself, but it is made by division).
In the transportation industry, areas are frequently reorganized and sorting centers are restructured to improve transportation efficiency.
This can result in situations like, "Even if the overall amount of cargo doesn't change much, when you look at it by area, the amount of cargo can be very different between last month and this month."
From the perspective of cost analysis, this effect needs to be eliminated, so the analysis proceeds using cost per unit.
*Transportation efficiency is also affected by the size of the cargo, so another option is cost per unit of cargo size (physical volume).
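The cost-per-unit idea from the transportation example can be sketched like this. The areas, costs, and parcel counts are hypothetical, chosen so that total cargo is unchanged while its split between areas shifts after a reorganization.

```python
# Hypothetical monthly shipping data by area (illustrative numbers only).
# Total parcels are 2,000 in both months, but the split between areas changes.
shipments = [
    {"month": "last", "area": "north", "cost": 500_000, "parcels": 1000},
    {"month": "last", "area": "south", "cost": 500_000, "parcels": 1000},
    {"month": "this", "area": "north", "cost": 780_000, "parcels": 1500},
    {"month": "this", "area": "south", "cost": 250_000, "parcels": 500},
]

# Cost per unit: shipping cost divided by cargo quantity.
unit_cost = {
    (s["month"], s["area"]): s["cost"] / s["parcels"] for s in shipments
}

for area in ("north", "south"):
    before = unit_cost[("last", area)]
    after = unit_cost[("this", area)]
    print(f"{area}: {before:.0f} -> {after:.0f} per parcel")
```

In this made-up data the raw cost for the north area jumps sharply and the south area's cost halves, but the cost per parcel barely moves in the north and not at all in the south. Comparing unit costs removes the effect of the cargo being reassigned between areas.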
There is a limit to how thoroughly anyone can know each of these domains.
However, even without specialist knowledge, understanding that "they must be dividing to convert something or to eliminate some effect" lets you imagine what an appropriate indicator might be.
Last time, I mentioned that people who are still maturing in their use of data often have a feel for the numbers but are unable to explain that feel logically.
Part of the reason for this may be related to whether or not this division is performed.
This is a case where "people with experience have implicit prior knowledge and look at numbers with that knowledge."
In the example of Oita and Yamagata prefectures mentioned above, if you know the population of both prefectures, you can gain some insight just by looking at the number of infected people.
In this case, the approach is to keep in mind that division is involved and to look for the divisor (the number to divide by) that best matches the intuition.
Even if you don't find the answer, by making an effort to get closer you will be able to come up with a somewhat logical explanation.
As for this approach, perceptive readers may have already realized what I am getting at.
It is nothing more than a simplified version of machine learning done in an analog way.
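As a rough sketch of this divisor search, and not a real machine learning method, the snippet below tries each candidate divisor, computes the resulting ratio, and keeps the divisor whose ratio correlates best with an experienced person's gut rating. The store data, field names, ratings, and the choice of Pearson correlation as the matching criterion are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical stores: raw sales, two candidate divisors, and an experienced
# person's gut rating (1 = poor, 5 = excellent). Illustrative data only.
stores = [
    {"sales": 1000, "employees": 10, "floor_area": 100, "gut_rating": 5},
    {"sales": 1200, "employees": 20, "floor_area": 110, "gut_rating": 2},
    {"sales": 600, "employees": 8, "floor_area": 90, "gut_rating": 4},
    {"sales": 900, "employees": 15, "floor_area": 60, "gut_rating": 3},
]


def score_divisors(rows, numerator, candidates, target):
    """Correlate numerator/divisor with the target for each candidate divisor."""
    ratings = [row[target] for row in rows]
    scores = {}
    for divisor in candidates:
        ratios = [row[numerator] / row[divisor] for row in rows]
        scores[divisor] = pearson(ratios, ratings)
    return scores


scores = score_divisors(stores, "sales", ["employees", "floor_area"], "gut_rating")
best_divisor = max(scores, key=scores.get)

print(scores)
print("best divisor:", best_divisor)
```

With this made-up data, sales per employee tracks the gut rating far better than sales per floor area, which suggests that the experienced person was implicitly dividing by headcount.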
If I continue like this, I'll probably keep writing forever, so I'll end it here for now.
Next time, as mentioned above, I will touch on the point about dimensions. I hope there are people waiting for that.
