Data cleansing techniques you need to know for efficient data utilization
Data cleansing is an essential step in utilizing data collected in daily business operations, but do you have a correct understanding of its necessity and methods? This article summarizes the data cleansing methods and target items required, especially when data integration between multiple business systems.
Data cleansing techniques
Data cleansing is the process of organizing data collected during business operations into data suitable for analysis for management decisions and sales strategies. Let's start by understanding the specific method step by step.
STEP 1: Analyze data quality and determine cleansing targets
When performing data cleansing, it is important to first observe the target data and determine what kind of "cleaning" is required. Identify items that need to be corrected, deleted, or organized, such as missing data, inconsistencies in notation, inconsistencies, incorrect entries, and duplications, and objectively analyze how "dirty" the data is.
This procedure is also important when considering specific work processes. Many people believe that the data stored in-house has few errors, but when you look at each item, it is not uncommon to find unexpectedly rough spots. In order to avoid making mistakes in the work process, it is important to thoroughly analyze and evaluate the quality of the target data.
STEP 2: Mechanical data cleansing
There are various ways to cleanse data, but first, among the items to be cleansed discovered in STEP 1, correct or delete those that can be processed mechanically in bulk. Specifically, it is a good idea to narrow down the process to standardizing variations in notation, correcting incorrect entries, and eliminating duplicate items (name matching).
The purpose of this step is to mechanically cleanse a large number of items and roughly organize the data. Because a large number of items are being processed, specialized tools are usually used to comprehensively improve the data.
STEP 3: Manual data cleansing
If there are items that cannot be improved by mechanical cleansing using specialized tools, visual cleansing processing will be required. Specifically, this includes deleting duplicate data and unnecessary data.
Recent cleansing tools process large amounts of corporate databases, so cleansing can be done almost automatically, but there are cases where they cannot handle changes in company name, closure, or changes in person information (job title, department name, last name, etc.). In order to eliminate such leaks, the data that has been cleansed by the tool must be visually inspected and checked again by a human.
STEP 4: Cleansing treatment according to the purpose
Once you have corrected and deleted any inconsistencies in data notation, typos, or duplications, the next step is to perform a cleansing process that suits the data's intended use.
For example, even if they are all called customer lists, the lists managed by the sales department and the marketing department may have different items. The sales department's list may include the names of the people in charge and their key superiors, while the marketing department's list may emphasize information such as the customer's prospects and inflow channels. When data from both departments is combined in an internal data integration process, the items will naturally not match, resulting in missing data.
If you just want to know the names of your company's customers, this missing data isn't much of a problem. However, if you have a specific purpose, correcting these missing parts during the cleansing stage will make the data more useful later.
STEP 5: Monitoring data quality
Data cleansing is not a one-time process. Data such as customer lists, product management data, and employee performance are accumulated daily in the course of business, and the data is constantly increasing. Therefore, even after the data cleansing process is complete, it is necessary to define rules to prevent inconsistencies in spelling, errors, and duplications, and to monitor whether the data is being operated in accordance with those rules.
In addition to checking the data itself, it is also important to create a system to prevent incorrect data entry by eliminating free-form input fields and implementing system controls. If such monitoring is neglected, data quality will gradually deteriorate over time.
Data cleansing target items
What specific items should you modify or delete during data cleansing? We have compiled a list of the most common items you should check.
Inconsistent spelling (adjusted inconsistent spelling)
-
-
- Company name: Unification of legal entity such as "Kabushiki Kaisha", "(Kabushiki Kaisha)", "Kabushiki Kaisha", "Ka", etc.
- Personal name: Separation method for first and last name (full-width space, half-width space, cell, etc.)
- Address: Postal code hyphens, street address notation (full-width numbers, half-width numbers, Chinese numerals), etc.
- Phone number: Area code notation, whether or not to use hyphens, etc.
Missing data (partial data missing completion)
-
-
- Missing company or person names: Company name, legal entity, person name, job title, etc.
- Missing data required for specific purposes: listing status, number of employees, sales, industry, product purchase history, customer inflow channels, etc.
Correcting typing errors (typing mistakes, correcting incorrect information)
-
-
- Company name: Alphabet, katakana, yo-on sounds, presence or absence of dots, etc.
- Legal personality: Position of legal personality, type of legal personality (joint stock company, limited liability company, limited partnership company, general partnership company, NPO corporation, general incorporated association, general incorporated foundation), etc.
- Address: Responding to name changes due to municipal mergers, etc.
- Telephone numbers: Addressing area code changes, etc.
Excel functions that can be used in data cleansing
Generally, dedicated tools are used for data cleansing, but when the amount of data is small or a large proportion of the work is done by human eyes, the versatile and easy-to-use Excel can be used. In such cases, data organization and name matching can be carried out efficiently by utilizing the following functions.
COUNTIF: Finding duplicates in data
The COUNTIF function is useful for checking for duplicates during cleansing, as it searches a specified range for and displays the number of data items that meet certain criteria. Excel's "Remove Duplicates" function can be used to remove duplicate company names or personal names, but this can be difficult to use when there are multiple companies with the same name. Always use the COUNTIF function to visually check the relevant data.
[=COUNTIF(range, search criteria)]
DATEDIF: Function to calculate a period
The DATEDIF function is useful when you want to find time, such as the number of years of business or the length of membership. This function displays the period (number of years, months, days, etc.) between a specified start date and end date, and can be used when organizing data for sales strategies or marketing, such as when you want to find the number of years of business or the length of membership.
[=DATEDIF(start date, end date, units)]
VLOOKUP: Finding corresponding values from specific data
The VLOOKUP function extracts corresponding values from a specified range, which is particularly useful when manually editing each item. It is useful for efficiently processing large amounts of data, such as extracting company names and contact names from customer codes, or ingredient names and prices from product names.
[=VLOOKUP(search value, range, column number, search type)]
TRIM: Remove unnecessary space
It will only leave spaces between strings in the specified cell and delete all other spaces. This is useful when sorting personal names, for example, when you want to leave only the space between the first and last name and delete all other spaces.
[=TRIM(specified cell)]
CLEAN: Remove unnecessary line breaks
This function removes line breaks from the text in the specified cell and displays it. This is a useful function for organizing free-form fields such as notes about customer information or survey responses.
[=CLEAN(specified cell)]
Cleansing is essential for effective use of data
Data cleansing is an essential process for improving the quality of databases and making them useful for business decisions, planning sales strategies, etc. It is important to prepare data according to the purpose and improve the quality of the database.
The importance of data cleansing and the benefits it brings are explained in detail in the following article. Please take a look at this as well.