How to proceed with data cleansing | Explaining the specific steps and points to note

When utilizing data, data cleansing is essential to increase the accuracy and consistency of information. Leaving incorrect or duplicate data untreated not only undermines the reliability of analytical results, but also has a negative impact on decision-making.
In this article, we will explain the process and specific methods for achieving effective data cleansing, as well as organizational techniques in Excel, and offer tips to help you use data more smoothly in your business.

Data cleansing process and points to consider

First, let's organize the overall flow of the cleansing process and the points that should be considered in advance.

Data cleansing begins with clearly defining the scope of the data and understanding its quality, as missing or inconsistent information can lead to erroneous analysis results and poor business decisions.

Additionally, setting data quality standards and defining detailed correction and input rules will help the cleansing process proceed smoothly. Avoid vague standards: they lead team members to write and judge the same data differently, which increases the risk of compromising overall quality.

A foolproof data cleansing process

Generally speaking, the trick to avoiding failure is to follow these steps:

  1. Define the purpose of using data
  2. Define the data quality required to achieve your objectives
  3. Check the current data and identify issues
  4. Prioritize and fix

Next, let's look at some points to consider when cleansing data.

Setting cleansing targets and analyzing data quality

First, clarify which data needs to be corrected, for what purpose, and to what extent. It is efficient to focus on text that mixes full-width and half-width characters, and on fields where date and address formats are inconsistent. When analyzing data quality, express the current state of the data in numbers and indicators (for example, the share of blank fields or the number of duplicate records) so that you can identify problems and set appropriate priorities.

For example, in the case of customer information, it is a good idea to check for duplicate records and for fields with many variations in notation. If the purpose of using the data is clear, it will be easier to prioritize the areas that need to be cleaned, allowing you to optimize the allocation of resources.
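As a rough illustration of expressing the current data state in numbers, the following Python sketch computes a missing rate and a duplicate count. The sample records, field names, and the helper names `missing_rate` and `duplicate_count` are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter

# Hypothetical customer records; None or "" marks a missing value.
records = [
    {"company": "Example Trading", "phone": "03-1234-5678", "postal_code": "100-0001"},
    {"company": "Example Trading", "phone": "03-1234-5678", "postal_code": "100-0001"},
    {"company": "Sample Foods", "phone": "", "postal_code": "1500002"},
    {"company": "Demo Systems", "phone": None, "postal_code": "530-0001"},
]

def missing_rate(rows, field):
    """Share of rows in which the field is empty or absent."""
    missing = sum(1 for row in rows if not row.get(field))
    return missing / len(rows)

def duplicate_count(rows, fields):
    """Number of rows that repeat an earlier row on the given key fields."""
    counts = Counter(tuple(row.get(f) for f in fields) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)

report = {
    "phone_missing_rate": missing_rate(records, "phone"),
    "duplicate_rows": duplicate_count(records, ["company", "phone"]),
}
print(report)
```

Figures like these make it easier to compare fields and decide which ones to clean first.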

Setting accurate data quality standards

Setting indicators for data accuracy and completeness in advance makes the direction of cleansing more concrete. For example, things will go more smoothly if you clearly state the required input conditions for each item, the acceptable format, and character limits, and decide on a policy for correcting or supplementing any deviations.

However, having too many indicators can make the process complicated, so you should also consider ease of operation, such as prioritizing the implementation of indicators that are directly linked to business use.
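One way to keep such standards easy to operate is to write them down as data and check records against them. The sketch below is a minimal example under assumed standards: the field names, regular expressions, and length limits in `FIELD_STANDARDS` are hypothetical and would need to match your own rules.

```python
import re

# Hypothetical standards: required flag, allowed pattern, and maximum length per field.
FIELD_STANDARDS = {
    "company": {"required": True, "pattern": None, "max_length": 100},
    "postal_code": {"required": True, "pattern": r"\d{3}-\d{4}", "max_length": 8},
    "phone": {"required": False, "pattern": r"\d{2,4}-\d{2,4}-\d{4}", "max_length": 13},
}

def check_record(record, standards=FIELD_STANDARDS):
    """Return a list of (field, problem) pairs for one record."""
    problems = []
    for field, rule in standards.items():
        value = record.get(field) or ""
        if not value:
            if rule["required"]:
                problems.append((field, "missing"))
            continue
        if len(value) > rule["max_length"]:
            problems.append((field, "too long"))
        if rule["pattern"] and not re.fullmatch(rule["pattern"], value):
            problems.append((field, "bad format"))
    return problems

print(check_record({"company": "Example Trading", "postal_code": "1000001", "phone": ""}))
```

Starting with a handful of fields that matter for the business, as in this table, keeps the number of indicators manageable.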

Establish data validation rules

By clearly defining input rules, you can minimize the risk of inappropriate and unnecessary data being entered. For example, it is important to set detailed rules, such as splitting the address into separate fields for prefecture and municipality (city, ward, town, or village), and standardizing dates on the Gregorian calendar.

Such rules will also contribute greatly to ensuring consistency later on. Apply them not only to databases and management systems, but also to other data entry points such as Excel sheets and questionnaire forms.
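As a rough illustration of standardizing dates on the Gregorian calendar, the sketch below converts a few common notations, including Japanese era dates, to YYYY-MM-DD. The function name `to_gregorian_iso`, the accepted patterns, and the choice of ISO 8601 as the target format are assumptions for this example. Only the Reiwa, Heisei, and Showa eras are handled, and era boundaries are not validated.

```python
import re
from datetime import date

# Offset added to an era year to get the Gregorian year (era year 1 is the first year).
ERA_OFFSETS = {"令和": 2018, "平成": 1988, "昭和": 1925}

def to_gregorian_iso(text):
    """Convert a date string to YYYY-MM-DD, or return None if it is not recognized."""
    text = text.strip()
    era = re.fullmatch(r"(令和|平成|昭和)(\d{1,2}|元)年(\d{1,2})月(\d{1,2})日", text)
    if era:
        era_year = 1 if era.group(2) == "元" else int(era.group(2))
        year = ERA_OFFSETS[era.group(1)] + era_year
        month, day = int(era.group(3)), int(era.group(4))
    else:
        western = re.fullmatch(r"(\d{4})[/.\-年](\d{1,2})[/.\-月](\d{1,2})日?", text)
        if not western:
            return None
        year, month, day = (int(g) for g in western.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # impossible dates such as month 13
        return None

for raw in ["令和6年5月1日", "2024/5/1", "平成元年1月8日", "next week"]:
    print(raw, "->", to_gregorian_iso(raw))
```

Values that come back as `None` are the ones to send back to the person who entered them, rather than guessing.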

Building a data cleansing workflow

Data cleansing is not a one-time process; quality can be maintained by performing it regularly. Design a system that allows your organization to continue working on the process by assigning personnel, selecting tools to use, and standardizing work procedures.

Specifically, we recommend establishing a process in which the data to be cleansed is extracted monthly or quarterly, corrected according to the rules, given a final check, and then stored and passed on for analysis.
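The cycle described above (extract, correct, final check, store) can be sketched as a small set of functions. Everything here is a simplified assumption: the `updated` field, the whitespace-only correction rule, and the release criterion in `final_check` stand in for whatever your own procedures define.

```python
def extract(rows, since):
    """Pick rows updated on or after `since` (ISO date strings compare correctly)."""
    return [row for row in rows if row["updated"] >= since]

def correct(row):
    """Apply rule-based fixes; whitespace trimming is used here as a stand-in."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in row.items()}

def final_check(row):
    """Return True if the row passes the (hypothetical) release criterion."""
    return bool(row.get("company"))

def run_cycle(rows, since):
    """One cleansing cycle: rows that pass are stored, the rest are held back."""
    cleaned = [correct(row) for row in extract(rows, since)]
    stored = [row for row in cleaned if final_check(row)]
    held_back = [row for row in cleaned if not final_check(row)]
    return stored, held_back

rows = [
    {"company": " Example Trading ", "updated": "2024-06-10"},
    {"company": "", "updated": "2024-06-12"},
    {"company": "Old Record", "updated": "2024-01-05"},
]
stored, held_back = run_cycle(rows, since="2024-06-01")
print(len(stored), "stored,", len(held_back), "held back for review")
```

Keeping each stage as a separate function makes it easier to assign ownership and to swap in a different tool for one stage later.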

Specific steps for data cleansing

Next, we'll provide step-by-step instructions and tips for efficient cleansing.

In data cleansing work, it is effective to clarify the target items and then separate the parts that can be corrected mechanically from the parts that require human judgment. Expanding the scope that tools and scripts can handle significantly reduces the workload. On the other hand, typographical errors and inconsistencies that only a person would notice must ultimately be corrected through visual confirmation.

Data cleansing target items

For example, customer names, addresses, and phone numbers are items that are prone to formatting errors. Data aggregated from multiple systems often follows inconsistent input rules, so discrepancies such as full-width versus half-width characters and katakana versus hiragana notation tend to accumulate. Prioritizing the review of such items will improve the efficiency of the overall cleansing process.

Also, handling unnecessary data and irrelevant items will increase work costs, so it is best to start with items that are directly related to analysis and business decision-making.

Inconsistent notation (standardizing notation variants)

  • Company name: Unifying legal-entity designations, such as "Kabushiki Kaisha" written out in full (株式会社) versus abbreviated forms like "(株)", "㈱", or "カ)"
  • Personal name: How the family name and given name are separated (full-width space, half-width space, separate cells, etc.)
  • Address: Postal code hyphens, street address notation (full-width numbers, half-width numbers, kanji numerals), etc.
  • Phone number: Area code notation, whether or not to use hyphens, etc.
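Several of the notation variants listed above can be unified with Unicode normalization. The sketch below uses NFKC normalization from Python's standard `unicodedata` module, which converts full-width alphanumerics and spaces to half-width and expands "㈱" to "(株)". The function names and the decision to spell out every abbreviation as 株式会社 are assumptions for illustration; your own rules may prefer a different target form.

```python
import re
import unicodedata

def normalize_company(name):
    """Unify character width and expand the abbreviations (株) and ㈱ to 株式会社."""
    text = unicodedata.normalize("NFKC", name)        # ㈱ becomes (株); full-width becomes half-width
    text = text.replace("(株)", "株式会社")
    text = re.sub(r"\s*株式会社\s*", "株式会社", text)  # drop spaces around the designation
    return text.strip()

def normalize_person_name(name):
    """Separate name parts with exactly one half-width space."""
    return " ".join(unicodedata.normalize("NFKC", name).split())

def normalize_postal_code(value):
    """Return a postal code as NNN-NNNN, or None if it does not contain 7 digits."""
    digits = re.sub(r"\D", "", unicodedata.normalize("NFKC", value))
    return f"{digits[:3]}-{digits[3:]}" if len(digits) == 7 else None

print(normalize_company("㈱サンプル"))
print(normalize_person_name("山田　太郎"))
print(normalize_postal_code("１００００００１"[:7]))
```

Note that NFKC also converts half-width katakana to full-width, so check that this matches your notation rules before applying it to every column.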

Missing data (filling in partially missing data)

  • Missing parts of company or personal information: company name, legal-entity designation, personal name, job title, etc.
  • Missing data required for specific purposes: listing status, number of employees, sales, industry, product purchase history, customer inflow channels, etc.

Input errors (correcting typos and incorrect information)

  • Company name: Alphabet versus katakana spelling, small kana (yōon), presence or absence of middle dots, etc.
  • Legal entity: Position of the designation (before or after the name), type of entity (joint-stock company, limited liability company, limited partnership company, general partnership company, NPO corporation, general incorporated association, general incorporated foundation), etc.
  • Address: Responding to name changes due to municipal mergers, etc.
  • Telephone numbers: Addressing area code changes, etc.

Mechanical data cleansing

Mechanical data cleansing is a method of automatically correcting errors and merging duplicates using predefined rules and scripts. For example, you can convert full-width characters to half-width characters, or automatically normalize predefined patterns using Excel functions or Python libraries. When dealing with large amounts of data, automating the process can significantly reduce the amount of work required.

However, full automation requires rules that are set up in advance with great care, and some manual checking remains essential when the nonconforming data contains many unexpected patterns.
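The split between what a rule can fix and what still needs a person can be built into the script itself. The following sketch normalizes phone numbers that fit a known pattern and routes everything else to a review list. The pattern in `PHONE_PATTERN` and the function names are assumptions for this example; in particular, a number with no separators is flagged rather than split, because the script cannot know where the area code ends.

```python
import re
import unicodedata

# Assumed target format: digits separated by hyphens, starting with 0.
PHONE_PATTERN = re.compile(r"0\d{1,4}-\d{1,4}-\d{4}")

def clean_phone(raw):
    """Return (value, needs_review). Unknown patterns are left untouched and flagged."""
    text = unicodedata.normalize("NFKC", raw).strip()   # full-width to half-width
    text = re.sub(r"[()\s]+", "-", text).strip("-")     # 03(1234)5678 becomes 03-1234-5678
    if PHONE_PATTERN.fullmatch(text):
        return text, False
    return raw, True

def clean_batch(values):
    """Split a batch into mechanically cleaned values and a manual review queue."""
    cleaned, review_queue = [], []
    for raw in values:
        value, needs_review = clean_phone(raw)
        (review_queue if needs_review else cleaned).append(value)
    return cleaned, review_queue

cleaned, review_queue = clean_batch(["０３（１２３４）５６７８", "0312345678", "03-1234-5678"])
print(cleaned)
print(review_queue)
```

The size of the review queue is also a useful signal: if it keeps growing, the rules probably need another pattern.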

Use of automation and dedicated tools

Using ETL tools and AI technology, you can go beyond simple normalization and automatically perform complex pattern recognition and name matching. AWS Glue and Talend, among others, support the conversion and cleansing of large amounts of data, and introducing them will further reduce manual work.

The key is to consider the optimal level of automation, such as Excel macros or cloud services, based on the size of the data and the skill set of your team.

Manual data cleansing

Human eyes and judgment are essential to find special cases that tools and scripts cannot cover, and subtle variations in notation. By supplementing areas that are easily overlooked by mechanical processing, the reliability of the data can be greatly increased.

At the same time, manual checks are costly and time-consuming, so it is important to clearly define the scope and frequency of regular checks and establish a system that allows them to be carried out without difficulty.

How to detect and correct spelling variations and duplicate data

For example, if there are variations in the spelling of a company name (such as whether "joint-stock company" is included) or if postal codes are duplicated or missing, it is effective to list the affected records and correct them all at once. Common approaches include using filters in Excel to edit multiple rows at once, or using SQL to search for similar values.

Spelling variations that are left uncorrected can affect a company's credibility, so establish correction rules and ensure consistent management.
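To build the list of candidates for review, one option is to score how similar two names are. The sketch below uses `difflib.SequenceMatcher` from Python's standard library. The threshold of 0.85 and the sample names are arbitrary assumptions; a suitable threshold depends on your data and should be tuned against records you have already checked by hand.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.85):
    """List pairs of distinct names whose similarity ratio is at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

names = ["Example Trading Co., Ltd.", "Example Trading Co.,Ltd", "Sample Foods Inc."]
for a, b, ratio in similar_pairs(names):
    print(f"{ratio}: {a!r} <-> {b!r}")
```

The output is a list of candidates, not a list of confirmed duplicates. Two different companies can have very similar names, so a person should still decide which pairs to merge.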

Missing value imputation methods

Missing values require measures tailored to the purpose, such as referencing other sources to fill in estimated values, or setting default values if imputation is difficult. For example, if the address field in customer data is blank, filling it in at least down to the municipality (city, ward, town, or village) makes it easier to maintain the accuracy of area analysis.

Additionally, if the number of missing important items exceeds the acceptable range, it is important to consider long-term measures such as reviewing the data entry process or the system itself.
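The two approaches above (fill from another source, otherwise fall back to a default) can be combined in a few lines. In this sketch the lookup table `POSTAL_TO_CITY`, the default marker, and the 20% acceptable limit are all hypothetical values chosen for illustration. The `city_imputed` flag records which values were filled in, so that analysts can tell them apart from values that were entered directly.

```python
# Hypothetical reference table mapping postal codes to municipalities.
POSTAL_TO_CITY = {"100-0001": "Chiyoda-ku, Tokyo", "530-0001": "Kita-ku, Osaka"}
DEFAULT_CITY = "UNKNOWN"
ACCEPTABLE_MISSING_RATIO = 0.2  # assumed limit before the entry process is reviewed

def impute_city(record, lookup=POSTAL_TO_CITY):
    """Fill a blank city from the postal code, falling back to a default marker."""
    filled = dict(record)
    if filled.get("city"):
        filled["city_imputed"] = False
    else:
        filled["city"] = lookup.get(filled.get("postal_code")) or DEFAULT_CITY
        filled["city_imputed"] = True
    return filled

records = [
    {"postal_code": "100-0001", "city": ""},
    {"postal_code": "999-9999", "city": None},
    {"postal_code": "530-0001", "city": "Kita-ku, Osaka"},
]
result = [impute_city(r) for r in records]
unresolved = sum(1 for r in result if r["city"] == DEFAULT_CITY) / len(result)
print(result)
print("review entry process:", unresolved > ACCEPTABLE_MISSING_RATIO)
```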

Organizing data in Excel and using functions

We will explain the key points for formatting and managing data using Excel, a familiar tool.

Excel is frequently used in business settings, making it a convenient tool for quickly starting data cleansing. For example, by using VLOOKUP, or INDEX combined with MATCH, to match records against master information in another sheet, you can automatically fill in missing values and correct clerical errors more efficiently.

It's also effective to use conditional formatting to highlight outliers and duplicate data, making it easier to spot errors visually. By combining this with manual work, you can accurately clean data that requires detailed checking.

Excel functions that can be used in data cleansing

Generally, dedicated tools are used for data cleansing, but when the amount of data is small or a large proportion of the work is done by human eyes, the versatile and easy-to-use Excel can be used. Below we introduce Excel functions that can be used to efficiently organize data and perform name matching.

COUNTIF: Finding duplicates in data

The COUNTIF function counts the cells in a specified range that meet a given criterion, which makes it useful for checking for duplicates during cleansing. Excel's "Remove Duplicates" feature can delete duplicate company or personal names, but it is risky when different companies share the same name, because distinct records may be deleted. In such cases, use COUNTIF to flag candidate duplicates first and check them visually before deleting anything.

[=COUNTIF(range, search criteria)]
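In Excel, a typical way to apply this is a helper column such as =COUNTIF($A$2:$A$100, A2), where any value greater than 1 marks a candidate duplicate (the cell range is only an example). For readers who script the same check, a rough Python counterpart is sketched below. The sample column is hypothetical, and `casefold()` is used to approximate COUNTIF's case-insensitive matching.

```python
from collections import Counter

# Hypothetical contents of a company name column.
company_column = ["Example Trading", "Sample Foods", "example trading", "Demo Systems"]

# Count each name once, ignoring case, then look the count up for every row.
counts = Counter(name.casefold() for name in company_column)
helper_column = [counts[name.casefold()] for name in company_column]

# Rows with a count above 1 are candidates to check by eye, not to delete blindly.
rows_to_review = [index for index, n in enumerate(helper_column) if n > 1]
print(helper_column)
print(rows_to_review)
```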

VLOOKUP: Finding corresponding values from specific data

The VLOOKUP function looks up a search value in the leftmost column of a specified range and returns the value from another column in the same row, which saves you from editing each item manually. It is useful for processing large amounts of data efficiently, such as retrieving company names and contact names from customer codes, or ingredient names and prices from product names. Set the search type to FALSE when you need an exact match.

[=VLOOKUP(search value, range, column number, search type)]
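The same exact-match lookup can be sketched in Python with a dictionary keyed by the search value. The master table, customer codes, and the `lookup` helper below are hypothetical. Where VLOOKUP would show #N/A for a code that is not in the master, this sketch returns a default value instead.

```python
# Hypothetical master table keyed by customer code.
CUSTOMER_MASTER = {
    "C001": {"company": "Example Trading", "contact": "Taro Yamada"},
    "C002": {"company": "Sample Foods", "contact": "Hanako Sato"},
}

def lookup(code, column, master=CUSTOMER_MASTER, default=None):
    """Exact-match lookup of one column for a customer code."""
    row = master.get(code)
    return row[column] if row else default

orders = [{"code": "C002"}, {"code": "C999"}]
for order in orders:
    order["company"] = lookup(order["code"], "company", default="(not in master)")
print(orders)
```

Codes that fall back to the default are worth listing separately, since they usually point to either a typo in the code or a gap in the master data.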

TRIM: Remove unnecessary space

TRIM removes leading and trailing spaces from the text in the specified cell and reduces runs of spaces between words to a single space. This is useful when tidying personal names, for example, when you want to keep only one space between the first and last name and delete all other spaces.

[=TRIM(specified cell)]

CLEAN: Remove unnecessary line breaks

This function removes non-printable control characters, including line breaks, from the text in the specified cell and displays the result. This is a useful function for organizing free-form fields such as notes about customer information or survey responses.

[=CLEAN(specified cell)]
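In Excel the two functions are often nested, as in =TRIM(CLEAN(A2)). For scripted cleansing, the sketch below approximates both in Python. The function names are assumptions, and the behavior is simplified: `excel_like_trim` handles only the ordinary half-width space, and `excel_like_clean` removes characters with code points 0 to 31.

```python
import re

def excel_like_trim(text):
    """Strip leading and trailing spaces and collapse runs of spaces to one."""
    return re.sub(r" +", " ", text.strip(" "))

def excel_like_clean(text):
    """Remove control characters with code points 0 to 31, which include line breaks."""
    return "".join(ch for ch in text if ord(ch) >= 32)

note = "  Prefers  email\ncontact \r\n"
print(repr(excel_like_trim(excel_like_clean(note))))
```

Because removing a line break joins the neighboring words directly, you may prefer to replace line breaks with a space before cleaning free-form notes.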

For effective use of data

To ensure that the cleansed data is useful, continuous management and improvement are required.

Data cleansing is an important initiative that supports the foundation of business decision-making and customer understanding. The real work begins after cleansing is complete, and it is essential to create a system for putting the cleansed data to use.

Creating effective rules and maintaining consistency

It is important to prepare data entry guidelines and manuals to minimize inconsistencies between systems and personnel. At the same time, logging the update history lets you track who changed what and when, thereby building a foundation for continuous quality control.
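As a minimal sketch of logging the update history, the function below records the time, the editor, and the old and new values each time a field is changed. The log structure, field names, and the in-memory list are assumptions for illustration; in practice the entries would be written to a database table or file.

```python
from datetime import datetime, timezone

change_log = []

def apply_change(record, field, new_value, editor, log=change_log):
    """Update one field in place and append who, when, and what to the log."""
    old_value = record.get(field)
    if old_value == new_value:
        return record  # nothing changed, so nothing is logged
    log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "by": editor,
        "record_id": record.get("id"),
        "field": field,
        "old": old_value,
        "new": new_value,
    })
    record[field] = new_value
    return record

customer = {"id": 1, "phone": "0312345678"}
apply_change(customer, "phone", "03-1234-5678", editor="ogawa")
apply_change(customer, "phone", "03-1234-5678", editor="ogawa")  # no-op, not logged
print(change_log)
```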

During this process, by listening to feedback from the field and implementing a system to quickly resolve operational issues, it becomes easier to maintain consistency over the long term.

Regular visual inspection

There are also small errors that are difficult to detect through automated system checks alone, so it is important to make corrections based on feedback from the field where the data is being used. Variations in quality can be reduced by regularly checking by eye, in particular, important data and new data imported from outside.

Through steady repetition of these operations, high-quality data will accumulate throughout the company, leading to improved analytical capabilities and customer response capabilities.

Summary

Data cleansing is an essential process for improving the quality of databases and making them useful for business decisions, planning sales strategies, etc. It is important not only to perform cleansing once, but also to continuously manage quality and respond promptly when problems arise.

By utilizing accurate data, we can expect to achieve beneficial results in many areas, including the future use of generative AI, the discovery of new business opportunities, and improved customer satisfaction.

Why not try data cleansing to grow and strengthen your business?

The person who wrote the article

Affiliation: Marketing Department

Yumi Ogawa

After two years of experience as a copywriter at an advertising agency, she has been working in the IT industry ever since. Her experience at a variety of companies, from B2C to B2B, and from Japanese ventures to major foreign companies, is her strength. She has consistently worked in a variety of marketing-related roles, including public relations, branding, product marketing, and campaign management, and has been in her current position since May 2024. In her private life, she loves interacting with nature, hot springs, and public baths.
(Affiliations are as of the time of publication)
