Systematically organize the causes and solutions for data inconsistencies

Data inconsistency is not a "random accident." It is a reproducible failure closely tied to modern system architectures: concurrent updates, distributed processing, caching, and asynchronous processing.
In this article, we organize typical patterns of data inconsistency, their design and implementation causes, common failure examples, detection and repair techniques (Write Repair / Async Repair), and a prevention checklist, offering a perspective for distinguishing "inconsistencies that can be prevented" from "inconsistencies that must be designed for."

What is data inconsistency?

An "inconsistency" is a discrepancy between the expected correct state and the state actually saved.
For example, when an order succeeds, inventory is reduced, the payment is recorded, and the order appears in the purchase history: that is the complete state. If any of these is missing, or the correspondence with related tables or other systems is broken, that is an inconsistency.
Symptoms vary.

  • The screen display is incorrect
  • The totals don't match
  • Double billing occurs
  • Inventory goes negative
  • The same operation is reflected twice

A key characteristic of these failures is that they often do not surface as errors. A response may report success while the write failed behind the scenes, or a reported failure may in fact have been partially written.
The impact goes beyond a poor user experience, extending to increased time spent handling refunds and investigations, legal and accounting risks, and even poor decision-making due to unreliable metrics.

What's even more troublesome is that an inconsistency can destroy the assumptions of other processes, causing a chain reaction of subsequent inconsistencies. This is why it's important to design and operate in a way that prevents damage from spreading early and confines the scope of the impact.

Typical patterns of inconsistency

Inconsistencies fall into recognizable patterns. If you can classify the phenomenon, you can quickly identify the cause and choose a countermeasure.
Start by checking what exactly is wrong:

  • Is a single record wrong, or an aggregate?
  • Is only one side of a paired update missing?
  • Are you seeing a stale value?
  • Is something reflected twice?

Typical patterns can be summarized into the following five categories:

1. Simultaneous updates and competition (race conditions)

When multiple requests simultaneously update shared data, overwrites and drops occur.
A typical example is the "read the current value, calculate, and update" sequence running in parallel: the last writer wins and the earlier update is erased.
This happens when there is no conflict detection via locks or optimistic locking. In a distributed environment, concurrency increases, and the problem rarely shows up in unit tests.

Countermeasures include appropriate isolation levels, row locking, optimistic locking, and serialization through queues. Furthermore, designing operations to be idempotent, so that they converge to the correct result even when transactions run simultaneously, removes the root cause of many of these accidents.
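As a minimal sketch of the optimistic-locking countermeasure, the following Python example uses an in-memory store with a version number per record (the store, key names, and retry count are all illustrative, not any particular database's API). The conditional write plays the role of `UPDATE ... WHERE version = ?`: if another writer got in first, the update is retried against the fresh value instead of silently overwriting it.

```python
import threading

class VersionConflict(Exception):
    """Raised when the record changed between read and write."""

# In-memory store: key -> (value, version). store_lock simulates the
# database's atomic compare-and-set; the read-modify-write race between
# those two atomic steps is exactly what optimistic locking detects.
store = {"stock:42": (10, 0)}
store_lock = threading.Lock()

def read(key):
    with store_lock:
        return store[key]  # returns (value, version)

def write_if_version(key, new_value, expected_version):
    """Atomic conditional update, like UPDATE ... SET version = version + 1
    WHERE id = ? AND version = ?."""
    with store_lock:
        _, version = store[key]
        if version != expected_version:
            raise VersionConflict(key)
        store[key] = (new_value, version + 1)

def decrement_stock(key, amount, retries=3):
    """Read-modify-write that retries on conflict instead of losing updates."""
    for _ in range(retries):
        value, version = read(key)
        try:
            write_if_version(key, value - amount, version)
            return
        except VersionConflict:
            continue  # another writer won; re-read and retry with fresh state
    raise RuntimeError("gave up after repeated conflicts")
```

With a plain `store[key] = value - amount`, two concurrent decrements could both read 10 and one update would be lost; with the version check, the loser retries and both decrements survive.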

2. Partial updates/missing updates (undefined/NULL mixed in)

This can be caused by a failure partway through a multi-field update, or by an unintended overwrite through a partial-update API. In practice, a common case is undefined or NULL values mixed into the update payload causing an error on the storage side; if that failure is swallowed, the result is an inconsistency that looks like a success.

The solution is to give "unspecified" and NULL explicit, distinct meanings, normalize the input accordingly, and stop bad data at the database with NOT NULL or CHECK constraints. Partial updates are convenient, but when the update contract is unclear they become a breeding ground for inconsistency.
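One possible update contract, sketched in Python: keys absent from the payload are left unchanged, and keys explicitly set to `None` clear the field, with a set of fields that may never be cleared playing the role of NOT NULL. The field names and the `REQUIRED` set are hypothetical.

```python
# Fields that may never be cleared (the NOT NULL analogue).
REQUIRED = {"email"}

def apply_partial_update(record: dict, payload: dict) -> dict:
    """Contract: absent key -> unchanged; explicit None -> delete the field;
    any other value -> overwrite. Returns a new dict, leaving record as-is."""
    updated = dict(record)
    for key, value in payload.items():
        if value is None:
            if key in REQUIRED:
                raise ValueError(f"{key} may not be cleared")
            updated.pop(key, None)   # explicit null: remove the field
        else:
            updated[key] = value     # present: overwrite
    # Keys absent from the payload are untouched by construction.
    return updated
```

Writing the contract down as code like this, and rejecting payloads that violate it, prevents the "undefined slipped into the update object" class of failure before it reaches storage.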

3. Synchronization issues between multiple systems

When multiple databases or external SaaS are involved, it is inevitable that sometimes only one side of an update succeeds. Retries can then cause duplicate processing. Distributed systems must be designed with these mismatches in mind.

The basic countermeasures are idempotency and explicit state transitions. With idempotency keys, state machines, and sagas, you record how far each process has progressed and can resume from where it left off.
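The idempotency-key idea can be sketched as follows; `processed` and `charge()` are illustrative stand-ins for a database table (with a unique constraint on the key) and an external API call, not any real library.

```python
processed: dict[str, str] = {}   # idempotency key -> stored result
charge_calls: list[str] = []     # records side effects, for illustration

def charge(order_id: str) -> str:
    """Stand-in for an external payment API call (the side effect)."""
    charge_calls.append(order_id)
    return f"charged:{order_id}"

def handle_request(idempotency_key: str, order_id: str) -> str:
    """A retry with the same key returns the stored result instead of
    charging again, so at-least-once delivery does not double-bill."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge(order_id)
    processed[idempotency_key] = result
    return result
```

In a real system the check-and-insert on the key must itself be atomic (e.g. enforced by a unique constraint), otherwise two concurrent retries could both pass the `in processed` check.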

4. Cache and DB inconsistency

This is caused by cache-first reads combined with missed invalidations. A design that relies only on TTL implicitly accepts that stale values will be served until expiry.

The countermeasures are to never treat the cache as the source of truth, to unify the update path, and to track metrics that surface missed invalidations.
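A minimal cache-aside sketch in Python, with dicts standing in for the database and the cache (the names are illustrative): reads fall through to the database and populate the cache, while writes update the source of truth first and then delete the cache entry rather than writing it.

```python
# The db dict is the source of truth; the cache is only ever a copy.
db = {"user:1": "Alice"}
cache: dict[str, str] = {}

def read(key: str) -> str:
    if key in cache:
        return cache[key]
    value = db[key]      # cache miss: fall through to the source of truth
    cache[key] = value   # repopulate for subsequent reads
    return value

def write(key: str, value: str) -> None:
    db[key] = value        # update the source of truth first
    cache.pop(key, None)   # invalidate; never write the cache directly
```

Deleting on write instead of updating the cache narrows the window for stale values; the remaining race (a read repopulating between the db write and the delete) is the kind of residual inconsistency the metrics mentioned above should watch for.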

5. Asynchronous Processing and Event-Driven Gap

Delays, out-of-order delivery, and duplicate delivery are inevitable.
Exactly-once processing cannot be achieved by the infrastructure alone: idempotent handling on the application side, duplicate detection, and per-key ordering are essential.

Design and implementation causes of inconsistencies

Inconsistencies may look like a data-only problem, but the root cause is usually a wrong design or implementation assumption, or a missing safeguard. Breaking the cause down by layer identifies measures that directly prevent recurrence.
Even for the same inconsistency, asking where it should have been stopped divides responsibility among input boundaries, application logic, database constraints, transaction boundaries, and external connections. The key is not to stop at recovery, but to decide at which layer it is cheapest and most reliable to stop the problem.

In the field, decisions like "validation is done on the front end, so it's fine" or "the app checks it, so no constraint is needed" are often the starting point of accidents. Input arrives through multiple routes (admin screens, batches, external integrations), application checks are easily missed, and specification changes break assumptions.
The basis of recurrence prevention is defense in depth: boundary validation, database constraints, transactions and idempotence, and monitoring and repair, so that whatever slips through one layer is stopped at the next.

Lack of validation (input, update, boundary)

Lack of validation not only lets incorrect data keep entering the system, but also leads to "proper updates not being performed" and "state transitions being broken." In particular, if validation granularity differs at each point where data enters the system (the API boundary, service boundary, batch boundary), inconsistencies easily recur.
Partial updates are dangerous when the meanings of "unspecified", undefined, NULL, empty string, and 0 are mixed. Write the contract down explicitly, for example "unspecified leaves the field unchanged, NULL deletes it", and normalize input accordingly. Even in typed languages, types like Partial<T> allow undefined values to slip in, so effective countermeasures include a dedicated update DTO and removing undefined values on the server side before saving.

Domain invariants are also subject to validation. Constraints such as "stock never falls below 0" or "an order is never shipped before payment" cannot be maintained by the screen alone. State transitions must be controlled strictly on the server side, and it must be decided where each invariant is guaranteed, including exception routes.
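Server-side state-transition control can be as simple as an explicit allowed-transition table; the states and transitions below are illustrative, not a complete order lifecycle.

```python
# Allowed transitions per state; anything not listed is rejected.
ALLOWED = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),  # terminal in this sketch
}

def transition(current: str, nxt: str) -> str:
    """Enforce the state machine: e.g. an order cannot ship before payment."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Centralizing the table like this means every route (screen, batch, external integration) is checked against the same invariant, instead of each caller re-implementing the rule.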

Insufficient constraints (primary key, foreign key, unique, not null)

In an environment with weak database constraints, data that should never exist quietly accumulates, and by the time problems surface the cost of repair has ballooned. Application checks easily develop loopholes as specifications change and exception handling is added, so the database should be designed to act as the last line of defense.

The typical moves are as follows:

  • Unique constraints physically prevent duplicates from being created
  • Foreign keys enforce referential integrity
  • NOT NULL and CHECK constraints prevent missing or out-of-range values

The unique constraint is particularly important as it forms the basis of idempotent design and structurally prevents duplicate registrations even when retries or multiple requests occur.
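The three kinds of constraint can be demonstrated with SQLite from Python; the table and column names are illustrative. The unique constraint on the idempotency key is what structurally rejects a retried request, independent of any application check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        idempotency_key TEXT NOT NULL UNIQUE,   -- blocks duplicate submits
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
""")
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute(
    "INSERT INTO orders (customer_id, idempotency_key, quantity) VALUES (?, ?, ?)",
    (1, "req-123", 2),
)

# A retried request with the same idempotency key is rejected by the
# UNIQUE constraint rather than silently creating a second order.
try:
    conn.execute(
        "INSERT INTO orders (customer_id, idempotency_key, quantity) VALUES (?, ?, ?)",
        (1, "req-123", 2),
    )
except sqlite3.IntegrityError:
    pass  # duplicate structurally prevented
```

Even if every application-side check is bypassed, an order with a nonexistent customer, a missing key, a zero quantity, or a duplicate key can never be persisted.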

On the other hand, a constraint cannot be added retroactively if existing data violates it, so strengthening the design must go hand in hand with a gradual data-cleansing plan.
The role of constraints becomes clear when you think of them not as mere database settings, but as decisions that fix business rules in the persistence layer.

Transaction errors (including distributed transactions)

Even with a single database, unclear transaction boundaries lead to mid-process commits and split updates, quietly leaving states where only one side has been updated.

When updating multiple tables, define at the design stage how far the processing must be atomic, and confirm that the implementation rolls back if an exception occurs.
When working across multiple services or multiple databases, a general-purpose distributed transaction is often impractical to adopt.

  • Saga
  • State Machine Design
  • TCC

Instead, adopt a compensating design such as the above, and build a structure that can restore consistency on the assumption that failures will happen.
The keys here are retry design and idempotence. Only by classifying errors into retryable and non-retryable, and by absorbing duplicate processing with a unique key per request, can the weaknesses of cross-service transactions be compensated for.
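A minimal saga sketch in Python: each step is paired with a compensating action, and when a later step fails, the completed steps are undone in reverse order. The step functions are illustrative stand-ins for calls to separate services.

```python
def run_saga(steps):
    """steps: list of (action, compensate) pairs. On failure, runs the
    compensations for all completed steps in reverse order, then re-raises."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # undo completed steps, newest first
        raise
```

A real saga also needs durable progress records so compensation can resume after a crash, and each compensate step must itself be idempotent and retryable.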

Data inconsistency detection and repair (Write Repair/Async Repair)

Given that it is impossible to reduce inconsistencies to zero, it is important to have a system that can find them quickly and repair them safely. We use both Write Repair, which repairs synchronously, and Async Repair, which repairs after the fact.

The basis of detection is to translate the expected invariants into a form that can be checked mechanically. For example, you can make it possible to check conditions such as whether a reference always exists, whether the total value matches the sum of the details, and whether the state transition follows the rules using queries or batches. For monitoring, you can turn the number of consistency checks, number of repairs, number of retries for the same key, and suspected duplicate creation into metrics to quickly detect signs of a problem.

Write Repair prevents inconsistencies during the write itself or fixes them on the spot. Retrying is the most common technique, but you must distinguish retryable errors and prevent duplicate processing with idempotency keys or unique constraints. State management (a process ID and current state) lets partially completed work resume safely, which raises the Write Repair success rate.

Async Repair performs matching and repair after normal processing has finished: regular batches reconcile databases to make up shortfalls, and unfinished processes are re-executed to completion. Note that the batch itself runs at-least-once and may be started multiple times, so the repair process must also be idempotent, with locks and status checks so it does not conflict with and overwrite the main process.
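An Async Repair pass can be sketched as a reconciliation over the "total matches the sum of the details" invariant mentioned above; the order/line data structures are illustrative. The repair is idempotent by construction: a second run finds nothing to fix.

```python
# Header rows (the aggregate) and detail rows (the truth). The header
# total is deliberately wrong to simulate a missed update.
orders = {"o-1": {"total": 480}}
lines = {"o-1": [200, 150, 150]}

def reconcile():
    """Compare each header total against the sum of its lines and repair
    mismatches. Returns the ids that were repaired, for metrics."""
    repaired = []
    for order_id, header in orders.items():
        true_total = sum(lines.get(order_id, []))
        if header["total"] != true_total:
            header["total"] = true_total  # repair the aggregate
            repaired.append(order_id)
    return repaired
```

The length of the returned list is exactly the "number of repairs" metric the monitoring section calls for: a sudden rise signals a new source of inconsistency upstream.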

Practical Checklist

To ensure that nothing is overlooked in the design review, implementation, and operation phases, cause patterns and countermeasures are broken down into checklists that can be reused. A checklist is more effective if you evolve it by adding items each time a problem occurs, rather than creating it once and then being done with it.

Design

  • Is the area that requires strong consistency defined?
  • Is the Single Source of Truth clear?

Implementation

  • Are concurrent updates to the same data handled safely?
  • Is the update contract clear?
  • Are idempotency keys designed in?
  • Are transaction boundaries appropriate?

Operation

  • Are regular consistency checks performed?
  • Is the repair job idempotent?
  • Are there procedures to contain inconsistencies when they are found?

Reducing consistency risks with a data integration platform

As we have seen so far, many data inconsistencies arise from structures specific to distributed processing, such as "interconnection between multiple systems," "asynchronous processing," "retry," and "mismatched update sequences." While it is important to implement countermeasures on the individual application side, having a platform that can control data integration itself can significantly reduce the probability of inconsistencies occurring and the scope of their impact.

For example, by using a data integration platform like HULFT Square, you can manage API integration, file integration, event processing, and more in an integrated manner, centralizing integration logic that was previously scattered across systems. Because process status management, re-execution, error handling, schedule control, and more are handled as a common mechanism, you can reduce operational risks such as not knowing how far processing has progressed, or not noticing a failure along the way.

In particular, in operations that transfer data between multiple systems, maintaining consistency requires visualizing the processing flow and designing re-execution and supplementary processing for failures. By utilizing a data integration platform, you can consolidate the integration logic that was previously distributed across each system and build a structure that tracks processing status and execution history.
As a result, it becomes possible to control integrity risks at the integration layer, rather than implementing individual countermeasures in each application.

Summary

Data inconsistency is not a random accident, but a reproducible problem that inevitably arises from structures such as simultaneous updates, distributed processing, and asynchronous integration. Rather than aiming for zero, the practical and sustainable approach is to decide in advance, as part of the design, where problems are stopped and where they are caught.
Using this article as a reference, check whether you can concretely answer three questions: "Which data must never become inconsistent?", "At which layer is it protected?", and "If something does slip through, where is it detected and repaired?" Wherever the boundary is unclear, start by adding another layer of defense there.
On the premise that consistency is maintained by structure, not by attentiveness, clarifying the boundary of responsibility in even one place helps keep the next inconsistency small.

The person who wrote the article

Affiliation: Marketing Department

Yoko Tsushima

After joining Appresso (now Saison Technology), she worked as a technical sales representative, in charge of technical sales, training, and technical events. After leaving the company to return to her hometown, she rejoined in April 2023 under the remote work system. After gaining experience in the product planning department, she is currently in charge of creating digital content in the marketing department.
(Affiliations are as of the time of publication)
