What is CDC (Change Data Capture)? – A method for securely and reliably transporting data from on-premises systems.
While the use of generative AI is gaining attention, many companies are facing the practical challenge of how to move the vast amounts of data currently stored on-premises to the cloud. Development costs, system load, and data freshness are particularly important considerations. This article will discuss Change Data Capture (CDC) as an option to overcome these obstacles. We will explain an efficient data integration mechanism for those struggling with database specification changes and operational burdens.
The challenge of data export is crucial to the construction of a data warehouse.
The rise of generative AI is increasing the need to move internal company data to the cloud. This is because there is a growing trend to combine internal data with generative AI to quickly provide answers tailored to a company's specific business and conduct multifaceted analyses.
However, in many companies, valuable data that could provide clues for management and business decisions lies dormant in on-premises servers and mission-critical core systems that have been in operation for many years. When attempting to consolidate this data into a cloud data warehouse, the first major hurdle is how to transport the data safely and reliably.
The optimal method for data export varies depending on the nature and purpose of the data, but generally, the following three issues can be cited as challenges.
▼Learn more about data warehouses
⇒ DWH | Glossary
① Increase in development and operational costs
If you need to extract data from dozens or hundreds of tables, you must define and create an extraction script for each one, because table definitions and update timing differ for every system and table.
Furthermore, scripts that were painstakingly built can easily stop working when the database specification changes (for example, when a column is added). Each change triggers another round of modifying and retesting the scripts, so the maintenance work never ends.
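To illustrate, a typical hand-written extraction script for a single table might look like the sketch below. The table and column names (`orders`, `updated_at`) and the in-memory SQLite database are hypothetical stand-ins for a production system. A script like this must be written and maintained for every table, and it silently breaks or misses data when columns are renamed or the update-timestamp convention changes.

```python
import sqlite3

# Hypothetical source table; in practice this would be the production database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "2024-06-01"), (2, 250, "2024-06-03")],
)

def extract_changes(conn, last_run: str):
    """Query-based extraction: scan for rows whose update timestamp is newer
    than the last run. Every table needs its own variant of this query,
    and physically deleted rows are never detected."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run,),
    )
    return cur.fetchall()

print(extract_changes(conn, "2024-06-02"))  # only rows changed since the last run
```

Note that this approach also requires the source schema to carry an `updated_at`-style column in the first place, which is exactly the kind of design change log-based CDC avoids.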
② Concerns about system load
When it comes to corporate data, even a single table can contain a massive amount of data. Simply scanning all the data to identify differences in the database places a huge load on the database due to the sheer volume of data.
③ Data is not fresh enough
Often, the freshest data available is from yesterday's batch. In that case, it is difficult to meet needs such as analyzing data in real time to implement timely measures or to detect early signs of trouble.
Furthermore, even if there are no problems now, it is possible that in the future, the amount of data will increase too much, causing batches to fail to complete within the expected time.
CDC (Change Data Capture) as an option
A technology called CDC (Change Data Capture) is an effective solution to these challenges.
CDC (Change Data Capture) is a mechanism that detects only the changes made to a database, such as additions, updates, and deletions, and synchronizes them with another system.
Because only these deltas are shipped as each change occurs, data can be kept up to date in near real time without placing a load on the source system.
Comparison with the main types of CDC
There are several approaches to CDC. Understanding their characteristics and choosing the right one is important, but generally, log-based approaches are recommended for the following reasons:
- It does not load or modify the database: Since it only reads the transaction log associated with the database, it does not affect the processing of the database itself.
- Captures deletions reliably while maintaining high performance: Some methods fail to detect physical deletions, but log-based methods read the change history recorded in the log, so no change is missed.
- High immediacy: By reading the database's transaction log (a file that records the history of changes) as soon as new entries are written, changes can be synchronized in near real time.
| Comparison item | ① Query-based | ② Trigger-based | ③ Log-based (recommended) |
| --- | --- | --- | --- |
| Method | Periodically issues SQL queries and extracts differences by examining columns such as the last-update timestamp. | Uses the database's trigger function to write changes to a dedicated table whenever an update occurs. | Directly parses the transaction log output by the database and extracts the changes. |
| 1. Load on the database | High. A process close to a full scan runs, slowing down the production environment. | Medium to high. An extra process runs on every write, slowing the application's responsiveness. | Low. Logs are read outside the database engine, so the main system's processing power is unaffected. |
| 2. System modifications | Required. Database design changes are needed, such as adding an update-timestamp column or an extraction flag. | Required. Triggers must be configured on the database side, which grows more complex as the number of managed tables increases. | Not required. No changes to the database table structure are needed. |
| 3. Freshness and accuracy | Low. Physically deleted rows cannot be detected. | High (but traded off against load). | High. Physical deletions can also be reproduced. |
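The log-based idea can be sketched minimally as follows: the database appends every insert, update, and delete to its transaction log, and a separate reader replays those entries against a target without ever querying the source tables. The event format below is invented for illustration; real tools decode the database's own log format (for example, the MySQL binlog or PostgreSQL WAL).

```python
# Hypothetical change-log entries, as a CDC reader might decode them from
# the database's transaction log. The dict format is invented for illustration.
change_log = [
    {"op": "insert", "table": "orders", "key": 1, "row": {"id": 1, "amount": 100}},
    {"op": "update", "table": "orders", "key": 1, "row": {"id": 1, "amount": 120}},
    {"op": "delete", "table": "orders", "key": 1, "row": None},
]

def apply_changes(log, target: dict) -> dict:
    """Replay log entries against a target store in order. Because deletes
    appear in the log, they are reproduced too, unlike query-based CDC."""
    for entry in log:
        table = target.setdefault(entry["table"], {})
        if entry["op"] == "delete":
            table.pop(entry["key"], None)
        else:  # insert or update
            table[entry["key"]] = entry["row"]
    return target

print(apply_changes(change_log, {}))  # → {'orders': {}}  (the row was finally deleted)
```

The key point the table above makes is visible here: the reader only touches the log, never the source tables, so the source workload is untouched and even the final delete is faithfully reproduced.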
Log-based CDC, which enables fast and complete data synchronization while minimizing the impact on the database, is an effective technology for solving these problems. It satisfies both system operational stability and the business's data utilization needs.
Of course, CDC is not the best option in every case; choosing the right tool for the job is crucial for successful data integration.
ETL is suitable when personal information needs to be masked, individual data needs to be processed, or data needs to be compressed and transferred due to communication constraints. On the other hand, CDC is effective when low load and data freshness are required.
How will things change if we implement CDC tools?
Let's take a closer look at how implementing a log-based CDC tool actually changes on-site operations.
The three issues mentioned earlier will be dramatically improved by the implementation, as follows:
- Automated script creation: With simple operations such as entering connection information, you can set up synchronization for a vast number of tables all at once, reducing development time.
- Automatic tracking of structural changes: Even if columns are added or deleted in the database, the CDC tool detects this automatically and reflects it in the synchronization target, reducing operational effort.
- Performance optimization: Synchronization is performed with minimal traffic, minimizing the load on the network and the database.
- Improved data freshness: Updates are synchronized as soon as changes occur, so fresh data is always available.
- Metadata that enhances analytical value: Information such as when, by whom, and with what operation a row was changed is automatically added as columns. This helps analysts and developers, and it improves both the accuracy of analysis and the reliability of audits.
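As a concrete picture of that last point, a CDC tool might deliver each synchronized row with audit metadata attached, as in the sketch below. The column names (`_cdc_operation` and so on) are hypothetical; actual names vary by tool.

```python
# Hypothetical enrichment step: attach "when / who / what operation"
# metadata columns to each synchronized row, as many CDC tools do.
def enrich(row: dict, op: str, user: str, ts: str) -> dict:
    return {
        **row,
        "_cdc_operation": op,    # insert / update / delete
        "_cdc_changed_by": user, # who made the change
        "_cdc_changed_at": ts,   # when the change occurred
    }

record = enrich({"id": 7, "amount": 300}, "update", "app_user", "2024-06-03T10:15:00Z")
print(record["_cdc_operation"])  # → update
```

With such columns in the warehouse, an analyst can reconstruct the history of a record or trace who changed what, without any extra instrumentation on the source system.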
Thus, the ability to minimize the operational burden on the ground while improving data freshness and reliability is a major advantage of introducing CDC tools.
Conclusion
The important thing is to choose the most suitable method for your specific purpose.
To overcome challenges such as development and operational costs, system load, and data freshness, utilizing CDC is a very promising option.
This mechanism can be operated easily by on-site staff, keeps the production system securely protected, and provides a constant stream of the latest data. Establishing it will accelerate value creation by combining internal data with generative AI, forming the foundation for business growth.
If you have questions or concerns such as, "Is CDC suitable for our company's environment?", "What is the best option in the first place?", or "What are the specific implementation methods?", please feel free to contact us. We will support you in building an efficient data infrastructure to accelerate your AI utilization.
Saison Technology Online Consultation
If you would like to hear more about our data utilization platform, we also offer online consultations. Please feel free to contact us!
