What are the requirements for the data infrastructure required for RAG?

  • Data Utilization
  • Data Infrastructure
  • Generative AI

Last time, we introduced some methods for adapting generative AI to the realities of corporate activities.
This time, we will focus on one of the many approaches, "RAG (Retrieval-Augmented Generation)," and introduce the key points for developing the data infrastructure that supports the knowledge base required to realize RAG.

What is RAG (Retrieval-Augmented Generation)?

One approach that lets generative AI provide answers based on company-specific data is RAG. RAG stands for Retrieval-Augmented Generation, a method in which the generative AI produces answers by separating the retrieval and generation processes. First, a user submits a prompt (instructions for the generative AI), and a retrieval process is executed to search for the data needed to answer it. Then, based on the data obtained from that search, a generation process is executed and the final answer is output.
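The two-step flow described above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the knowledge base is a hard-coded list, `retrieve()` uses naive word overlap instead of a vector store, and `generate()` is a stub standing in for a call to a generative AI model.

```python
# Minimal sketch of the RAG flow: retrieval and generation are separate steps.
# In practice, retrieve() would query a vector store or search index, and
# generate() would call an LLM API with the retrieved context.

KNOWLEDGE_BASE = [
    "Our return policy allows refunds within 30 days.",
    "Support hours are 9:00-18:00 JST on weekdays.",
]

def retrieve(prompt: str, top_k: int = 1) -> list[str]:
    """Search step: rank documents by how many words they share with the prompt."""
    words = set(prompt.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(prompt: str, context: list[str]) -> str:
    """Generation step: a real system would prepend the retrieved context
    to the prompt and send it to a generative AI model."""
    return f"Answer based on: {context[0]} (question: {prompt})"

docs = retrieve("What are your support hours?")
answer = generate("What are your support hours?", docs)
```

The key point is the separation: the quality of `answer` depends directly on what `retrieve()` finds in the knowledge base, which is why the rest of this article focuses on the data behind it.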

The knowledge base (information source) that the search process searches plays a major role here. As the name suggests, a knowledge base is a place where knowledge resides. Knowledge in this context is company-specific information that generative AI should refer to.

▼I want to know more about RAG (Retrieval-Augmented Generation)
Retrieval Augmented Generation (RAG) | Glossary
▼I want to know more about generative AI
Generative AI | Glossary

The data infrastructure that supports the knowledge base

Companies hold information in a variety of forms, such as files and databases. Often, these information assets exist in a fragmented state across various departments and users within an organization. A fragmented state refers to a situation where there are many discrepancies and inconsistencies within and between datasets, resulting in a lack of unity and consistency. In this state, it is difficult for people to use the data, let alone for generative AI to use it as a knowledge base.

The key here is to develop a data infrastructure. A data infrastructure is a platform for integrating various data scattered both inside and outside the company and making it usable. Through the data infrastructure, scattered data is collected and accumulated, making it usable as a knowledge base.
There are many points to consider when developing such a data infrastructure, but this time I will focus on three: 1) data quality, 2) data reliability, and 3) data integration.

What is data quality?

Data quality is an indicator that measures how appropriate the data is for the purpose. When introducing RAG, we often face the challenge of low-quality answers from generative AI. There are various possible reasons for this, but one of them is low-quality data. If the quality of the information used as input for generative AI to answer questions is low, the quality of the output will naturally also be low.

So what exactly is quality? There are various aspects to it, but this time I would like to delve into three: accuracy, completeness, and recency.

Accuracy refers to whether the data is correct. For example, it evaluates whether there are errors in content or format, such as a postal code entered in an address field, or a string like "unknown" in a field that should contain a number. Naturally, if the data entered is incorrect, any output based on that data will also be incorrect.

Completeness refers to whether all necessary fields are included. It evaluates whether required fields are actually filled in. Even if data has been entered, if most of the fields are blank or marked as "unknown," the data will be of little use.

Recency refers to whether data is acquired at an appropriate cycle. For example, if data that is updated daily can only be acquired six months later, the real-time nature of the data will be lost. Conversely, if data that is only updated once a month is acquired and updated every day, there will be no change and it will be meaningless. It is important to consider whether the timing of data generation and acquisition are appropriately matched.
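The three quality checks above can be made concrete with a small validation routine. This is an illustrative sketch: the field names (`customer_id`, `amount`, `updated_at`) and the rules (numeric amount, required fields, one-day freshness window) are hypothetical examples, not a standard.

```python
# Sketch of the three data-quality checks: accuracy, completeness, recency.
from datetime import date, timedelta

def check_record(record: dict, today: date) -> list[str]:
    problems = []
    # Accuracy: the amount must be a number, not a string like "unknown".
    if not isinstance(record.get("amount"), (int, float)):
        problems.append("accuracy: amount is not numeric")
    # Completeness: required fields must be present and non-empty.
    for field in ("customer_id", "amount", "updated_at"):
        if record.get(field) in (None, "", "unknown"):
            problems.append(f"completeness: {field} is missing")
    # Recency: data that should be updated daily must not be stale.
    updated = record.get("updated_at")
    if isinstance(updated, date) and (today - updated) > timedelta(days=1):
        problems.append("recency: record is stale")
    return problems

bad = {"customer_id": "", "amount": "unknown", "updated_at": date(2024, 1, 1)}
issues = check_record(bad, today=date(2024, 6, 1))
# This record fails on all three dimensions.
```

In a real data infrastructure, checks like these would run automatically as data is collected from the source, flagging or correcting records before they reach the knowledge base.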

When collecting data from its source, a data infrastructure is required to detect and correct errors based on these quality perspectives, exclude missing data, and update it at an appropriate cycle. By using a data infrastructure to put data in the correct format, data quality can be maintained and the quality of the output obtained from it can be guaranteed.

What is data reliability?

This is related to the quality mentioned above, but it is extremely important to ensure the reliability of the data. Reliability means that the data is correct when viewed by anyone and can be used based on the same understanding.
For example, let's say the sales department and the business planning department each report on sales figures at a meeting. Each department reviews the data and then creates a report, but there is a big discrepancy between the sales figures reported by the sales department and the business planning department. What is going on?

Both departments were checking the data, but the sales department was checking the sales data held in the customer management system, while the business planning department was checking the sales data held in the accounting system. Although it was the same sales data, each system actually had a different approach to the timing of recording sales for annual contracts. The sales department recorded annual contracts in a lump sum at the time of signing, while the business planning department recorded annual contracts pro rata by month. This resulted in discrepancies in the figures even though the sales data was the same.
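The discrepancy described above is easy to reproduce numerically. In this made-up example, a single annual contract worth 1,200,000 yen is recorded two different ways, and the "April sales" figure differs by more than a factor of ten depending on which rule a system uses.

```python
# Worked example: the same annual contract, two revenue-recording rules.
# The contract value is a hypothetical figure.

annual_contract = 1_200_000  # yen, signed in April

# Sales department (customer management system): full amount at signing.
lump_sum_april = annual_contract

# Business planning (accounting system): prorated, one twelfth per month.
prorated_april = annual_contract / 12

print(lump_sum_april)   # April sales per the sales department
print(prorated_april)   # April sales per business planning
```

Neither figure is wrong on its own; the problem is that the definition of "sales" differs between systems, which is exactly what a shared data definition is meant to prevent.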

As such, it is common for similar data to exist in multiple departments and systems. In such cases, if the definition of the data is unclear, the data formats are different, or the system owners and operational status are unknown, it becomes difficult to reach a common decision, even within the same organization.
A data infrastructure requires metadata management, such as clearly defining each dataset, and data standardization, so that every piece of data has a clear and accurate meaning. This concept is called "Single Source of Truth (SSoT)" and is important for everyone in an organization, including generative AI, to make decisions based on a shared understanding. For more information on SSoT, please see this column.

Related Article: Developer Blog | Single Source of Truth

What is Data Integration?

So far, we have talked about data quality and the reliability it supports. Finally, we will delve deeper into data integration, which maintains quality and ensures reliability. Data integration refers to the process of making various data scattered inside and outside the company available in a centralized manner.

First, data accessibility is important. Since a company's information is stored in a variety of locations and formats, it is necessary to be able to flexibly connect to each system environment (on-premise environment, cloud environment, SaaS, etc.) and each data format (file, database, API, etc.) and retrieve data.

Next, the collected data is standardized. Data standardization means standardizing the data format and style (for example, how to write dates or how many decimal places to display) to eliminate differences in data due to differences in the data source. Standardized data makes it possible to combine, utilize, and analyze multiple data sets, even if they originate from different departments or systems.
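As a small illustration of the standardization step described above, the sketch below converts records from two hypothetical source systems, which export the same sale with different date styles and decimal formats, into one canonical form. The input formats and field names are examples, not a prescription.

```python
# Sketch of data standardization: normalize date format and decimal
# precision so records from different systems can be combined.
from datetime import datetime

def standardize(record: dict, date_format: str) -> dict:
    parsed = datetime.strptime(record["date"], date_format)
    return {
        "date": parsed.strftime("%Y-%m-%d"),          # one canonical date style
        "amount": round(float(record["amount"]), 2),  # fixed decimal precision
    }

# Two systems exporting the same sale in different formats:
from_sales = standardize({"date": "2024/04/01", "amount": "1234.5"}, "%Y/%m/%d")
from_accounting = standardize({"date": "01-04-2024", "amount": "1234.50"}, "%d-%m-%Y")
# After standardization, the two records are identical and can be combined.
```

Once records from different departments pass through a step like this, they can be joined, compared, and analyzed without worrying about which system they came from.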

Finally, the standardized data is delivered to where it is needed. For example, this could involve distributing master data used by multiple systems, or publishing analytical data via an API. By integrating data through a data infrastructure, it becomes possible to provide the necessary data wherever it is needed.

▼I want to know more about the API
API | Glossary

What is iPaaS, the technology that realizes a data infrastructure?

Now that we have explored the key points of the data infrastructure required to prepare the knowledge base required for RAG, we will finally introduce the iPaaS that will realize the data infrastructure.
As the name suggests, iPaaS (Integration Platform as a Service) is a platform for integrating data. It collects various data scattered across different locations, cleanses and standardizes that data to maintain the quality discussed above, and builds a reliable knowledge base.

To realize RAG, it is essential to establish a knowledge base. To do this, it is important to establish a data infrastructure to collect and store data, and iPaaS is useful for providing reliable data.

iPaaS-based data integration platform HULFT Square


To utilize generative AI, you need to know how to capture the data your business needs. Learn more about HULFT Square, Saison Technology's iPaaS, which meets the needs of this era.

▼I want to know more about iPaaS
iPaaS | Glossary

What did you think? This time, we introduced the key points for establishing a data infrastructure to realize RAG, which utilizes data with generative AI. The existence of a data infrastructure is extremely important not only for generative AI but also for organizational members to make decisions based on data, so it is necessary to establish one as soon as possible.

As a domestic manufacturer of data integration products such as iPaaS, Saison Technology has supported the development of data infrastructures for many companies. If you are interested in RAG or the development of a data infrastructure, please feel free to contact us using the inquiry form.

The person who wrote the article

Affiliation: Data Integration Consulting Department, Data & AI Evangelist

Shinnosuke Yamamoto

After joining the company, he worked as a data engineer, designing and developing data infrastructure, primarily for major manufacturing clients. He then became involved in business planning for the standardization of data integration and the introduction of generative AI environments. Since April 2023, he has been working as a pre-sales representative, proposing and planning services related to data infrastructure, while also giving lectures at seminars and acting as an evangelist in the "data x generative AI" field. His hobbies are traveling to remote islands and visiting open-air baths.
(Affiliations are as of the time of publication)
