RAG's data governance realized through iPaaS

Shinnosuke Yamamoto

6 minutes to read

The rise of generative AI is moving corporate data utilization to a new stage. In particular, Retrieval Augmented Generation (RAG), a technique that improves the accuracy of generative AI responses by grounding them in retrieved data, lets anyone in a company, business or IT staff alike, use data freely through generative AI with far less concern about hallucination.
In this column, we introduce how iPaaS (Integration Platform as a Service) can ensure governance when using RAG, so that it can be used safely and efficiently in business.

What is Data Governance in RAG?

Data governance is a framework for controlling the entire process of collecting, storing, using, and disposing of data, which is a corporate asset, and for ensuring the quality, integrity, and security of that data. It is also one of the 11 knowledge areas defined in the DMBOK (Data Management Body of Knowledge, published in Japan under the title "Data Management Body of Knowledge Guide"), and is an important perspective for companies that want to use data strategically.

Similarly, when corporate data is handled through generative AI, data governance must be considered when implementing RAG applications such as chatbots and workflows. So which aspects of data governance matter in RAG? Here we consider three examples: quality, security, and cost efficiency.

1. Quality: Accuracy of the generative AI's answers

First, quality refers to the results of the RAG application, i.e., the accuracy of the generative AI's answers. The accuracy of RAG answers can be evaluated primarily in terms of input, process, and output. Input refers to the corporate data referenced by the generative AI. If the referenced corporate data is incorrect, answers based on it will not be correct either. An example of the process is the search function. Even if all the data is available, it is meaningless if the search function cannot find the data that is needed. Output refers to the generative AI's ability to answer questions. Even if the search function returns appropriate, high-quality data, you will not get the information you want if the generative AI's generation ability is low.

2. Security: Safe and secure management of data

Second, security refers to the safe and secure management of corporate data. This applies not only to threats from outside the company but also to access within it. For example, personal information such as employee addresses and evaluations should not be open to everyone, even inside the company. In many companies, such data is kept closed within business systems or, if stored in a data warehouse, is masked (certain fields are made inaccessible). RAG likewise requires that the appropriate users can access data only within the appropriate scope.

3. Cost efficiency: Optimizing operational costs

Third, cost efficiency refers to optimizing the costs of operating RAG. Costs include not only the usage fee for the RAG application (including charges for using the generative AI) but also the human costs involved in maintaining the system. Even if a RAG application is built, if many engineers have to operate it manually behind the scenes, it will be difficult for them to devote time to the other work they should be doing. It is important to consider how to keep operating costs to the necessary minimum.

RAG's data governance realized through iPaaS

There are various ways to implement RAG, but Saison Technology recommends building RAG using iPaaS. iPaaS is an abbreviation for Integration Platform as a Service, and provides a mechanism for connecting various business systems and generative AI services owned by a company, integrating and utilizing that data.

iPaaS-based RAG can provide various functions that enable safe and secure use of corporate data from the data governance perspectives described above. Here, we introduce four of these functions.

1. Improved search accuracy through query expansion

Query expansion is a technique for improving search accuracy in RAG. A vector database is generally used as the search function in RAG: it receives search terms from outside as a query, then finds and returns the elements closest to those terms. Query expansion raises the search hit rate by slightly modifying the query.

Specifically, synonyms are added to the query. In RAG applications, users give instructions to the generative AI in natural language. Because people use words with different nuances and expressions, the same thing may be described in different words. The data, however, is recorded with one specific expression, so this variation in language can make the data impossible to find. To absorb the variation, synonyms are generated from the user's instructions and combined with the original instructions as the search terms sent to the vector database.

With iPaaS, which can implement data processing workflows, it is easy to add a step that calls a generative AI to generate synonyms before querying the vector database, and thereby expand the query. This is expected to improve search accuracy compared with a simple search.
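
The query expansion flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way any particular iPaaS product implements it: generate_synonyms stands in for the generative AI call (here a hypothetical lookup table), and search stands in for the vector database (here a toy word-overlap ranking instead of embeddings).

```python
from typing import Callable

# Hypothetical stand-in for a generative AI call that returns synonyms.
SYNONYMS = {
    "revenue": ["sales", "turnover"],
    "staff": ["employee", "personnel"],
}

def generate_synonyms(query: str) -> list[str]:
    """Return synonyms for each known word in the query (assumed LLM step)."""
    found: list[str] = []
    for word in query.lower().split():
        found.extend(SYNONYMS.get(word, []))
    return found

def expand_query(query: str,
                 synonym_fn: Callable[[str], list[str]] = generate_synonyms) -> str:
    """Combine the original query with the generated synonyms."""
    return " ".join([query, *synonym_fn(query)])

def search(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy replacement for a vector search: rank documents by shared words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]

documents = ["monthly sales by region", "employee headcount by department"]

# Without expansion, "revenue" never matches the stored word "sales".
plain_hits = search("revenue trend", documents)

# With expansion, the synonym "sales" is part of the query and the document is found.
expanded_hits = search(expand_query("revenue trend"), documents)
```

In this toy data the plain query finds nothing, while the expanded query reaches the document recorded under "sales". A real pipeline would replace both stand-ins with calls to an actual model and vector database.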

2. Controlling data access based on user permissions

Data access control ensures governance by searching and referencing only the data appropriate to each user's permissions. The data you want to use may include confidential information, such as trade secrets or personal information, that should be used only within a department or only by management. However, configuring and managing such access control separately in each tool, such as data storage and AI services, is not operationally efficient.

With iPaaS-based RAG, it is possible to inherit permissions from the data source, assign appropriate permissions to the search, extraction, and analysis processes, and dynamically switch the processing performed. For example, the name and email address of the user running the RAG chatbot can be looked up in a human resources master to determine whether the user is a general employee or a manager, and the list of searchable data can then be branched on those attributes. If access control is implemented on the data source side, the credentials used to retrieve data are switched according to the user's attributes, so that the data is accessed with the appropriate permissions.

This enables the entire organization to make effective use of data assets safely and securely.
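
As a rough sketch of the branching described in this section, the Python below looks up a user in an HR master and derives both the searchable data sets and the connection profile for the data source. Every name here (HR_MASTER, SEARCHABLE_BY_ROLE, CONNECTION_BY_ROLE, the role names, and the example addresses) is hypothetical, and a real workflow would read these from actual systems rather than in-memory dictionaries.

```python
from typing import Optional

# Hypothetical HR master keyed by email address.
HR_MASTER = {
    "sato@example.com": {"name": "Sato", "role": "manager"},
    "suzuki@example.com": {"name": "Suzuki", "role": "general"},
}

# Hypothetical mapping from role to the data sets a search may touch.
SEARCHABLE_BY_ROLE = {
    "general": ["product_manuals", "company_rules"],
    "manager": ["product_manuals", "company_rules", "personnel_evaluations"],
}

# Hypothetical mapping from role to the connection profile used at the data source.
CONNECTION_BY_ROLE = {
    "general": "readonly_general",
    "manager": "readonly_manager",
}

def resolve_role(email: str) -> Optional[str]:
    """Look up the user's role in the HR master; unknown users have no role."""
    record = HR_MASTER.get(email)
    return record["role"] if record else None

def searchable_sources(email: str) -> list[str]:
    """Return the data sets this user may search (empty for unknown users)."""
    return SEARCHABLE_BY_ROLE.get(resolve_role(email), [])

def connection_profile(email: str) -> Optional[str]:
    """Return the connection profile to use at the data source, if any."""
    return CONNECTION_BY_ROLE.get(resolve_role(email))
```

The sketch denies by default: a user who is not in the HR master gets an empty list and no connection profile, so a missing record cannot widen access.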

3. Cache accumulation and management

A cache is temporarily stored data. In RAG applications, the results of recent data searches, answers, and positive user feedback are saved in a cache and reused as needed for similar requests in the future. Establishing a system that can use a cache is expected to bring various benefits, such as ensuring reproducibility, improving response speed, and reducing costs.

When using data on demand, there are often requests to reproduce past best practices, such as "I want to output the same content as before" or "I want to make that graph again in the same format as before." By using a cache, it is possible to provide the same answer as in the past, or to perform analysis using the same process and procedures as in the past. This makes it possible to reproduce past best practices, and is expected to shorten response times by skipping generation processes as appropriate.

Additionally, many generative AI models are billed according to the number of tokens (the units in which input and output are measured) consumed. By using a cache, a generation process that has already been run in the past can be skipped, reducing the token consumption for answers. For example, generating a large amount of text or images requires a considerable number of tokens, so generating the same thing each time can be quite costly. iPaaS can contribute to cost reduction in RAG by enabling storage and search in a cache database.
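
The cost effect of a cache can be illustrated with a minimal sketch. It assumes an in-memory dictionary instead of a cache database, and it matches questions only after simple normalization (case and whitespace), whereas matching merely similar questions would need something like an embedding search. AnswerCache and fake_generate are hypothetical names, and fake_generate stands in for the billed generative AI call.

```python
import hashlib

class AnswerCache:
    """In-memory answer cache keyed by a normalized question."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivially different inputs share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_generate(self, question, generate):
        """Return the cached answer, or call generate() once and store the result."""
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = generate(question)
        self._store[key] = answer
        return answer

calls = []

def fake_generate(question: str) -> str:
    """Stand-in for a token-consuming generative AI call."""
    calls.append(question)
    return f"answer to: {question}"

cache = AnswerCache()
first = cache.get_or_generate("Show sales by region", fake_generate)
second = cache.get_or_generate("  show  SALES by region ", fake_generate)
```

Here the second request is served from the cache, so the generation function runs only once and both requests receive the same answer, which also illustrates the reproducibility point made earlier in this section.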

4. Automatic Metadata Updates

Metadata refers to data about data. Examples include the date the data was last updated, where it is stored, its meaning, and calculation formulas. When using data with generative AI, metadata is an important supporting resource for generative AI to interpret the data. However, creating and managing this metadata is a heavy workload for many data administrators. With iPaaS-based RAG, the metadata management process is defined as a data pipeline, enabling a mechanism for automatic updates using generative AI.

Metadata is important in data utilization, but preparing it takes considerable research and work: for structured data, for example, information is needed for every table and every column. With automatic metadata updates, the generative AI infers the type of data based on the table definition (DDL) and data contents, verbalizes it, and registers it as metadata. This is expected to reduce the operational burden on data maintenance personnel.
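
The shape of such a pipeline step can be sketched as follows. The sketch makes several simplifying assumptions: the DDL is a simple CREATE TABLE statement with one column per line, only the DDL is used (not the data contents), and describe_column is a hypothetical stand-in for the generative AI that would write the description.

```python
import re

DDL = """
CREATE TABLE sales_orders (
    order_id INTEGER,
    customer_name VARCHAR(100),
    order_date DATE
);
"""

def parse_ddl(ddl: str) -> tuple[str, list[tuple[str, str]]]:
    """Extract the table name and (column, type) pairs from a simple CREATE TABLE."""
    match = re.search(r"CREATE\s+TABLE\s+(\w+)\s*\((.*)\)", ddl, re.S | re.I)
    if match is None:
        raise ValueError("no CREATE TABLE statement found")
    table, body = match.group(1), match.group(2)
    columns = []
    for line in body.splitlines():
        line = line.strip().rstrip(",")
        if not line:
            continue
        name, sql_type = line.split(None, 1)
        columns.append((name, sql_type))
    return table, columns

def describe_column(table: str, column: str, sql_type: str) -> str:
    """Stand-in for the generative AI step: turn identifiers into a readable phrase."""
    return f"{column.replace('_', ' ')} of {table.replace('_', ' ')} ({sql_type})"

def build_metadata(ddl: str, describe=describe_column) -> list[dict]:
    """Produce one metadata record per column, ready to be registered."""
    table, columns = parse_ddl(ddl)
    return [
        {"table": table, "column": name, "type": sql_type,
         "description": describe(table, name, sql_type)}
        for name, sql_type in columns
    ]

metadata = build_metadata(DDL)
```

Passing the description function as a parameter keeps the parsing and registration logic unchanged when the stand-in is swapped for a real model call.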

Furthermore, metadata is important reference information for finding the data users want in a vector search, but in practice the expected results may not be obtained because of variations in how the user writes a word (for example, different spellings or notations of "employee") or because of ambiguity (for example, "productivity" vs. "performance"). The generative AI identifies the link between the user's input and the metadata, adds the missing words to the relevant metadata, and thereby tunes it to raise the hit rate in searches.
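
A minimal sketch of this tuning step is shown below. It assumes a keyword-based catalog rather than a vector index, and the decision that "performance" belongs to a given table, which the text attributes to the generative AI, is passed in explicitly here. The catalog contents and function names are hypothetical.

```python
# Hypothetical metadata catalog: each table has a description and search keywords.
catalog = {
    "hr.employees": {"description": "employee master", "keywords": {"employee"}},
    "sales.kpi": {"description": "sales productivity indicators",
                  "keywords": {"productivity"}},
}

def find_tables(term: str, catalog: dict) -> list[str]:
    """Return the tables whose keywords contain the search term."""
    term = term.lower()
    return sorted(name for name, entry in catalog.items()
                  if term in entry["keywords"])

def register_alias(term: str, table: str, catalog: dict) -> None:
    """Add a user's wording to a table's keywords so later searches hit it."""
    catalog[table]["keywords"].add(term.lower())

# Before tuning, the user's wording does not match any metadata.
before = find_tables("performance", catalog)

# Assumed output of the generative AI: "performance" refers to sales.kpi.
register_alias("performance", "sales.kpi", catalog)

# After tuning, the same wording finds the table.
after = find_tables("performance", catalog)
```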

In closing

What did you think? RAG has the potential to be a major key to enabling everyone, IT and non-IT employees alike, to utilize data, but deploying it company-wide requires a system that allows data to be used safely and securely. With iPaaS-based RAG, data governance mechanisms such as those introduced here can be implemented as data pipelines. Of course, there is more than one way to implement data governance, and what we have introduced here is not everything. Please feel free to contact Saison Technology to discuss how to incorporate the data governance your company requires into RAG.

In another column, we will introduce how data integration can maintain the quality of corporate data and what kind of data utilization can be achieved by having access to various data and various generative AI, so please take a look at the recommended content below.

About the author

Affiliation: Data Integration Consulting Department, Data & AI Evangelist

Shinnosuke Yamamoto

After joining the company, he worked as a data engineer, designing and developing data infrastructure, primarily for major manufacturing clients. He then became involved in business planning for the standardization of data integration and the introduction of generative AI environments. Since April 2023, he has been working in pre-sales, proposing and planning services related to data infrastructure, while also giving lectures at seminars and acting as an evangelist in the "data x generative AI" field. His hobbies are traveling to remote islands and visiting open-air baths.
(Affiliations are as of the time of publication)