The key is data structuring: preprocessing to improve the accuracy of generative AI responses

In recent years, as the use of generative AI has rapidly expanded, data preprocessing has been attracting attention as a factor that strongly influences the accuracy of its answers. Shaping and standardizing data is expected to reduce the risk of hallucination.
In this article, we will explain "data structuring," which is important for providing data suitable for generative AI, and introduce specific preprocessing methods.
What is data structuring?
Data can be broadly categorized into structured data with clear row and column rules, unstructured data stored in a free format, and semi-structured data, which is somewhere in between. In actual business, there are an increasing number of cases where we handle not only tabular numerical data, but also a variety of information sources such as text, audio, and images.
Structured data
This refers to data with clearly defined row and column rules, such as in a table or database.
Examples of structured data include customer master data and sales data. Because this data is stored in a predefined format, when it is used with generative AI, an approach called "Text-to-SQL" is generally used, in which the generative AI creates SQL from text.
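The Text-to-SQL approach can be sketched very simply: the table schema and the user's question are combined into a prompt, and the model is asked to return only SQL. The schema and prompt wording below are illustrative assumptions, and the actual call to a generative AI model is omitted.

```python
# Illustrative schema for a sales table; field names are assumptions.
SCHEMA = """\
CREATE TABLE sales (
    sale_id   INTEGER PRIMARY KEY,
    customer  TEXT,
    amount    INTEGER,
    sold_at   DATE
);"""

def build_text_to_sql_prompt(question: str, schema: str = SCHEMA) -> str:
    """Compose the prompt that would be sent to a generative AI model."""
    return (
        "You are a SQL assistant. Given the schema below, "
        "answer the question with a single SQL query only.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_text_to_sql_prompt("What were total sales in March?")
print(prompt)
```

Because the schema is part of the prompt, the model does not have to guess table or column names, which is what makes this approach practical for structured data.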
▼I want to know more about generative AI
⇒ Generative AI | Glossary
Unstructured data
This refers to data that does not have a set framework, such as documents, images, audio, and video. In recent years, the proportion of unstructured data in business information, such as email text, social media posts, PDFs, and scanned images, has become very large.
In its raw form, unstructured data is difficult for generative AI to handle, and it must be converted and labeled in advance so that its meaning and context can be analyzed. A common approach is to use OCR or speech recognition to extract text from documents and audio and convert it into an easily interpreted format.
Semi-structured data
Like XML and JSON, this is a data format that partially includes meta-information such as tags and keys. It has the advantage of being structured yet flexibly extensible, making it easy to handle a wide variety of data.
It is easy for both humans and machines to understand, and is useful as input data for generative AI. To make it easier to handle in operation, give each item a clear meaning; the more descriptions and schema definitions you provide, the more advanced the applications that become possible.
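A semi-structured record might look like the JSON sketch below. The field names are assumptions for illustration; the point is that each key carries meaning, so both humans and a generative AI can interpret each value without an external table definition.

```python
import json

# Illustrative semi-structured product record; keys act as meta-information.
record = {
    "product_id": "P-1001",        # unique identifier (assumed naming)
    "name": "Wireless Mouse",
    "price_jpy": 2980,             # tax excluded
    "tags": ["peripheral", "bluetooth"],
    "description": "Compact mouse for mobile work.",
}

text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)

# Because the structure is explicit, the record can be parsed back
# without losing the meaning of any item.
parsed = json.loads(text)
```

New fields can be added to such a record at any time, which is the "structured yet flexibly extensible" property described above.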
Why data structuring is important for generative AI
If insufficiently structured data is fed in as is, the inference accuracy of the generative AI drops and out-of-context answers become more likely. To pass appropriate information to generative AI through approaches such as RAG (retrieval-augmented generation), it is essential to format the data so that the AI can easily interpret it correctly.

Reason 1: Helping the AI understand the data and its context
AI is required to correctly search for and interpret the data that users are looking for. A clear data structure is one of the important factors for improving AI's searchability and interpretability.
For example, parent-child relationships between pieces of information (such as the relationship between chapters, sections, and body text) and parallel relationships (such as bullet points) are important elements of data structure. Clarifying these structures with AI-OCR, structuring APIs, and similar tools is expected to improve searchability and interpretability for the AI, leading to more accurate answers.
▼I want to know more about the API
⇒ API|Glossary
Reason 2: Ensuring consistency and accuracy of output
By organizing data structurally in a set format rather than leaving it fragmented, you can ensure a consistent granularity and level of detail in the answers the AI provides.
In addition, by clarifying the data structure, it becomes possible to clearly distinguish between necessary and unnecessary information, and to derive answers based only on necessary information. By eliminating noise information, it is possible to prevent the generation of unnecessary or unexpected answers.
Reason 3: Processing efficiency and cost optimization
Having a clear structure for the data referenced by AI is expected to reduce processing time during searches and summarization, and reduce costs.
For companies that handle large amounts of data, the time and cost required to process each piece of data can lead to significant lost opportunities. Organizing data in advance can improve search and response performance, allowing resources to be allocated to areas where people should be more focused.
Structured format example
Documents commonly used in business can be structured logically using formats such as Markdown or JSON, regardless of their original file extension. This makes it easier to distinguish the body text from headings, lists, and so on, so the generative AI can quickly pick out the desired information.
Data compiled in Q&A format also makes it easier for generative AI to search for key points, as the correspondence between questions and answers is clear. By hierarchically assembling your company's knowledge base and product information, you can expect to improve the accuracy of answers.
Format example 1: Markdown notation
By adding markup such as headings, lists, and links to text, you can create simple yet highly readable documents. The advantage is that it is easy to handle between systems and accurately conveys content and structure regardless of the display format.
By passing heading levels and bullet points to the generative AI in place of raw tags, the intent of the document becomes easier to convey.
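The heading structure described above is also easy to exploit mechanically. The sketch below splits a Markdown document into (heading, body) sections, so each piece passed to a generative AI keeps its heading as context; the sample document is invented for illustration.

```python
import re

doc = """# Product guide

Intro text.

## Setup

Plug in the device and install the driver.

## Troubleshooting

Restart if pairing fails.
"""

def split_by_headings(markdown: str):
    """Return a list of (heading, body) pairs, one per Markdown heading."""
    sections = []
    heading, lines = None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):          # an ATX heading starts a new section
            if heading is not None:
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if heading is not None:                      # flush the final section
        sections.append((heading, "\n".join(lines).strip()))
    return sections

for title, body in split_by_headings(doc):
    print(title, "->", body)
```

Each section now carries its own title, which is exactly the kind of parent-child structure that helps retrieval.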
Format example 2: QA format (question-and-answer pairs)
This is a useful method for structuring inquiries and FAQ content. An LLM generates the expected question-and-answer pairs, which are then saved in a format such as CSV.
By storing data in Q&A format, answers can be derived quickly and accurately by searching a comprehensive list of questions, rather than reasoning over the original data each time a user asks a question.
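Saving such pairs as CSV is straightforward. The pairs below are hand-written stand-ins for LLM output; a real pipeline would generate them from source documents.

```python
import csv
import io

# Hand-written example pairs standing in for LLM-generated Q&A.
qa_pairs = [
    ("What are your business hours?", "Weekdays 9:00-18:00."),
    ("How do I reset my password?", "Use the 'Forgot password' link."),
]

buf = io.StringIO()                 # in-memory file; use open(...) for a real CSV
writer = csv.writer(buf)
writer.writerow(["question", "answer"])
writer.writerows(qa_pairs)
print(buf.getvalue())
```

Because each row pairs one question with one answer, a retrieval step only has to match the user's question against the first column.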
Format example 3: JSON/XML format
Its greatest feature is the ability to define a hierarchical structure, making it a format that different systems can easily read and write when exchanging data. Elements can be nested freely, so even complex data can be stored while preserving its meaning.
In addition to table data, meta-information can be attached to each item, which is extremely useful for helping generative AI understand context. For example, by linking tag information and the file paths or URLs where files are saved, faster and more accurate answers can be expected.
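Attaching meta-information might look like the sketch below: the text itself travels together with tags and a source path, so an answer can point back to where it came from. The keys and the path are hypothetical examples, not a prescribed schema.

```python
import json

# One text chunk with attached meta-information; all names are illustrative.
doc_chunk = {
    "text": "Returns are accepted within 30 days of purchase.",
    "meta": {
        "tags": ["policy", "returns"],          # for filtering at search time
        "source_path": "docs/policies/returns.pdf",  # hypothetical location
        "page": 2,
    },
}
print(json.dumps(doc_chunk, indent=2))
```

At answer time the `meta` block lets the system both narrow the search by tag and cite the source document.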
Structuring unstructured data
With the increase in unstructured data, various technologies such as OCR and speech recognition have been developed. These technologies go beyond simple text conversion and can analyze the intent of documents and speech, and recognize their structure.
Document Structuring
For paper documents and scanned data such as PDFs, text information is first extracted with AI-OCR so that the content can be treated as data.
The extracted text is then structured with an LLM or similar tool into Markdown notation, QA format, and so on. Some AI-OCR products extract and structure text at the same time.
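The restructuring step can be sketched with a deliberately simple heuristic: in the example below, all-caps lines in OCR output are treated as headings and converted to Markdown. A real pipeline would use an LLM for this step; the rule and the sample text are purely illustrative.

```python
# Hand-written stand-in for AI-OCR output.
ocr_text = """ANNUAL REPORT
Revenue grew 12 percent year over year.
OUTLOOK
We expect continued growth next year."""

def to_markdown(text: str) -> str:
    """Toy heuristic: promote all-caps lines to Markdown headings."""
    out = []
    for line in text.splitlines():
        if line.isupper():
            out.append("## " + line.title())
        else:
            out.append(line)
    return "\n".join(out)

print(to_markdown(ocr_text))
```

Whatever performs the conversion, the output is the same kind of heading-and-body structure described in the Markdown section above.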
Image structuring
A multimodal LLM, which can interpret data in a variety of formats, identifies the objects in image data. Based on what it identifies, labels that state clearly what the image shows and captions that describe it are generated and attached as metadata.
Instead of handling the image data itself, by replacing it with text information such as labels and captions, it can be treated in the same way as documents and other files when searching and responding.
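Once labels and captions exist as text, images can be searched like documents. In the sketch below, the labels and captions are hand-written stand-ins for multimodal-model output, and the file paths are invented.

```python
# Text metadata standing in for what a multimodal model would produce.
records = [
    {"file": "images/factory_line.png", "label": "assembly line",
     "caption": "Workers inspecting products on an assembly line."},
    {"file": "images/warehouse.png", "label": "warehouse",
     "caption": "Pallets stacked in a distribution warehouse."},
]

def search(keyword: str):
    """Find images whose label or caption mentions the keyword."""
    return [r["file"] for r in records
            if keyword in r["label"] or keyword in r["caption"]]

print(search("warehouse"))  # → ['images/warehouse.png']
```

The image bytes themselves never enter the search; only their textual metadata does, which is what makes them interchangeable with documents at retrieval time.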
Structuring Audio
It is important to first convert audio data into text using speech recognition; some recent conferencing systems can transcribe recordings automatically.
The transcribed text is then processed with LLMs and speech-analysis tools: speaker separation (recognizing and separating the participants in the conversation) and sentiment analysis (labeling each remark as positive or negative, for example) are applied, and summaries and keywords are extracted and added as metadata.
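The result for one utterance might look like the record below. The speaker label and sentiment would come from diarization and sentiment-analysis tools; here they are hand-written examples, and the keyword step is a toy length filter rather than real extraction.

```python
# One transcribed utterance with metadata; all values are illustrative.
utterance = {
    "speaker": "Speaker A",       # from speaker separation (diarization)
    "text": "I think the new release went really well.",
    "sentiment": "positive",      # from sentiment analysis
    "start_sec": 12.4,
}

# Toy keyword extraction: keep words longer than six characters.
keywords = [w for w in utterance["text"].lower().split() if len(w) > 6]
print(keywords)  # → ['release']
```

Stacking such records gives a searchable, speaker-attributed transcript instead of an opaque audio file.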
Integration into data pipelines
To handle large amounts of data centrally and automatically, you need not only a structuring process but also a system that integrates the entire flow, from data collection to AI utilization.
Recently, efforts go beyond simply structuring data to automating the collection and preprocessing of information across the entire system. By collecting and structuring data in one flow and dividing it efficiently into chunks, even large volumes can be referenced by the generative AI immediately.
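Chunking itself can be sketched in a few lines: text is split into fixed-size pieces with an overlap so that information at chunk boundaries is not lost. The sizes below are illustrative; real pipelines often split on sentence or section boundaries instead.

```python
def chunk(text: str, size: int = 40, overlap: int = 10):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "a" * 100                      # placeholder document
pieces = chunk(doc)
print(len(pieces), [len(p) for p in pieces])  # → 4 [40, 40, 40, 10]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.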
Automated structuring
Manually preprocessing huge volumes of unstructured documents and multilingual data is simply not feasible. By combining automated OCR, speech recognition, and labeling with multimodal LLMs, flexible processing is possible regardless of the amount of data.
Recently, cloud services that integrate AI-OCR and structuring functions have become available in API format. By utilizing these structured APIs, it becomes possible to put internal data to practical use more quickly.
Automation from data collection to AI integration
Simply automating structuring still requires a person to collect and register data each time it is updated. It is important to build a "data pipeline" that manages the entire process, from data collection through preprocessing such as structuring to linking with the AI, for example by registering the data in a vector database.
Each process in the data pipeline is sequence-controlled and monitored. If an error occurs in a step, the erroneous data is not registered with the AI; instead, the person in charge is notified. This keeps data quality high and prevents the loss of data the AI should reference.
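The sequence control described above can be sketched as a simple loop: each step runs in order, and when one fails the bad record is routed to a notification instead of being registered in the AI's index. All step functions below are hypothetical stubs, not a real pipeline API.

```python
# Hypothetical pipeline steps (stubs for illustration only).
def collect(item):
    return item

def structure(item):
    if not item.strip():                     # simulated validation failure
        raise ValueError("empty document")
    return {"text": item}

def register(record, index):
    index.append(record)                     # stand-in for a vector DB insert

def notify(error, item):
    print(f"NOTIFY: {error!r} while processing {item!r}")

index = []
for raw in ["Quarterly report body.", "   "]:
    try:
        register(structure(collect(raw)), index)
    except ValueError as e:
        notify(e, raw)                       # error data never reaches the index

print(len(index))  # → 1  (only the valid document was registered)
```

The key property is that the `except` path and the `register` path are mutually exclusive, so data quality in the index is maintained automatically.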
iPaaS that enables data pipelines for AI
iPaaS (Integration Platform as a Service) is a platform that visualizes data integration between different systems and streamlines operations. It allows you to flexibly build data flows in both cloud and on-premise environments, and provides data in real time for use in generative AI.
In addition to structuring, by incorporating data cleansing and deduplication into an iPaaS workflow, you can continuously manage the state of the data passed to the AI. With such a system in place, you can extract even greater value from structured data.
HULFT Square, a Japanese iPaaS, provides processing templates for various AI applications, including structuring.
"HULFT Square" pre-processing template accelerates AI utilization
HULFT Square offers practical, no-code application templates that package common AI preprocessing tasks, which are often difficult to build yourself, in a ready-to-use form.
▼I want to know more about iPaaS
⇒ iPaaS | Glossary
In closing
In this article, we introduced the importance of data structuring and preprocessing to improve the accuracy of generative AI responses, as well as specific techniques.
Data structuring is an essential process for correctly analyzing the large amount of files a company holds and conveying the context without misunderstandings. Even unstructured data can be automatically converted into text by combining OCR and speech recognition, making it easier for AI to interpret and use.
By using iPaaS as a data pipeline, you can operate a system that continuously improves data quality. As an approach to improving response accuracy, we encourage you to try data preprocessing with iPaaS.
