What is chunking, the technique that improves the accuracy of RAG document search?

  • Data Utilization
  • Generative AI

Techniques such as Retrieval-Augmented Generation (RAG) and AI agents, which use generative AI to make effective use of the data companies hold, are attracting attention. In these systems, "vector search" plays a key role in locating the necessary information.
The accuracy of vector search depends heavily on the "chunks" of data being searched.
In this article, we explain chunking techniques for improving the accuracy of vector search, along with data integration methods for continuously creating effective chunks.


Chunking Overview

Chunking is a data preprocessing technique for vector search that divides large volumes of unstructured data (documents, images, audio, etc.) into smaller units called "chunks." Chunks define the units of information stored in the vector database, and they are an important factor in determining search accuracy.

First, we will explain why chunking is necessary and the basic elements that should be considered when chunking.

Why chunking is necessary

Each generative AI model has an upper limit, called the "context window," on the amount of data it can handle for input and output in a single generation. For this reason, it is not realistic to feed in huge amounts of text, such as tens or hundreds of pages, as is. When generative AI handles data, the necessary information must be selected and the amount of data limited to fit the size of the context window.

Limiting the input to relevant information, rather than the full data, also plays an important role in reducing hallucination.

For example, suppose a user asks a generative AI a question and has it answer based on a PDF business report. The report consists of five chapters, each written by a different author on a different topic. If the information the user is looking for is in chapter 3, but the generative AI is given the entire report, the unrelated chapters will be included as noise in the material used for the answer.

It's important to break down information into appropriate chunks and minimize noise to arrive at the desired answers.

▼I want to know more about generative AI
Generative AI | Glossary

Size and overlap

The key fundamental elements in designing chunking are "size" and "overlap." The settings of these parameters affect the accuracy of the search and the quality of the final answer.

Size

Size is the maximum number of characters or tokens of information that make up one chunk. The optimal chunk size varies depending on the use case and the context window of the LLM.

Setting a small size will split the data into smaller chunks. Small chunks are useful for use cases where you want to pinpoint specific words, such as proper nouns or keywords. However, because the information is limited, it is not suitable for use cases where you need the overall context or overall understanding of the data.

On the other hand, increasing the size will broaden the amount of information obtained during the search, making it suitable for use cases that require the entire context and overall understanding of the data. However, this may result in the inclusion of information from a completely different context or information that constitutes unexpected noise. Also, if the context window of an LLM is limited, it is necessary to limit the number of chunks (Top-k) obtained as search results.

▼I want to know more about LLM
Large Language Model (LLM) | Glossary

Overlap

Overlap is a value that indicates how much overlapping information is allowed between chunks. When data is divided into small chunks, the information in each chunk becomes fragmented, and there is a risk that the original context may be lost. It is important to maintain the continuity of context between chunks by setting an appropriate overlap. The appropriate overlap range also varies depending on the use case.

For example, in tasks such as pinpoint extraction of information from FAQs, extensive overlap is not required; it is sufficient to extract the parts that contain the necessary information.

On the other hand, when it is necessary to maintain the overall context of the data, such as when searching for content with a narrative or technical logical structure, it is necessary to ensure continuity of the information by providing a certain amount of overlap between chunks.
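The interplay of size and overlap can be sketched in a few lines. This is a minimal illustration assuming character-based splitting; token-based splitting works the same way, with tokens in place of characters.

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most `size` characters,
    repeating the last `overlap` characters across each boundary
    so that context carries over between neighboring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks

# Each chunk shares 2 characters with its neighbor.
chunks = split_with_overlap("abcdefghij", size=4, overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Raising `overlap` preserves more continuity at the cost of storing (and embedding) more duplicated text.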

Main chunking approaches

There are several main approaches to chunking. Each has its own advantages and disadvantages, as well as data structures it is best suited for, so it is important to select a method strategically based on these characteristics.

[Figure: Main chunking approaches]

Fixed-Length Chunking

Fixed-length chunking is the simplest approach to splitting data by dividing it into a specified number of characters or tokens.

Because the chunk size is fixed and the division is automatic, the computational cost is low and implementation is simple. On the other hand, because the split ignores the context and meaning of the information, content may be cut off in the middle of a chunk.
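Fixed-length chunking is short enough to sketch in full. Here whitespace tokens stand in for real tokenizer output, which is an assumption for illustration; a production system would count tokens with the embedding model's own tokenizer.

```python
def fixed_length_chunks(text: str, tokens_per_chunk: int) -> list[str]:
    """Split text into chunks of at most `tokens_per_chunk` whitespace
    tokens, ignoring sentence or paragraph boundaries entirely."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]

chunks = fixed_length_chunks("a b c d e", tokens_per_chunk=2)
# → ['a b', 'c d', 'e']
```

Note how the last chunk is simply whatever remains; nothing in this approach prevents a sentence from being split mid-thought, which is exactly the weakness described above.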

Recursive chunking

Recursive chunking is an approach that splits data by separators, which are hierarchical delimiters such as paragraphs or words.

This approach assumes that there are appropriate separators in the data. If separators exist, chunks can be divided at regular intervals such as paragraphs, making it possible to maintain context as much as possible. However, if there are no clear separators or hierarchical structures, it becomes difficult to achieve sufficient results.
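The recursive idea above can be sketched as follows: try the coarsest separator first, and only when a piece is still too large fall through to the next, finer one. The separator hierarchy and size limit here are illustrative choices, not fixed values.

```python
SEPARATORS = ["\n\n", "\n", " "]  # paragraph -> line -> word

def recursive_split(text: str, max_size: int, separators=SEPARATORS) -> list[str]:
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard fixed-length cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if len(piece) > max_size:
            # Piece is still too long: recurse with the next, finer separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, max_size, rest))
            continue
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_size:
            current = candidate  # keep packing pieces into the same chunk
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "Para one.\n\nPara two is a bit longer.\n\nThree."
chunks = recursive_split(text, max_size=20)
# → ['Para one.', 'Para two is a bit', 'longer.', 'Three.']
```

Short paragraphs survive intact, while the over-long middle paragraph falls through to word-level splitting, which is the behavior the approach is named for.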

Structure-Aware Chunking

Structure-aware chunking is an approach that divides chunks using the logical structure (headings, lists, etc.) of data such as documents. It is particularly effective for structured data expressed in formats such as Markdown.

Structure-aware chunking understands the structure of the data before dividing it, so the context can be preserved logically. However, it cannot be applied to data that does not have a clear structure.
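For Markdown, the logical structure is visible in the heading lines, so a sketch of structure-aware chunking can split on them directly. The dict shape returned here is an illustrative choice.

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading,
    keeping the heading text alongside the section body."""
    chunks = []
    heading, lines = None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):  # a Markdown heading starts a new section
            if heading is not None or lines:
                chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    return chunks

doc = "# Intro\nHello.\n## Details\nMore text."
sections = split_by_headings(doc)
# → [{'heading': 'Intro', 'text': 'Hello.'},
#    {'heading': 'Details', 'text': 'More text.'}]
```

Because each chunk carries its heading, the logical context of the section travels with the text into the vector database.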

Semantic chunking

Semantic chunking is an approach to dividing data based on semantic units. While recursive chunking and structure-aware chunking create meaningful chunks by dividing data based on data structures such as separators and paragraphs, semantic chunking derives more sophisticated semantic units.

Specifically, the data is converted into vectors and the similarity between adjacent segments is calculated; points where the similarity drops clearly are identified as chunk boundaries. Calculating similarity in vector space makes it possible to create chunks that take the meaning of the text into account. However, since the entire dataset must be vectorized and compared, a corresponding processing cost is required.
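The flow just described (embed, compare neighbors, split where similarity drops) can be sketched as below. In practice the sentences would be vectorized with an embedding model; here a toy bag-of-words vector stands in, purely so the control flow is visible, and the 0.2 threshold is an arbitrary illustrative value.

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Placeholder "embedding": word counts. A real system would call
    # an embedding model here.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; start a new chunk wherever the
    similarity to the previous sentence falls below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(current)  # similarity boundary -> new chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(current)
    return chunks

groups = semantic_split(["the cat sat", "the cat slept", "stocks rose sharply"])
# → [['the cat sat', 'the cat slept'], ['stocks rose sharply']]
```

The processing cost mentioned above comes from the embed step: every sentence must be vectorized once before any boundary can be found.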

Data integration for improved accuracy

So far, we have explained the concept and approaches of chunk division to improve search accuracy in vector search. It is important to strategically select chunk division parameters (size, overlap, etc.) and division approaches (fixed-length chunking, etc.) depending on the use case (how you want to handle the data with generative AI) and the data structure.

However, chunking strategies are not limited to designing chunk division. Data processing design is also an important factor, considering how to collect the data to be divided in the first place, and what preprocessing is necessary to improve search and response accuracy.

Data collection

Generally, data that businesses want to utilize is managed in different places. For example, documents such as internal regulations and procedures are generally managed on a company's file server or cloud storage. The more extensive the data that needs to be handled, such as proposals, design documents, and reports, the less realistic it becomes to collect it individually and split it into chunks.

For example, when using a cloud-based vector search service such as Amazon Bedrock Knowledge Bases, Azure AI Search, or Vertex AI Search, you need to consider how to transport the data to the cloud where the vector search runs.

There are various options for collecting data depending on the system environment in which the data you want to use is stored. For example, if it is a file, you can consider integration via FTP, SFTP, or HULFT. If it is a database, you can consider integration via ODBC or JDBC, or if a dedicated API is provided, API integration.

Adding metadata

After collecting the data, we add secondary information to it, called metadata (data about the data), which improves the accuracy of vector searches and allows specific information to be taken into account when generating the final LLM answer.

Adding tag information as metadata is an effective way to improve search accuracy. Tag information simply classifies data according to specific criteria, for example whether it relates to "finance" or "human resources," whether it is a "proposal" or a "procedure," or whether it concerns "manufacturing" or "finance." This information does not need to be organized in advance; data can be tagged dynamically in conjunction with an LLM.

Information that should be taken into account when answering includes the update date and the path or URL indicating the location of the data. The update date shows how current the data is; knowing it lets the LLM recognize whether information is new or old and give answers that take freshness into account. Presenting the path or URL alongside the answer lets users trace back to the original data, which is useful when they want to check the source in addition to the LLM's answer.
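A record combining chunk text with this kind of metadata might be assembled as follows. The field names (`tags`, `updated_at`, `source_path`) are illustrative assumptions, not a fixed schema of any particular vector search service.

```python
from datetime import date

def with_metadata(chunk_text: str, source_path: str, tags: list[str]) -> dict:
    """Wrap a chunk with metadata used for filtering and for answer grounding."""
    return {
        "text": chunk_text,
        "metadata": {
            "tags": tags,                                 # e.g. "finance" / "procedure"
            "updated_at": date(2024, 6, 1).isoformat(),   # freshness signal for the LLM
            "source_path": source_path,                   # lets users trace the original
        },
    }

record = with_metadata(
    "Expense claims must be filed within 30 days.",
    "file://server/regulations/expenses.pdf",
    ["finance", "procedure"],
)
```

At query time the tags can pre-filter candidate chunks, while `updated_at` and `source_path` travel with the retrieved text into the LLM prompt and the final answer.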

Structuring Data

In addition to metadata, the approach to the data itself is also important. Recursive chunking and structure-aware chunking, as mentioned above, require a clear structure within the text. However, most data does not have structures or separators that are easy for LLM to recognize.

Data structuring can be achieved by integrating data with OCR and LLM. For example, by integrating with Azure Document Intelligence, PDF documents can be converted to Markdown notation.

If we can convert it to Markdown syntax, we can perform recursive chunking using the "#" that indicates a heading as a separator, or we can perform structure-aware chunking using the Markdown structure itself.

iPaaS for data integration

When considering a chunking strategy, it is important to design the data processing that involves collecting data from the source, performing the necessary preprocessing and adding metadata, and then passing it on to the vector search service.

This data processing requires collaboration with various systems and services, such as file servers and cloud storage that manage the data, LLM and OCR for processing the data, and vector search services.

The data integration platform "iPaaS (Integration Platform as a Service)" can integrate various systems and services and realize data processing as a centralized pipeline.

iPaaS-based data integration platform HULFT Square


HULFT Square is a Japanese iPaaS (cloud-based data integration platform) that supports "data preparation for data utilization" and "data integration that connects business systems." It enables smooth data integration between a wide variety of systems, including various cloud services and on-premise systems.

▼I want to know more about iPaaS
iPaaS | Glossary

Finally

In this article, we focused on chunking, which improves the accuracy of vector search, a key component when using generative AI to put internal and external data to business use, and introduced the basic elements and approaches of chunking as well as key points for data integration.

When using RAG and AI agents, attention tends to focus on business use cases and output, but continuous efforts to improve accuracy are essential for stable operation. Strategic chunking and data integration play an important role in the sustainable development of such generative AI applications.

About the author

Affiliation: Data Integration Consulting Department, Data & AI Evangelist

Shinnosuke Yamamoto

After joining the company, he worked as a data engineer, designing and developing data infrastructure, primarily for major manufacturing clients. He then became involved in business planning for the standardization of data integration and the introduction of generative AI environments. Since April 2023, he has been working as a pre-sales representative, proposing and planning services related to data infrastructure, while also giving lectures at seminars and acting as an evangelist in the "data x generative AI" field. His hobbies are traveling to remote islands and visiting open-air baths.
(Affiliations are as of the time of publication)
