Data exploration in the AI era: Vector search techniques and data integration methods explained

As generative AI is increasingly used in business, making use of unstructured data (text, images, audio, etc.) that conventional keyword search cannot handle has become a challenge. "Vector search" addresses this challenge and makes data exploration possible in the age of AI.
In this article, we explain the vector search techniques that support the AI era, and how to integrate data between vector databases and the data sources where data assets lie dormant.

What is a vector/vector database?

To understand vector search, it is first important to understand what a vector is and how a vector database, which stores vectors, differs from a traditional database.

What is a Vector?

A vector is a numerical representation of unstructured data such as text, images, audio, or video.

For example, the RGB color model, which represents the intensity of red, green, and blue light, can be broadly considered a three-dimensional vector representation. In the RGB color model, red (R), green (G), and blue (B) each take values between 0 and 255, and colors are expressed by balancing these values.

In this model, the color "red" can be expressed as the vector (230,0,51), and the color "blue" as (0,149,217). Red is characterized by a large R value, and blue by a large B value. Meanwhile, (136,72,152), which lies roughly midway between red and blue, represents the color "purple." In this way, the numbers in a vector carry meaning.

Additionally, one of the characteristics of vector representation is that words with similar meanings have numerically close values even when the words themselves are different. Taking the RGB color model as an example, the color names "light blue" and "sky blue" are completely different words, but they call similar colors to mind. In fact, when expressed in RGB, light blue (188,226,232) and sky blue (160,216,239) have similar values.
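The closeness of these colors can be checked numerically. The following is a minimal sketch that treats each RGB triple quoted above as a three-dimensional vector and compares them by Euclidean distance. The function names `euclidean` and `nearest`, and the choice of Euclidean distance itself, are assumptions made for illustration.

```python
import math

# RGB values quoted in the text above.
COLORS = {
    "red": (230, 0, 51),
    "blue": (0, 149, 217),
    "purple": (136, 72, 152),
    "light blue": (188, 226, 232),
    "sky blue": (160, 216, 239),
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name):
    """Return the other color whose vector is closest to `name`."""
    others = (c for c in COLORS if c != name)
    return min(others, key=lambda c: euclidean(COLORS[name], COLORS[c]))

print(nearest("light blue"))  # sky blue
```

The two color names share only the word "blue" as text, yet their vectors are far closer to each other than either is to red.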

In practice, words are expressed as vectors with hundreds or thousands of dimensions and are positioned according to their semantic similarity, so even words that look completely different lie close together when their meanings are similar. The process of converting words into this vector representation is called "embedding."

What is a vector database?

A vector database is a database specifically designed to store vector-represented data, create indexes for searching, and execute search queries.

Traditional databases include relational databases with a schema consisting of rows and columns, such as transaction information or customer information, and NoSQL databases that can handle large amounts of unstructured or semi-structured data with flexible schemas, such as social media posts. While these traditional databases can process queries based on predefined structures (schema, keys, etc.), they are not suitable for searches based on semantic patterns or similarities.

A vector database works with data that has been converted into numerical representations called vectors, making it possible to process queries based on numerical proximity calculations. In other words, even if the vocabulary and expressions used in each piece of data are different, search results can be obtained as long as the data are similar in meaning.
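To make the idea of proximity-based queries concrete, here is a minimal sketch of an in-memory store that ranks entries by cosine similarity. `TinyVectorStore` is a hypothetical stand-in, not the interface of any real vector database, and the three-dimensional vectors are hand-assigned toy values rather than the output of an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self._rows = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._rows.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k stored texts whose vectors are most similar to the query."""
        scored = [(cosine(query_vector, v), t) for t, v in self._rows]
        scored.sort(reverse=True)
        return [t for _, t in scored[:k]]

store = TinyVectorStore()
# Hand-assigned toy vectors; a real system would obtain them from an embedding model.
store.add("remote work policy", [0.9, 0.1, 0.0])
store.add("telework guidelines", [0.8, 0.2, 0.1])
store.add("cafeteria menu", [0.0, 0.1, 0.9])

print(store.search([0.85, 0.15, 0.05], k=2))
```

The query vector shares no words with anything, because it is not text at all; the two work-style entries are returned simply because their vectors point in nearly the same direction.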

With the advent of generative AI, "natural language," the language humans use on a daily basis, has become commonplace in data utilization. Natural language uses different vocabulary and expressions depending on the speaker. In other words, it is characterized by constant variation in wording. A vector database can be described as a search mechanism that responds flexibly to this variation.

Search Methods

Now that we've introduced the concepts of vectors and vector databases, what search methods are there? We'll introduce traditional keyword search, semantic search, which overcomes the challenges of keyword search, and hybrid search, which combines the characteristics of both.

Keyword search

Keyword search looks for partial or exact matches to the words or phrases entered by the user, or ranks results by how frequently a word occurs. This search method is effective when the data to be searched is written in a strict, consistent manner or when the keywords to search for are known.

One of the most notable challenges with keyword search is its inability to identify spelling variations and synonyms. For example, if a user searches for "telework," a keyword search may miss results that use a variant spelling such as "tele-work" or a synonym such as "remote work." Similarly, if a user searches for "manual," the search fails to recognize synonyms like "procedure manual" or "operation guide."

Additionally, keyword search does not take into account the user's context, making it difficult to respond to ambiguous searches. For example, a complex query with multiple intents, such as "evaluation system changes," may not find the specific information the user is looking for, such as "new evaluation criteria" or "how to conduct evaluation interviews," and may return results that differ from what the user expected.

The writing style of user questions often differs from that of explanatory text such as internal company documents, so keyword searches may not yield appropriate results.
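The synonym problem described above is easy to reproduce. The sketch below implements keyword search as a plain case-insensitive substring match; the document titles are made-up examples, and `keyword_search` is a hypothetical helper rather than any product's search function.

```python
# Made-up document titles for illustration.
DOCUMENTS = [
    "Telework application procedure",
    "Remote work equipment rental",
    "Operation guide for expense claims",
]

def keyword_search(query, documents):
    """Return documents that contain the query as a case-insensitive substring."""
    needle = query.lower()
    return [d for d in documents if needle in d.lower()]

print(keyword_search("telework", DOCUMENTS))  # the "Remote work" document is missed
print(keyword_search("manual", DOCUMENTS))    # no document uses the word "manual"
```

Both misses are correct behavior for a literal match, which is exactly why a search based on meaning is needed for these cases.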

Semantic Search

Semantic search (also known as "vector search"), unlike traditional keyword search, retrieves information based on the user's context and on the meaning and similarity of words. This approach captures the meaning and intent behind the words being searched, making it possible to provide contextually relevant results even if the keywords are not an exact match.

Semantic search can handle not only text but also unstructured data such as images, audio, and video in a unified manner, making it useful for a wide range of AI applications, including recommendation, inquiry response, and multimodal search.

In recent years, semantic search technology has expanded its possibilities through combination with generative AI. A typical approach is RAG (Retrieval-Augmented Generation).

RAG combines a search system with an LLM (large language model) to dynamically generate answers based on external data sources. The answers an LLM can generate on its own are limited to the knowledge it gained from its training data, so that knowledge becomes outdated over time. An LLM can also present false information that is not based on facts, a phenomenon known as hallucination.

RAG enables an LLM to provide well-founded answers by "grounding" it in external, up-to-date, and reliable sources. At the heart of this search function is semantic search using a vector database.
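The grounding step can be sketched as follows: retrieved passages are placed into the prompt together with the user's question before the LLM is called. This is a minimal sketch under stated assumptions. The prompt wording, the function names, and the two sample passages are all invented for illustration, and `fake_retrieve` and `fake_generate` are stand-ins so that the code runs without a vector database or an LLM.

```python
def build_grounded_prompt(question, passages):
    """Combine retrieved passages and the user question into one prompt string."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )

def answer_with_rag(question, retrieve, generate, k=2):
    """retrieve(question, k) -> list of passages; generate(prompt) -> answer text."""
    passages = retrieve(question, k)
    return generate(build_grounded_prompt(question, passages))

# Stand-ins so the sketch runs on its own; the passages are made-up examples.
def fake_retrieve(question, k):
    corpus = ["Evaluation interviews are held every March.",
              "New evaluation criteria took effect in April."]
    return corpus[:k]

def fake_generate(prompt):
    return f"(LLM output for a prompt of {len(prompt)} characters)"

print(answer_with_rag("When are evaluation interviews held?", fake_retrieve, fake_generate))
```

Passing `retrieve` and `generate` as parameters keeps the search side and the generation side independently replaceable, which matches the way RAG separates the two concerns.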

Hybrid Search

While semantic search excels at understanding context, it has weaknesses when it comes to exact matching of specific proper nouns or keywords. Hybrid search is gaining attention as a way to overcome this limitation and improve search quality.

Hybrid search balances completeness and relevance by combining keyword and semantic search. Keyword search provides results that match specific words exactly, while semantic search provides results that are semantically similar without requiring an exact match.

By merging the results of both searches into a single ranking, search quality improves: exact keyword matches are preserved while semantic patterns and similarity-based relevance are also taken into account.
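The article does not say how the two result lists are merged. One widely used option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method intended here. The document IDs are hypothetical, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; each list is ordered best-first.

    A document earns 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword search output
semantic_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical semantic search output

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only rank positions, so it does not require the keyword scores and the similarity scores to be on a comparable scale. Documents found by both searches (`doc_b`, `doc_a`) rise to the top.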

Vector Search Process

Vector search, whether semantic or hybrid, is not a single data processing step but a pipeline consisting of multiple steps. To improve search accuracy, it is important to assess the efficiency and accuracy of the entire pipeline.

Vectorization (Embedding)

The first step is to convert any unstructured data to be searched into numerical vectors, using appropriate language and/or multimodal models depending on the type of data, such as text, images, or audio.
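The shape of this step, text in and a fixed-length numeric vector out, can be shown with a toy function. The sketch below hashes each word into one of a fixed number of slots. It is an assumption-laden placeholder: it captures word overlap only, not meaning, and `toy_embed` is a hypothetical name. A real pipeline would call a language or multimodal embedding model at this point.

```python
import hashlib
import math

def toy_embed(text, dim=16):
    """Map text to a unit-length vector by hashing each word into one of `dim` slots.

    This captures word overlap only, not meaning; a real pipeline would call
    an embedding model here instead.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # first byte of the hash picks the slot
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

v = toy_embed("vector search for unstructured data")
print(len(v))  # 16
```

Whatever model is used, the important property is the same: every input, regardless of length, becomes a vector of the same dimension so that vectors can be compared with one another.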

Indexing

The resulting large set of vectors is then stored in a specialized data structure, the index, which is optimized for fast lookup of similar vectors.
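The article does not name a specific index structure. As one illustration of the idea, the sketch below uses random-hyperplane locality-sensitive hashing: vectors that fall on the same side of every random plane share a bucket, so a query only has to inspect its own bucket instead of every stored vector. `HyperplaneIndex` is a hypothetical toy, and production vector databases typically use more elaborate structures.

```python
import random
from collections import defaultdict

class HyperplaneIndex:
    """Toy locality-sensitive hashing index using random hyperplanes."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self._planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self._buckets = defaultdict(list)

    def _signature(self, vector):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, vector)) >= 0 for plane in self._planes
        )

    def add(self, item_id, vector):
        self._buckets[self._signature(vector)].append((item_id, vector))

    def candidates(self, query_vector):
        """Return only the items sharing the query's bucket (approximate lookup)."""
        bucket = self._buckets.get(self._signature(query_vector), [])
        return [item_id for item_id, _ in bucket]

index = HyperplaneIndex(dim=3)
index.add("east", [1.0, 0.0, 0.0])
index.add("west", [-1.0, 0.0, 0.0])
print(index.candidates([5.0, 0.0, 0.0]))  # ['east']
```

The trade-off is that lookup becomes approximate: a similar vector that happens to land in a neighboring bucket is not returned, which is the usual price of avoiding a full scan.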

Query Execution

When a user performs a search, the search query entered by the user is also vectorized. This vectorized query is compared with the indexed vectors in the database, and the closest matches are returned as search results. This comparison is based on the proximity (distance) between vectors, rather than an exact match of keywords.
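Which notion of "proximity" is used affects the result. The article does not specify a measure; cosine similarity and Euclidean distance are two common choices, and the sketch below, using made-up two-dimensional vectors, shows that they can disagree about the best match.

```python
import math

def cosine_similarity(a, b):
    """Higher is closer: compares direction only, ignoring vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Lower is closer: compares position, so vector length matters."""
    return math.dist(a, b)

query = [1.0, 1.0]
indexed = {
    "same direction, longer": [4.0, 4.0],
    "nearby, different direction": [1.0, 0.0],
}

by_cosine = max(indexed, key=lambda k: cosine_similarity(query, indexed[k]))
by_euclid = min(indexed, key=lambda k: euclidean_distance(query, indexed[k]))
print(by_cosine)  # same direction, longer
print(by_euclid)  # nearby, different direction
```

For this reason the measure used at query time should match the one the index was built with, and, where an embedding model is documented as being intended for a particular measure, that measure as well.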

Reranking

To improve search quality, reranking may be performed: the relevance of each retrieved result to the user's input is reevaluated, and the results are reordered accordingly. Reranking reevaluates the large number of search results obtained and narrows them down to the top few that are passed to the LLM, allowing the LLM to generate answers based on more relevant information.
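The two-stage structure can be sketched as follows. The first stage is assumed to have already returned a broad candidate list; `rerank` then re-scores every candidate against the query and keeps only the top few. The scorer `overlap_score` is a deliberately simple placeholder, since a real reranker would typically be a dedicated model, and the candidate texts are made-up examples.

```python
def rerank(query, candidates, score, top_n=2):
    """Re-score first-stage candidates with `score` and keep the best `top_n`."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]

def overlap_score(query, doc):
    """Placeholder scorer: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Made-up first-stage results, in the order the initial search returned them.
first_stage = [
    "company holiday calendar",
    "how to conduct evaluation interviews",
    "new evaluation criteria",
    "evaluation system changes overview",
]
print(rerank("evaluation system changes", first_stage, overlap_score))
```

Because the scorer runs only on the candidates and not on the whole collection, a slower but more accurate scoring method becomes affordable at this stage.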

Integration with Data Sources

As mentioned above, to search unstructured data by similarity, the data must be stored in a vector database through vectorization and indexing. However, the data we want to utilize is usually stored elsewhere, such as on file servers or in cloud storage. In this final section, we introduce how data from these data sources can be integrated into a vector database.

Collection from data sources

First, consider how you can extract data from its source. In most cases, unstructured data is stored on file servers or in cloud storage.

If the source is a file server, consider whether file-based integration is possible. Specifically, options include data integration using FTP/SFTP or HULFT, or automated export using RPA. For cloud storage, integration via REST API is common, although it depends on the service you use.

The timing of data collection depends on the business nature of the data. For example, if the data is updated daily and the latest data needs to appear in search results, daily batch processing will be necessary. Conversely, if the data is rarely updated or its freshness matters little, monthly or yearly collection may be sufficient.

In addition, the amount of unstructured data generated in business is enormous, and reprocessing all of it on every run, even once a day, is a considerable load. It is important to adopt differential (incremental) integration and collect only the data that has been added or changed.
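One way to collect only what has changed is to record a content hash for each file and compare it on the next run. This is a minimal sketch under that assumption; the article does not prescribe a change-detection method, and the file names and the `plan_sync` helper are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Fingerprint of a file's contents; any change produces a different value."""
    return hashlib.sha256(data).hexdigest()

def plan_sync(current_files, previous_hashes):
    """Compare current file contents with hashes recorded on the previous run.

    current_files: {path: bytes}; previous_hashes: {path: hex digest}.
    Returns (paths to add or update, paths to delete, hashes to record for next run).
    """
    new_hashes = {path: content_hash(data) for path, data in current_files.items()}
    changed = sorted(p for p, h in new_hashes.items() if previous_hashes.get(p) != h)
    deleted = sorted(p for p in previous_hashes if p not in new_hashes)
    return changed, deleted, new_hashes

# Made-up state: what was recorded yesterday versus what exists today.
yesterday = {
    "a.txt": content_hash(b"v1"),
    "b.txt": content_hash(b"same"),
    "old.txt": content_hash(b"x"),
}
today = {"a.txt": b"v2", "b.txt": b"same", "c.txt": b"new"}

changed, deleted, _ = plan_sync(today, yesterday)
print(changed, deleted)  # ['a.txt', 'c.txt'] ['old.txt']
```

Only the changed files need to be vectorized again, and tracking deletions keeps stale content from lingering in search results.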

Registration in the vector database

After data is collected from data sources, it is registered (or updated) in a vector database. The method of registering data varies depending on the vector database used, but we will consider two patterns: direct integration and indirect integration.

Direct integration patterns are divided into integration before vectorization and integration after vectorization. In the latter case, data must be converted into vectors using an embedding model before being registered in the vector database. Integration with a vector database can be achieved via JDBC or REST API, depending on the protocol supported by the vector database.
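For the "after vectorization" case over REST, registration amounts to sending each record together with its vector. The sketch below only builds such a request and does not send it. The host `vectordb.example.com`, the `/records` path, and the JSON field names are all hypothetical; the actual endpoint and payload format depend entirely on the vector database in use.

```python
import json
import urllib.request

def build_upsert_request(base_url, doc_id, text, vector):
    """Build (but do not send) a POST request that registers one vectorized record."""
    payload = {"id": doc_id, "text": text, "vector": vector}  # hypothetical schema
    return urllib.request.Request(
        url=f"{base_url}/records",  # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_upsert_request(
    "https://vectordb.example.com", "doc-1", "telework rules", [0.1, 0.2]
)
print(req.get_method(), req.full_url)
```

In the "before vectorization" case the payload would carry only the ID and text, with the vector database side responsible for producing the embedding.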

In the case of indirect integration, data can be stored in object storage supported by the vector database, and then imported into the vector database synchronously or asynchronously. For example, when using Microsoft Azure's Azure AI Search, data can be placed in Azure Blob Storage, allowing Azure AI Search to import data from Azure Blob Storage.

Integration with object storage can be achieved via a REST API or SDK, depending on the protocol supported by the object storage. The import schedule for the vector database can be controlled by the schedule function on the vector database itself, or, if an API is provided, you can issue an import command to import data synchronously.

Building a Data Pipeline

A data integration mechanism is important for collecting data from data sources and registering it in a vector database. By defining the processing described above as a data pipeline, the series of steps required for registration in the vector database can be run in a coordinated sequence, and the same processing can be repeated consistently day after day.

iPaaS (Integration Platform as a Service), a cloud-based data integration platform, can consolidate and reliably run the series of data supply processes required in the AI era, such as collecting, organizing, and linking scattered data. Data pipelines play a very important role as a mechanism for utilizing data assets lying dormant within a company and steadily supplying this data to AI.
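Expressed in code, a pipeline is an ordered list of steps in which each step receives the output of the previous one. The sketch below is a minimal illustration. The step functions, the sample records, and the dictionary standing in for the vector database are hypothetical, and the "vector" is a placeholder rather than a real embedding.

```python
def run_pipeline(records, steps):
    """Pass the records through each step in order and return the final result."""
    for step in steps:
        records = step(records)
    return records

# Hypothetical steps; each takes and returns a list of dicts.
def collect(_):
    return [{"id": "doc-1", "text": "  Telework rules  "}, {"id": "doc-2", "text": ""}]

def clean(records):
    stripped = [{**r, "text": r["text"].strip()} for r in records]
    return [r for r in stripped if r["text"]]  # drop records with no content

def vectorize(records):
    # Placeholder vector (text length only); a real step would call an embedding model.
    return [{**r, "vector": [float(len(r["text"]))]} for r in records]

registered = {}  # stands in for the vector database

def register(records):
    for r in records:
        registered[r["id"]] = r  # upsert keyed by ID, so reruns do not duplicate
    return records

run_pipeline(None, [collect, clean, vectorize, register])
print(sorted(registered))  # ['doc-1']
```

Registering by ID makes the pipeline safe to rerun: executing it again tomorrow overwrites existing entries instead of duplicating them, which is what allows the same definition to be scheduled daily.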

Conclusion

In this article, we focused on vector search, which is the key to utilizing unstructured data, and introduced the basic concepts of search and the process of preparing data and integrating it into a vector database for use with AI.

Various factors affect the accuracy of responses from generative AI applications such as RAG systems and AI agents. The selection of the vector database and search method, as well as tuning of the search process, are among the important factors in improving accuracy.

At the same time, integrating data sources is essential for the stable use of AI in business. Data pipelines will be an important foundation for promoting the continuous and stable use of AI in business.

The person who wrote the article

Affiliation: Data Integration Consulting Department, Data & AI Evangelist

Shinnosuke Yamamoto

After joining the company, he worked as a data engineer, designing and developing data infrastructure, primarily for major manufacturing clients. He then became involved in business planning for the standardization of data integration and the introduction of generative AI environments. Since April 2023, he has worked as a pre-sales representative, proposing and planning services related to data infrastructure, while also giving lectures at seminars and acting as an evangelist in the "data x generative AI" field. His hobbies are traveling to remote islands and visiting open-air baths.
(Affiliations are as of the time of publication)
