Vector database
"Vector database"
This glossary explains keywords that help build the mindset needed for data utilization and successful DX.
This entry looks at the technical points involved in utilizing generative AI and deep learning, both of which are currently attracting a great deal of attention.
What is a vector database?
A vector database is a data infrastructure for handling the "vectorized data" used by neural networks. It is also sometimes called a vector store.
Neural networks take numbers as input and produce numbers as output, so any external data must be vectorized (converted into numbers) before it can be used. When working with large amounts of data, you therefore end up handling large amounts of vectorized data.
Since generative AI is also built on neural networks, vector databases are often used to implement RAG and similar systems.
>Reference:
⇒ Vectorization / Embedding | Glossary
A data infrastructure that utilizes "vectorized data" is necessary
At the time of writing, generative AI such as ChatGPT is a hot topic, and vector databases are attracting attention as a useful technology for utilizing generative AI and neural networks in general.
For details, please refer to the "Vectorization" article; in short, if you want a generative AI to process your data, that data must be vectorized. This is because neural networks (deep learning is also a type of neural network) take numbers as input and output numbers.
For more information:
⇒ Vectorization / Embedding | Glossary
If you want to use generative AI on the various data your company possesses, you will (roughly speaking) need to vectorize that data, converting it into numerical form. In practice, this means you will be handling large amounts of vectorized data.
Just as general databases allow for the smooth utilization of large amounts of data, vector databases are databases designed to allow for the smooth utilization of vectorized data.
Differences from general databases
Numerical values can also be handled in widely used general databases such as RDB. Vector data is essentially a "collection of numerical values," so if you record the numerical values as an "array type," you can store and use them in commonly used databases. In other words, vectorized data itself can be stored in conventional databases.
What makes vector databases different from conventional databases is that they have functions tailored for use in neural networks such as generative AI. For example, many vector databases have a function that calculates the similarity of vectors and performs searches and other operations, which is often not found in general databases.
- Example of general database usage: Search for data from an employee database where the employee's branch office is in Osaka and the department is the sales department.
- Example of using a vector database: Search for and output 10 vectors that are similar to this vector in descending order of similarity
Searching for similar data can be written in SQL in a general database, but it typically requires scanning the entire table, which makes the process inefficient and complicated to implement, so existing databases do not handle it well.
This is why databases were created that are designed from the start to efficiently perform the vector-data operations often needed when using generative AI or neural networks: these are vector databases.
What is "similarity"?
Vector databases are sometimes introduced with slogans (hype) such as "you can search for things that are similar in meaning." This may make it seem like a "difficult process" to calculate similarity. However, in most cases, it is just a simple calculation performed on vector data.
The methods for calculating similarity differ depending on the vector database, but we will introduce some typical (and easy to understand) methods. By the way, it won't be a problem if you don't understand the details, so feel free to skip ahead even if you don't like formulas.
Cosine distance (cosine similarity)
Think of the vectors from the linear algebra you learned in high school math. Cosine distance (cosine similarity) measures whether two vectors point in the same direction, and is a commonly used measure of similarity.
Cosine distance = (dot product of vectors A and B) ÷ (magnitude of vector A × magnitude of vector B)
Think of vectors as arrows, and ignore the size of the arrows and only consider the direction they are pointing. If two vectors are pointing in exactly the same direction, they will have the maximum value of "1", and if the vectors are pointing in opposite directions, they will have the minimum value of "-1".
When using cosine distance to find similarities, you are essentially calculating "find the top 10 vectors that are pointing in the same direction as this vector."
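As a minimal sketch of the formula above (plain Python with no libraries; the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # same direction → 1.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction → -1.0
```

Note that magnitude is ignored: [1, 0] and [2, 0] still score a perfect 1.0 because they point the same way.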
Inner product (dot product)
The dot product itself can also be used as the similarity measure, which simplifies the calculation.
Dot product distance = (dot product of vectors A and B)
If the stored vectors are normalized (pre-adjusted) to have a magnitude of 1, simply calculating the dot product will produce the same result as cosine similarity.
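A quick check of that claim, as a sketch (function names are illustrative): after normalizing both vectors to magnitude 1, the plain dot product matches cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to magnitude 1 so that a plain dot product equals cosine similarity."""
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v]

a, b = normalize([3, 4]), normalize([4, 3])
print(dot(a, b))  # ≈ 0.96, the same value cosine similarity would give
```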
Euclidean distance
The "calculation of distance in space" that we learned in school can also be used to calculate similarity. Sorry, the notation is a little confusing, but it is the same as the formula for "distance on a plane" or "distance in space."
Euclidean distance = √((A1 – B1) squared + (A2 – B2) squared + …)
While this is a familiar formula you learned in school, it's a bit more complicated and less efficient than the ones we've listed so far. When you use Euclidean distance to find similarities, you're essentially calculating "find the top 10 vectors that are spatially closest to this vector."
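As a sketch (the function name is illustrative):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points (vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The familiar 3-4-5 right triangle: distance from the origin to (3, 4).
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```

Unlike cosine similarity, a smaller value here means more similar, so "top 10" means the 10 smallest distances.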
Many vector databases support "cosine distance," "inner product," and "Euclidean distance."
Manhattan Distance
I will also introduce other distances (similarity) that can be calculated with "simple calculations."
Manhattan distance = |A1 − B1| + |A2 − B2| + …
It is simply the sum of the absolute differences of each component. It is called the Manhattan distance because it corresponds to the travel distance in a city like Manhattan in New York, or the old grid-planned capital Heian-kyō (modern Kyoto), where roads run only north-south and east-west and diagonal movement is not possible. It is easy to calculate and judges similarity somewhat like Euclidean distance.
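As a sketch (the function name is illustrative):

```python
def manhattan_distance(a, b):
    """Sum of the absolute differences of each component ("grid travel" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Walking from (0, 0) to (3, 4) on a street grid: 3 blocks east + 4 blocks north.
print(manhattan_distance([0, 0], [3, 4]))  # → 7
```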
Hamming distance
There is an even simpler calculation: compare the components of the two vectors one by one and count the positions where they differ (the fewer the differences, the more similar the vectors).
Hamming distance = number of differing components
If the vector components take only the two values 0 and 1, checking whether a pair differs is just an XOR, so the calculation is very cheap.
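As a sketch (the function name is illustrative):

```python
def hamming_distance(a, b):
    """Number of component positions where the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# For 0/1 vectors this is just an XOR of each pair of bits, then a count.
print(hamming_distance([0, 1, 1, 0], [0, 1, 0, 1]))  # → 2 (last two positions differ)
```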
There are many other ways to calculate similarity.
Why does this calculation amount to calculating semantic similarity?
Vector databases are often introduced as something amazing, such as "a great technology that can find text with similar meaning" or "it can also be used for recommendation engines." However, as explained above, they actually only do a little more than calculate an inner product.
"Vectorization" makes this amazing thing happen
There may be a big gap between the simple calculations actually performed and the idea that "you can search for text data with similar meanings." In fact, it is not the vector database itself that does something amazing; the amazing part is the process of vectorizing (quantifying) the original data. For example, "searching for text data with similar meanings" breaks down as follows:
- Vectorization: Text data → Vector (numerical value)
- Convert the input text data into vector data such that "text with similar meanings has a close cosine distance"
- Vector database: stores a large amount of vector data
- It can quickly find vector data with close cosine distances from a large number of stored vector data.
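The division of labor above can be sketched in a few lines of Python; the toy "store" (a dict of hand-made 2-dimensional vectors) stands in for a real vector database and embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "vector store": document id → vector produced by some vectorization process.
store = {"doc1": [1.0, 0.0], "doc2": [0.9, 0.1], "doc3": [0.0, 1.0]}

def top_k(query_vector, k=2):
    """Brute-force nearest-neighbor search by cosine similarity."""
    ranked = sorted(store, key=lambda d: cosine_similarity(query_vector, store[d]),
                    reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.05]))  # → ['doc1', 'doc2']
```

A real vector database replaces this brute-force sort with approximate nearest-neighbor indexes so the search stays fast even with millions of vectors.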
In other words, it is not the vector database but the vectorization (quantification) that does something amazing: it converts data into "mysterious numbers whose semantic similarity can be computed with a mere dot product." A vector database is the data infrastructure that lets you smoothly utilize the vector data produced by that conversion.
Example of converting "similar" into a calculable number
So how is "meaning" handled in this vectorization? Let's take "Word2Vec," a Google-developed tool that vectorizes (quantifies) the "meaning of words," as an example.
Based on the hypothesis that the meaning of a word is determined by the distribution of words that appear before and after it, Word2Vec is trained on a large amount of text data, with the criterion that if the words that appear before and after a word are similar, then the meaning is also similar. Therefore, it does not understand meaning in the same way as humans, but simply converts it into a numerical value based on that hypothesis.
- Input: word
- Output: Vector data
- Properties of vector data: The more similar the words that appear before and after it are, the closer the cosine distance is to 1, and the less similar the words are, the closer the value is to -1.
- Word2Vec: A neural network trained to obtain "an output that satisfies the above properties" when given input using a large amount of text data as training data.
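The distributional hypothesis itself ("similar surrounding words → similar meaning") can be demonstrated even without a neural network, using raw co-occurrence counts as the vectors. The tiny corpus below is invented for illustration:

```python
import math
from collections import defaultdict

# Toy corpus: "cat" and "dog" appear in similar contexts, "car" does not.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the car drove on the road",
]

# Co-occurrence vectors: for each word, count its immediate neighbours.
vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
vectors = defaultdict(lambda: [0] * len(vocab))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                vectors[w][index[words[j]]] += 1

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Words with similar surrounding words get vectors pointing in similar directions.
print(cos(vectors["cat"], vectors["dog"]))  # ≈ 1: identical contexts
print(cos(vectors["cat"], vectors["car"]))  # lower: different contexts
```

Word2Vec learns far richer, denser vectors than these raw counts, but the principle it exploits is the same.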
Because it is a neural network, humans cannot follow the internal logic by which the conversion is performed. And although the output vector data demonstrably has the specified properties, humans cannot interpret what its individual values mean; it appears as a long string of mysterious numbers.
Most vectorization processes output such incomprehensible data, but because the data has the specified properties, calculating the cosine distance tells you whether two items are similar in meaning, and this is exactly what makes the data useful in a vector database.
Other applications of vectorization
RAG is realized using this ability to "search for text data with similar meanings." It, too, relies on a vectorization process trained, under some hypothesis about what makes text (documents) semantically similar, to convert similar data into vector data that is close in distance.
You may have heard that "recommendation systems can be created." This, for example, is realized by a vectorization process trained on actual purchase data so that someone who "bought products A, B, and C" ends up close in distance to "product D" (which they are likely to buy) and far from "product E" (which they are not).
I think there are other things you can do with it, such as "finding similar images."
- "Find" or "make" your own vectorization process that can achieve what you want by calculating cosine distance, Euclidean distance, etc.
- Vectorize the data you want to process using "Vectorization Processing"
- By storing vectorized data in a vector database, it becomes possible to search for similar data using cosine distance, Euclidean distance, etc.
In other words, the vectorization process that should be used varies depending on the needs.
A large number of vector databases have been developed
Currently, a large number of vector databases are being developed around the world. At the time of writing, a product called Pinecone seems to be well known, but new products with new features are appearing every day, and GAFA companies, for example, are also developing their own products and providing services.
Furthermore, if future needs call for vector-data computations that differ from today's mainstream similarity calculations such as cosine distance, the functions provided by vector databases may change significantly, and the widely used products may also differ from today's.
Furthermore, it is becoming possible to use vector data based on existing databases. For example, by adding the extension "pgvector" to the widely used PostgreSQL, it can be used as a vector database, and support for traditional databases is also progressing.
When an existing database such as PostgreSQL is used as a vector database, it can still be used as a normal RDB. PostgreSQL has likewise been extended beyond its original RDB functionality to serve as a NoSQL database or to process geospatial data, so various usages can conveniently be combined.
When high performance is required, such as when handling large amounts of vectorized data, a dedicated product may be needed to ensure sufficient performance; when that is not the case, a product based on an existing database you are already familiar with may be the better choice. The decision should be made based on the purpose and scale of use.
Establishing a "data usage environment" necessary for utilizing vector databases
Finally, let's consider the perspective of "utilizing data to achieve business results."
A vector database is not magic; introducing one will not accomplish amazing things by itself. You need to prepare data according to what you want to do, vectorize it appropriately for that purpose, and load it into a suitable vector database.
We have also mentioned that when using vectorization, you need to choose the vectorization process (and vector database) to use depending on what you want to do. Furthermore, you also need to retrieve the data required for that purpose.
- There is a wide variety of data
- Tabular data, text data, image data, video data, etc.
- Stored in various locations
- Excel file stored on the work site PC
- Normalized and stored in a database (RDB)
- Fixed-length data in mainframes
- Files are scattered across the company's shared file server
- Dumped into cloud object storage used as a data lake
As you can see, real-world data exists in a wide variety of forms, and it is necessary to retrieve the necessary data from data scattered throughout the company. Furthermore, vectorization is not possible without appropriate preprocessing in accordance with the vectorization service being used.
There are many different data utilization needs from the business side, and while some can be achieved using vectorization, others cannot (e.g., visualizing sales using a BI tool), and there may be cases where it is necessary to combine vectorization with non-vectorization. It is necessary to think comprehensively from the perspective of data utilization as a whole, not just vector databases.
Technology itself is also changing every day. Until recently, there was no talk of using vector data, and new vectorization services and vector databases are being created every day.
In other words, data, usage needs, and technology are all constantly changing. To achieve business results, it is necessary to create an ever-changing "data usage environment" that can effectively combine and utilize these many constantly changing elements as needed.
"RAG Swamp" (The Reality of Using Vector Databases)
At the time of writing, the most typical situation for "vectorizing data and utilizing vector databases" is probably as a means of realizing RAG (Retrieval-Augmented Generation).
⇒ Retrieval-Augmented Generation (RAG) | Glossary
When using a vector database for RAG, the perspective of "improving the data usage environment" is also very important.
RAG (Retrieval-Augmented Generation)
Generative AI has become popular, and even if companies want to use it in their own companies, the specific ways in which it can be used tend to be limited to "aiding in idea generation" or "assisting with administrative work." Why is this the case? I think the main reason is that when you ask generative AI questions, the answers it gives are "answers based on general knowledge," and do not take into account your company's business or circumstances.
RAG (Retrieval-Augmented Generation) is therefore a much-anticipated approach: by supplying the generative AI with internal documents related to the text the user enters, it enables answers that take the company's own business into account.
- search:
- Search a vector database based on user-entered text
- Retrieve internal documents related to the user's input (question) from the search results
- Augmented generation:
- In addition to user input, the content of the searched document is input to the generative AI (LLM) using In-Context Learning to obtain a response.
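The retrieval half of those two steps can be sketched as follows; `embed` is a deliberately crude stand-in for a real embedding model, and the two "internal documents" are invented:

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embed(text):
    """Placeholder vectorization: a real system would call an embedding model here."""
    keywords = ["expense", "travel", "holiday"]
    return [text.lower().count(w) for w in keywords]

# Toy vector store of internal documents: (text, vector) pairs.
store = [
    ("Expense claims must be filed within 30 days.", embed("expense expense")),
    ("Travel must be booked through the portal.", embed("travel travel")),
]

def rag_prompt(question, k=1):
    # Search: retrieve the k internal documents most similar to the question.
    ranked = sorted(store, key=lambda p: cos(embed(question), p[1]), reverse=True)
    context = "\n".join(text for text, _ in ranked[:k])
    # Augmented generation: this context + question would now go to the LLM.
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("How do I file an expense claim?"))
```

The quality of the final answer hinges on the retrieval step returning the right documents, which is exactly where the preprocessing and vectorization choices discussed below come in.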
To achieve this, it is necessary to "pre-process a wide variety of data, vectorize it, and store it in a vector database."
- Read a wide variety of documents and data sources scattered across the enterprise
- Preprocess it appropriately, vectorize it appropriately, and store it in a vector database
- Search vector databases to find relevant documents
Unfortunately, it seems that RAG often does not achieve sufficient performance (response accuracy) by simply implementing the above mechanism. It is a very frustrating situation when a RAG system is built with high expectations but cannot be used in a practical manner.
In such cases, in order to "improve the accuracy of answers," you may be forced to continue a long, unpredictable, tedious trial and error process regarding "how to preprocess" and "how to vectorize." This difficult situation has come to be called the "RAG swamp."
The "connecting" technology necessary to fight the "RAG Swamp"
To begin with, a vector database simply stores vector data (numerical values) and calculates dot products, cosine distances, etc. Therefore, not just RAG, whether or not a vector database can realize its potential and be used effectively will depend on whether the surrounding data usage environment can be improved.
Furthermore, like the "RAG Swamp," there are cases where "repeated, tedious trial and error" is necessary to achieve results. In such cases, it is important not only to establish a data usage environment, but also to "establish a data usage environment that allows repeated trial and error with a wide variety of data sources."
Furthermore, to achieve results, it is often desirable to have people on the front lines who have a good understanding of the business involved, which means that it is also desirable to have a "no-code data usage environment" that is easy for non-engineers to use.
Please utilize "connecting" technology
There are technologies that let you efficiently build, using only a GUI, data environments that connect to a wide variety of systems and data in the cloud, reading, processing, and transferring data as needed. These are "connecting" technologies such as "DataSpider" and "HULFT Square," also known as "EAI," "ETL," and "iPaaS."
Can be used with GUI only
Unlike regular programming, there is no need to write code. By placing and configuring icons on the GUI, you can achieve integration with a wide variety of systems, data, and cloud services.
Being able to develop using a GUI is also an advantage
No-code development using only a GUI may seem like a simple compromise compared to full-scale programming. However, being able to develop using only a GUI allows on-site personnel to proactively work on cloud integration themselves. On-site personnel are the ones who know the business best.
Full-scale processing can be implemented
There are many products that claim to allow development using only a GUI, but some people may have a negative impression of such products as being too simple.
It is true that things like "it's easy to make, but it can only do simple things," "when I tried to execute a full-scale process it couldn't process and crashed," or "it didn't have the high reliability or stable operating capacity to support business operations, which caused problems" tend to occur.
"DataSpider" and "HULFT Square" are easy to use, but also allow you to create processes at the same level as full-scale programming. They have the same high processing power as full-scale programming, as they are internally converted to Java and executed, and have a long history of supporting corporate IT. They combine the benefits of "GUI only" with full-scale capabilities.
No need to operate in-house as it is iPaaS
DataSpider can be operated securely on a system under your own management. With HULFT Square, a cloud service (iPaaS), this "connecting" technology itself can be used as a cloud service without the need for in-house operation, eliminating the hassle of in-house implementation and system operation.
Related keywords (for further understanding)
Machine learning related keywords
- Machine Learning
- AutoML
- Fine Tuning
- Transfer Learning
- In-Context Learning
- RAG(Retrieval-Augmented Generation)
- Vectorization
- Vector Database
Keywords related to Generative AI/ChatGPT
- Generation AI
- Large Language Model (LLM)
- ChatGPT
- Prompt Engineering
Keywords related to data integration and system integration
- EAI
- It is a concept of "connecting" systems by data integration, and is a means of freely connecting various data and systems. It is a concept that has been used since long before the cloud era as a way to effectively utilize IT.
- ETL
- In the recent trend of actively working on data utilization, the majority of the work is not the data analysis itself, but rather the collection and preprocessing of data scattered in various places, from on-premise to cloud.
- iPaaS
- A cloud service that "connects" various clouds with external systems and data simply by operating on a GUI is called iPaaS.
Are you interested in "iPaaS" and "connecting" technologies?
Try out our products that allow you to freely connect various data and systems, from on-premise IT systems to cloud services, and make successful use of IT.
The ultimate "connecting" tool: data integration software "DataSpider" and data integration platform "HULFT Square"
"DataSpider," data integration tool developed and sold by our company, is a "connecting" tool with a long history of success. "HULFT Square," a data integration platform, is a "connecting" cloud service developed using DataSpider technology.
Another feature is that development can be done using only the GUI (no code) without writing code like in regular programming, so business staff who have a good understanding of their company's business can take the initiative to use it.
Try out the "connecting" technology of DataSpider / HULFT Square:
There are many simple collaboration tools on the market, but this tool can be used with just a GUI, is easy enough for even non-programmers to use, and has "high development productivity" and "full-fledged performance that can serve as the foundation for business (professional use)."
It can smoothly solve the problem of "connecting disparate systems and data" that hinders successful IT utilization. We regularly offer free trials and hands-on sessions where you can try it out for free, so we hope you will give it a try.
Why not try a PoC to see if "HULFT Square" can transform your business?
Why not try verifying how "connecting" can be utilized in your business, the feasibility of solving problems using data integration, and the benefits that can be obtained?
- I want to automate data integration with SaaS, but I want to confirm the feasibility of doing so.
- We want to move forward with data utilization, but we have issues with system integration
- I want to consider a data integration platform to achieve DX.