Vectorization/Embedding

This glossary explains keywords that will help you develop the mindset needed for data utilization and successful DX (digital transformation).
This entry looks at a technology related to generative AI and deep learning, both of which are currently attracting a lot of attention.

What is Vectorization/Embedding?

Vectorization, or embedding, is the process of converting data into numerical form so that it can be processed by neural networks, including generative AI and deep learning in general.
Neural networks take numerical values as input and produce numerical values as output, so if you want to process text or image data, for example, you need to convert it into numerical data in some way. The resulting numerical vector is sometimes called an embedding.
Vectorization is also used as a building block for implementing RAG (Retrieval Augmented Generation).

Why is vectorization necessary?

First, let me explain what "vectorization" is and why you need it.

An unavoidable process when using generative AI, or deep learning in general

"Vectorization" is an important technology that is often "unavoidable" when utilizing data. It is needed when you want to feed non-numeric data to "generative AI," a hot topic these days, or more generally to "neural networks."

What image comes to mind when you hear the terms "generative AI" or "neural network"? They're called AI and are said to realize intelligence, so you might have a vague image of them doing something sophisticated and incomprehensible. However, in reality, they're simply something that "takes numbers in and outputs numbers."

Deep learning is also a type of neural network, and generative AI is also realized using neural networks. In other words, many of the things that are currently being talked about as "utilizing AI" are realized using neural networks.

Consequently, if you want to use your company's data with AI, you need to convert that data into numbers in some way.

- Neural networks take numbers as input.
- Generative AI and deep learning are neural networks.
- The data you want to input and process needs to be converted into numbers.

There is no problem if the data is numerical from the start, such as "today's temperature" or "last month's product sales." However, if you want to process data that is fundamentally different in nature from numbers, such as text data (character strings) or images, you cannot input it unless you first convert it into numbers.

"Vectorization" is the process of converting non-numeric data, such as text data, into a "sequence of numbers" (i.e., a vector).

- You want to process text data, image data, and so on, but only numbers can be input.
- So the data has to be converted into numbers somehow (vectorization).

In that sense, "vectorization" can be said to be an "unavoidable process" when utilizing data.
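
To make the idea of "a sequence of numbers" concrete, here is a minimal sketch of one very simple way to vectorize text: counting how often each known word appears (often called a bag-of-words representation). The sample texts and function names are made up for illustration, and real embedding methods are far more sophisticated than this.

```python
# Toy vectorization: each text becomes a list of word counts.
# The sample texts and helper names are hypothetical.

def build_vocabulary(texts):
    """Collect every distinct lowercase word, in sorted order."""
    words = set()
    for text in texts:
        words.update(text.lower().split())
    return sorted(words)


def vectorize(text, vocabulary):
    """Return one count per vocabulary word (a bag-of-words vector)."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


texts = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

print(vocabulary)
for text, vector in zip(texts, vectors):
    print(text, "->", vector)
```

Each text is now a fixed-length list of numbers, which is the kind of input a neural network can accept.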

Converting data into numbers can make calculations possible

There is also a positive meaning to "converting data into numbers." Converting data into numbers (or vector data) can sometimes enable "processing that is only possible with numbers."

Note, however, that data that has not been converted into numbers can still be processed. Even if the data remains text, you can count characters, count words, and perform searches (finding parts that match an entered string). Unless there are special circumstances, there is nothing inconvenient about leaving text data as text.

However, by converting text data into numbers, you can sometimes do surprising things that are not possible with text as-is. For example, when words are converted into numbers using "word2vec," a technique developed at Google, adding and subtracting the resulting vectors can sometimes behave like a calculation on their meanings:

King – Man + Woman = Queen

Suppose you convert the word "king" into a 100-dimensional vector (a collection of 100 numbers) and convert other words in the same way. It is known that subtracting the vector for "man" from the vector for "king" and adding the vector for "woman" gives a result that is roughly equal to the vector for "queen," as if the meaning itself were being calculated.

Sometimes it is possible to "convert to numbers" so that the meaning of a word can be calculated.
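
The arithmetic can be tried out with a small sketch. The three-dimensional vectors below are hand-made for illustration and are not real word2vec output; real embeddings have many more dimensions, and their individual numbers are not interpretable like this. The sketch computes king - man + woman and then looks for the remaining word whose vector points in the most similar direction.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not real word2vec output).
embeddings = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Element-wise king - man + woman.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Exclude the three input words, then pick the most similar remaining word.
candidates = {
    word: vector
    for word, vector in embeddings.items()
    if word not in ("king", "man", "woman")
}
nearest = max(candidates, key=lambda word: cosine_similarity(target, candidates[word]))
print(nearest)
```

With these toy values the nearest word is "queen". With real embeddings the match is only approximate, which is why the result is searched for by similarity rather than by exact equality.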

Some "amazing conversion processes" turn data into numbers with such mysterious properties, and using them can be very useful.

Calculating meaning through addition and subtraction is striking, but you may wonder whether it is useful in practice. In fact, vectorization methods with practical use cases have since been created, and as a result vectorization is gaining attention in real-world work.

A typical example is the many vectorization methods that have the property of "converting text data with similar meanings into similar numerical values." Such methods can be used to measure how similar two pieces of content are or to search for data with similar meanings.
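
One common way to express how "similar" two vectors are is cosine similarity, which compares their directions. The text here does not say which measure a given service uses, so treat this as one representative choice. For two n-dimensional vectors a and b:

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The value is 1 when the vectors point in the same direction and 0 when they are perpendicular. For example, for a = (1, 0) and b = (1, 1), the similarity is 1 / √2 ≈ 0.707. Cosine distance is simply 1 minus this value.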

Because an ecosystem for vectorized data is already in place

Because of this utility, an ecosystem is emerging around vectorized data.

For example, the development of "vector databases" as a platform for using vectorized data is progressing rapidly around the world (as of the time of writing). These databases store large amounts of vector data and support searches specific to vectors (such as finding the stored vectors with the smallest cosine distance to a query).
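
What a search by cosine distance does can be sketched with a tiny in-memory store. This is only an illustration under simplifying assumptions: the class name, document ids, and two-dimensional vectors are hypothetical, and it compares the query against every stored vector, whereas real vector databases use dedicated indexes to handle large volumes of data.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical brute-force store: keeps (id, vector) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    def search(self, query, top_k=2):
        """Return the ids of the top_k stored vectors closest to the query."""
        ranked = sorted(self._items, key=lambda item: cosine_distance(query, item[1]))
        return [item_id for item_id, _ in ranked[:top_k]]


store = TinyVectorStore()
store.add("doc-a", [1.0, 0.0])
store.add("doc-b", [0.0, 1.0])
store.add("doc-c", [0.9, 0.1])

results = store.search([1.0, 0.0], top_k=2)
print(results)
```

The query points in the same direction as "doc-a" and almost the same direction as "doc-c", so those two are returned ahead of "doc-b".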

⇒ Vector database | Glossary

There are also growing expectations for its use in specific use cases, such as finding text data with similar meanings, finding similar images, or finding the next product a person is likely to look for based on their purchase history.

Vectorization is becoming useful and important as a means of access to such technological ecosystems.

Humans cannot interpret the meaning of the numbers in vectorized data

For example, when Word2Vec converts the meaning of a word into a 100-dimensional vector, it outputs "100 numbers" (word embeddings). The converted result is simply a "mysterious series of numbers" for humans, and they cannot understand what those numbers mean. In most cases, the vectorized result is incomprehensible to humans.

Word2Vec itself is a neural network. It was trained on text data based on the hypothesis that the meaning of a word is determined by the words that appear around it. As is common with deep learning, it is difficult for humans to understand the internal processing and judgments that lead to the output numbers.

There is more than one kind of "meaning," and it is not understood the way a person understands it

Beyond Word2Vec, there are many other efforts to turn the meaning of words and text data into numbers. These are sometimes used in RAG, and because they appear to quantify meaning, they are sometimes regarded as magical technology. However, each of them simply converts data into numbers based on "certain assumptions." They do not understand meaning the way humans do.

For example, Word2Vec assumes that words with a similar "distribution of words appearing before and after them in a sentence" have similar meanings, and converts such words into similar numbers. Because it does not learn from dictionary-like resources, it may not handle antonyms (words with very different meanings but similar surrounding words) or polysemous words (words with several meanings that appear in varied contexts) well.
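
The antonym problem can be seen in a small sketch. The code below is not Word2Vec itself. It only counts which words appear next to a target word in a made-up four-sentence corpus, which is the kind of information the assumption above relies on. The corpus and function name are hypothetical.

```python
from collections import Counter

# Made-up corpus in which "hot" and "cold" are used in identical contexts.
corpus = [
    "the soup is hot today",
    "the soup is cold today",
    "the tea is hot now",
    "the tea is cold now",
]


def context_counts(target, sentences, window=1):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


hot_context = context_counts("hot", corpus)
cold_context = context_counts("cold", corpus)

print(hot_context)
print(cold_context)
print(hot_context == cold_context)
```

In this corpus "hot" and "cold" have exactly the same neighbouring words, so a method that looks only at surrounding words has no basis for telling them apart, even though their meanings are opposite.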

Rather than the meaning itself being extracted, it is simply converted into a number based on "certain assumptions," so it is also necessary to be aware of what assumptions are being made and what properties the numbers have.

How do we "vectorize" our data?

If you have text data and want to use it in a neural network or similar (or if you want to process the meaning), you will need to vectorize it using an appropriate method.

Basically, use external services

In most cases, rather than creating your own conversion process to vectorize, you use existing cloud services or existing software to perform the conversion process. In other words, you choose and use something that already exists.

Specifically, you either run publicly available software such as Word2Vec or call a vectorization function provided as a cloud service.

For example, OpenAI, the developer of ChatGPT, provides an API (OpenAI Embedding API) that vectorizes text data, allowing you to use various OpenAI services with the converted numerical data. In this way, by using the vectorization APIs provided by various cloud services, you can obtain numerical data that can be used with those cloud services.
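
As a rough sketch of what calling such an API looks like, the function below uses the official `openai` Python package. Several things here are assumptions rather than anything stated in this article: the package must be installed, an API key must be available in the `OPENAI_API_KEY` environment variable, and the model name `text-embedding-3-small` is just one example to be checked against the current OpenAI documentation. The `embed_texts` helper and its `client` parameter are hypothetical names introduced for this example.

```python
def embed_texts(texts, client=None, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text.

    `model` is an assumed example name; check the provider's documentation.
    """
    if client is None:
        # Imported lazily so the function can also be tried with a stand-in client.
        from openai import OpenAI

        client = OpenAI()  # reads the OPENAI_API_KEY environment variable
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


# Example call (requires network access and a valid API key):
# vectors = embed_texts(["Vectorization converts text into numbers."])
# print(len(vectors[0]))  # number of dimensions in the returned vector
```

Passing the client in as a parameter is a design choice that makes it easy to swap in a different provider or a stand-in during testing.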

Even with the same input, the converted results will differ depending on the vectorization service. You will need to vectorize the data according to your purpose or the values expected by the cloud service you are using.

It may be used in RAG

Vectorization is sometimes used in RAG as a technique to find documents with similar meanings.

For more information on RAG, see:
⇒ Retrieval Augmented Generation (RAG) | Glossary

When implementing RAG, the vectorized data has to be stored so that it can be searched later. Vector databases are databases designed specifically for storing and searching vectorized data for this kind of purpose.

What is necessary to utilize "vectorization"

We have seen that "vectorization" can be used as a means of feeding various kinds of data to neural networks, and of enabling the various kinds of processing that become possible once data is converted into numbers. To use vectorization in practice, you first need to create a situation where "data can be retrieved and vectorized as needed."

- Locate and load the original data.
- Apply the preprocessing appropriate to the service being used.
- Call the API of a cloud service that performs vectorization, pass it the data, and obtain the vectorized data as the result.
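
The three steps above can be sketched as a small pipeline. Everything named here is hypothetical: the records stand in for data loaded from a real source, the preprocessing (Unicode normalization and whitespace cleanup) is just one plausible choice, and `fake_embed` is a stand-in for a real vectorization API so that the example runs on its own.

```python
import re
import unicodedata


def preprocess(text):
    """Normalize Unicode forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def vectorize_records(records, embed):
    """Preprocess (record_id, raw_text) pairs and map each id to its vector.

    `embed` is any callable that takes a list of strings and returns one
    vector per string, for example a wrapper around a cloud API.
    """
    ids, texts = [], []
    for record_id, raw_text in records:
        cleaned = preprocess(raw_text)
        if cleaned:  # skip records that are empty after cleaning
            ids.append(record_id)
            texts.append(cleaned)
    vectors = embed(texts)
    return dict(zip(ids, vectors))


def fake_embed(texts):
    """Stand-in for a real API: returns [character count, word count]."""
    return [[len(t), len(t.split())] for t in texts]


# Stand-in for data loaded from a file, database, or cloud service.
records = [("memo-1", "  Quarterly   sales\nreport "), ("memo-2", "   ")]
result = vectorize_records(records, fake_embed)
print(result)
```

Keeping the vectorization call behind a single `embed` parameter means the loading and preprocessing code does not change when you switch services.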

Furthermore, we must create an environment in which vectorized data can be utilized.

- Retrieve the vectorized data as needed.
- Pass the data to a service that uses it, for example as input to a neural network or for processing the vectors directly, and obtain the results.

In many cases, data is scattered throughout a company in a variety of different formats. It is necessary to create a situation where the necessary data can be accessed. Furthermore, it is necessary to properly preprocess the data, invoke the appropriate vectorization process, and freely combine and invoke APIs that use vectorized data.

"RAG Swamp"

This kind of preparation to "vectorize data and make it usable" is also necessary when implementing RAG (Retrieval Augmented Generation).

Retrieval:
- Search a vector database based on the text entered by the user
- From the search results, obtain the internal documents related to the user's input (question)
Augmented generation:
- Input the user's text together with the contents of the retrieved documents into the generative AI (LLM) to obtain a response
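
The two stages above can be sketched end to end. This is a simplified illustration, not a production design: `toy_embed` (keyword counts) stands in for a real embedding service, `echo_llm` stands in for a real generative AI, the in-memory list stands in for a vector database, and the documents and prompt wording are made up.

```python
import math

KEYWORDS = ["vacation", "expense", "password"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Stand-in embedding: how often each keyword occurs in the text."""
    lowered = text.lower()
    return [float(lowered.count(keyword)) for keyword in KEYWORDS]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a vector with no keywords matches nothing
    return dot / (norm_a * norm_b)


def retrieve(question, index, embed, top_k=2):
    """Retrieval stage: return the texts most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return [entry["text"] for entry in ranked[:top_k]]


def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one LLM input."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


def answer(question, index, embed, generate, top_k=2):
    """Generation stage: pass question plus retrieved passages to the LLM."""
    passages = retrieve(question, index, embed, top_k)
    return generate(build_prompt(question, passages))


def echo_llm(prompt):
    """Stand-in for a real LLM call: just returns the prompt it received."""
    return "ECHO:\n" + prompt


documents = [
    "Vacation requests need manager approval.",
    "Expense reports are due at the end of each month.",
    "Passwords can be reset from the portal.",
]
index = [{"text": text, "vector": toy_embed(text)} for text in documents]

question = "How do I submit an expense report?"
reply = answer(question, index, toy_embed, echo_llm, top_k=1)
print(reply)
```

Because the stand-in LLM only echoes its input, the printed reply shows exactly what a real model would be given: the question plus the single most relevant document.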

⇒ Retrieval Augmented Generation (RAG) | Glossary

To enable the above search process, internal documents must be vectorized in some way so that they can be searched by vector.

- Import the target internal documents located throughout the company.
- If necessary, preprocess the document data and its contents so that the LLM can understand them more easily.
- Divide the data into the units to be vectorized. You need to consider carefully how to split it and what to treat as a single block.
- Call the API of a cloud service that performs vectorization, pass it the data, and obtain the vectorized data.
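
The splitting step is where many design decisions hide. As one possible approach (an assumption for illustration, not a recommended setting), the sketch below splits text into fixed-size groups of words, with a few words of overlap so that content cut at a boundary still appears whole in at least one chunk. The function name and default sizes are hypothetical.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words, split into chunks of 4 words with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Other choices, such as splitting by paragraph, heading, or sentence, are equally valid, and which one works best depends on the documents.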

It would be no problem if we could create a fully functioning RAG system simply by following the steps above (or by following widely established RAG implementation procedures). However, unfortunately, at the time of writing, this is not often the case.

Vectorization (and RAG) attracts high hopes as an amazing technology that can "search for documents with similar meanings." In practice, however, a system built this way often runs into problems such as "RAG does not provide a practical response" or "vector search does not provide sufficient search accuracy," and you end up stuck in continual, painstaking "dirty tuning."

This situation is sometimes called the "RAG swamp." In an effort to somehow improve accuracy, you may be forced to repeatedly try things like "adding new preprocessing," "trying a different preprocessing method," "trying different units for splitting the documents," or "trying a different vectorization service," which means repeating data integration and data processing work by trial and error.
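
One way to make this trial and error less of a guessing game is to keep a small set of test questions with known correct documents and measure each configuration against it. The sketch below is an illustration under stated assumptions: the questions, document ids, and the two canned result tables (standing in for two different preprocessing or chunking configurations) are all made up, and the "hit rate" used here is only one of several possible measures.

```python
def hit_rate_at_k(test_cases, search, k=3):
    """Fraction of questions whose expected document is in the top-k results.

    `test_cases` is a list of (question, expected_document_id) pairs and
    `search` is a callable returning a ranked list of document ids.
    """
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_id in test_cases:
        if expected_id in search(question)[:k]:
            hits += 1
    return hits / len(test_cases)


# Hypothetical test set: each question and the document that should be found.
test_cases = [("q1", "d1"), ("q2", "d2")]

# Canned search results standing in for two different configurations.
results_config_a = {"q1": ["d1", "d2"], "q2": ["d3", "d2"]}
results_config_b = {"q1": ["d2", "d4"], "q2": ["d2", "d4"]}

score_a = hit_rate_at_k(test_cases, lambda q: results_config_a[q], k=2)
score_b = hit_rate_at_k(test_cases, lambda q: results_config_b[q], k=2)
print("config A:", score_a)
print("config B:", score_b)
```

Here configuration A finds the expected document for both questions and configuration B for only one, so the change from A to B would count as a regression.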

Furthermore, data itself is often scattered across the company in a variety of formats (the documents required for RAG to generate an appropriate response to "that input" may not be in an easily accessible location or in an easily usable format), so ensuring sufficient access to the company's data in the first place can be a daunting task in itself.

Connecting technologies that are useful in creating an environment for utilizing data

In other words, in order to utilize the potential of "vectorization," it is necessary to create a "data utilization environment" that can effectively utilize the data scattered throughout the company, and also to create an environment that can smoothly and efficiently link and utilize cloud services and the like.

Please utilize "connecting" technology

There is a way to efficiently build these capabilities using only a GUI: connecting to data on a wide variety of systems and clouds, reading, processing, and transferring data as needed, and improving the data environment. This is made possible by "connecting" technologies such as "DataSpider" and "HULFT Square," which belong to the categories known as "EAI," "ETL," and "iPaaS."

Can be used with GUI only

Unlike regular programming, there is no need to write code. By placing and configuring icons on the GUI, you can achieve integration with a wide variety of systems, data, and cloud services.

Being able to develop using a GUI is also an advantage

No-code development using only a GUI may seem like a simple compromise compared to full-scale programming. However, being able to develop using only a GUI allows on-site personnel to proactively work on cloud integration themselves. On-site personnel are the ones who know the business best.

Full-scale processing can be implemented

There are many products that claim to allow development using only a GUI, but some people may have a negative impression of such products as being too simple.

It is true that such products tend to draw complaints such as "it's easy to build with, but it can only do simple things," "when I tried to run a full-scale process, it could not handle it and crashed," or "it lacked the reliability and stable operation needed to support business operations, which caused problems."

"DataSpider" and "HULFT Square" are easy to use, but also allow you to create processes at the same level as full-scale programming. They have the same high processing power as full-scale programming, as they are internally converted to Java and executed, and have a long history of supporting corporate IT. They combine the benefits of "GUI only" with full-scale capabilities.

No need to operate in-house as it is iPaaS

DataSpider can be operated securely on a system under your own management. With HULFT Square, a cloud service (iPaaS), this "connecting" technology itself can be used as a cloud service without the need for in-house operation, eliminating the hassle of in-house implementation and system operation.

Related keywords (for further understanding)

Keywords related to data integration and system integration

EAI
- It is a concept of "connecting" systems by data integration, and is a means of freely connecting various data and systems. It is a concept that has been used since long before the cloud era as a way to effectively utilize IT.
ETL
- In the recent trend of actively working on data utilization, the majority of the work is not the data analysis itself, but rather the collection and preprocessing of data scattered around, from on-premise to cloud. This is a means to carry out such processing efficiently.
iPaaS
- A cloud service that "connects" various clouds with external systems and data simply by operating on a GUI is called iPaaS.

Are you interested in "iPaaS" and "connecting" technologies?

Try out our products that allow you to freely connect various data and systems, from on-premise IT systems to cloud services, and make successful use of IT.

The ultimate "connecting" tool: data integration software "DataSpider" and data integration platform "HULFT Square"

"DataSpider," a data integration tool developed and sold by our company, is a "connecting" tool with a long history of success. "HULFT Square," a data integration platform, is a "connecting" cloud service developed using DataSpider technology.

Another feature is that development can be done using only the GUI (no code) without writing code like in regular programming, so business staff who have a good understanding of their company's business can take the initiative to use it.

Try out the "connecting" technology of DataSpider / HULFT Square:

There are many simple collaboration tools on the market, but this tool can be used with just a GUI, is easy enough for even non-programmers to use, and has "high development productivity" and "full-fledged performance that can serve as the foundation for business (professional use)."

It can smoothly solve the problem of "connecting disparate systems and data" that hinders successful IT utilization. We offer a free trial version and regularly hold hands-on sessions where you can try it out for free, so we hope you will give it a try.

Free product introduction seminar

Why not try a PoC to see if "HULFT Square" can transform your business?

Why not try verifying how "connecting" can be utilized in your business, the feasibility of solving problems using data integration, and the benefits that can be obtained?

- We want to automate data integration with SaaS, but first want to confirm that it is feasible.
- We want to move forward with data utilization, but we have issues with system integration.
- We want to consider a data integration platform to achieve DX.

PoC Program | HULFT Square