What will happen to "generative AI/LLM" as the increase in training data reaches its limit?


This is Watanabe from the marketing department.
This column casually covers a variety of topics related to data, IT, and more.

Data for training generative AI is starting to run out

Generative AI such as ChatGPT has been a hot topic recently, but the "data" used to train the underlying large language models (LLMs) is reportedly starting to run short.

Conversational generative AI such as ChatGPT is built on "large language models (LLMs)." Google is developing its own generative AI, and others, such as Elon Musk, are developing generative AI (Grok) that can be used with X; in essence, these companies have been competing to develop LLMs.

⇒Large Language Model (LLM) | Glossary

What is a large-scale language model?

Please refer to the above for details, but the defining characteristic of large language models is that they are huge models trained on an "enormous amount of data."

For example, ChatGPT behaves as if it were conversing with a human, yet it was never designed or explicitly trained to perform such complex tasks. It was trained only on the simple task of predicting "which word (token) comes next?" over an almost unimaginably large amount of data, and somehow that alone produced behavior resembling human conversation.

The key point is that it is a huge model trained on an "incredibly large amount of data," which is exactly why it is called a "large" language model.
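To make the "predict the next token" idea concrete, here is a deliberately tiny sketch. This is not how ChatGPT is actually trained (real LLMs use neural networks over trillions of tokens); it is just a toy bigram counter over an assumed ten-word corpus, showing that "next-word prediction" by itself is a very simple task.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration only)
corpus = "the cat sat on the mat . the cat ate".split()

# Count which token follows which token
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the token most often seen after `token` in the corpus."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice, "mat" once)
```

An LLM does essentially this same task, only with a learned probability distribution instead of raw counts, and at a scale where surprising conversational behavior emerges.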

Scaling Laws and "Power Laws"

The development race among companies is heating up, and behind it lies the finding that simply increasing the scale of training (the amount of data and the size of the model) improves accuracy and brings out new capabilities.

For example, ChatGPT can translate, even though it was never trained on a translation task. It became able to do something it was not explicitly taught, simply because the amount of data was increased. This was a surprising development. Furthermore, it was found that this emergent phenomenon, new capabilities appearing, could be expected to keep occurring as the amount of data grows.

This meant the "race to scale" was only going to heat up. Doubling the data improved performance and revealed new capabilities, and doubling it again yielded similar gains. So the idea was to prepare massive amounts of data, scale up as much as possible, and get ahead of the competition.

Furthermore, because performance improves predictably with scale, the race became a "battle of investment with predictable returns." Companies competed to develop huge LLMs ahead of their rivals by pouring in large amounts of money, data, and expensive GPUs, consuming enormous amounts of electricity to run staggering computations.
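The "predictable returns" intuition comes from the power-law shape of scaling laws: loss falls roughly as (Dc / D)^alpha in the amount of data D. The constants below are made up for illustration, not measured values from any real model; the point is only that every doubling of data buys the same fixed multiplicative improvement.

```python
# Illustrative scaling-law sketch: loss(D) = (Dc / D) ** alpha.
# Dc and alpha are hypothetical constants chosen for demonstration.
Dc, alpha = 5e13, 0.095

def loss(D):
    """Predicted loss as a power law in dataset size D (tokens)."""
    return (Dc / D) ** alpha

for D in (1e9, 1e10, 1e11, 1e12):
    print(f"D = {D:.0e} tokens -> loss = {loss(D):.3f}")

# Each doubling multiplies loss by the same constant factor 2**-alpha,
# which is what made returns on investment feel predictable.
print(f"per-doubling factor: {2 ** -alpha:.3f}")
```

That constant per-doubling factor is the whole economics of the scaling race: spend twice as much, get a known, repeatable improvement.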

The inevitable lack of data

However, this approach of constantly increasing the amount of data was bound to hit a wall eventually: if you keep doubling and doubling again, you soon reach a point where there is not enough data even if you use all the text on Earth.

Efforts to prepare new data are ongoing, but it is simply not possible to keep producing enough fresh data to feed a demand that doubles again and again. Recently, people have begun to say that "the amount of training data can no longer be increased as before."
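A bit of back-of-the-envelope arithmetic shows why doubling runs out so fast. Both numbers below are assumptions purely for illustration: a starting training set of 10^12 tokens and a hypothetical ceiling of 10^15 tokens of usable human text.

```python
# How many doublings until an assumed starting dataset (1e12 tokens)
# exceeds an assumed ceiling of usable text (1e15 tokens)?
# Both figures are hypothetical, chosen only to show the arithmetic.
tokens, ceiling = 1e12, 1e15
doublings = 0
while tokens < ceiling:
    tokens *= 2
    doublings += 1
print(doublings)  # -> 10 (a thousand-fold ceiling survives only ~10 doublings)
```

Exponential demand against a finite pool: even a ceiling a thousand times larger than today's datasets is exhausted in about ten doublings, which is why the wall was always inevitable.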

The end of the battle to "increase training data"

If the race to increase data volume has reached its limit, we will have no choice but to move in a different direction. In other words, the future of generative AI such as ChatGPT will no longer proceed by the same rules as before.

From now on, further development will have to come from directions other than increasing the amount of data. Unless groundbreaking approaches emerge that take the place of scaling laws and change the situation, generative AI may keep improving slowly but enter an era of stagnation, without the dramatic progress seen to date.

Why recent generative AI has started to promote itself as capable of "deep thinking"

Recent generative AIs, such as ChatGPT o1, have been marketed as having deep logical thinking. We are also seeing other companies' generative AIs use slogans like "thinking more deeply" and "taking time to think."

The specific mechanism behind "ChatGPT o1" has not been made public, but it is believed to be a new approach that aims to make the machine smarter by "making it think more" rather than by increasing the amount of data.

Rather than having the large language model immediately generate an answer by predicting "which word comes next?", it appears to run an internal, multi-stage thought process, producing the final answer only after working through the question many times.

Technically, it is believed that "deep thinking" is achieved using a mechanism similar to "Chain-of-Thought prompting" (a technique that makes generative AI think by following a specific thought process).

Chain-of-Thought Prompting | Glossary

To give an image: it is like making a student think carefully in multiple stages, first sorting out what has been asked before answering, then deciding how to approach the problem before actually working through it, then verifying whether the resulting answer is appropriate, and repeating the process until a good answer emerges.
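The staged process above can be sketched as a Chain-of-Thought-style prompt. This is not the (unpublished) internal mechanism of o1; it is just the classic prompting version of the same idea, and `build_cot_prompt` is a hypothetical helper, with the actual LLM call left out.

```python
# A minimal Chain-of-Thought prompting sketch.  The technique lives
# entirely in how the prompt is phrased; sending it to a real LLM API
# is omitted here.

def build_cot_prompt(question: str) -> str:
    """Wrap a question in staged 'think before you answer' instructions."""
    return (
        "Answer the question below.\n"
        "First restate what is being asked.\n"
        "Then reason through it step by step.\n"
        "Finally, check your answer before giving it.\n\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )

prompt = build_cot_prompt(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
print(prompt)
```

"Deep thinking" models appear to bake this kind of staged reasoning into the model's own generation loop, rather than leaving it to the user's prompt.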

As the competition advances in this direction, when you ask ChatGPT a question, it may trigger a large amount of "thinking," which will be combined to generate an answer. The answer you see may be the result of ChatGPT's 10,000 internal thoughts.

An outrageous era in which even "all the text data on Earth" is being mined

It has long been said that we are in a post-digital-transformation era, an age of data unlike any before. But even when you hear such things, haven't you somehow treated them as mere slogans, a matter of mindset?

In reality, however, the race to develop generative AI has reached the point where "all the text data available on Earth may already have been mined." In other words, it is actually happening. It is a little frightening to realize that the "battle for world supremacy" on the far side of the cloud is being fought on such an outrageous scale.

Currently, it seems like the competition is on "deep thinking," but it's difficult to predict how generative AI will develop in the future. It's also unclear whether the current situation of externally using generative AI as a huge cloud service will continue. Five years from now, it may be impossible to predict what it will be like to use generative AI to utilize data.

Even if we don't know what the future holds, the data itself will continue to exist. I believe that LLMs will continue to be used in new ways even as the situation changes. Therefore, having the ability to quickly change how we use data and LLMs in response to "new situations" will help us prepare for the future.

If, when something new arises, you can quickly rebuild linked processes in a GUI, and freely "connect" systems and data as needed, you should be able to stay ahead of other companies through the changes to come. We hope you will use our "connecting" products to navigate the times ahead.

The person who wrote the article

Affiliation: Marketing Department, Digital Marketing Division

Ryo Watanabe

・2017: Transferred from Appresso Co., Ltd.
After majoring in information engineering (artificial intelligence lab) at university, I struggled in the development department of a startup.
・Small and medium-sized enterprise management consultant (as of 2024)
・Image: I took over the "Fukusuke" name that was previously used by our company.
(Affiliations are as of the time of publication)
