Machine Learning


This glossary explains various keywords that will help you understand the mindset necessary for data utilization and successful DX.
This time, let's take a look at machine learning, which has been attracting a lot of attention recently.

What is machine learning?

Machine learning is a technology by which a computer automatically learns patterns hidden in data from given sample data.
The term covers a range of techniques that let computers autonomously perform tasks, such as making judgments based on given data, without humans writing programs or setting parameters by hand.

Currently, when the term "AI" is used in society, it almost always refers to machine learning. Machine learning is also a key element of so-called data utilization: efforts to achieve results by putting data to work.

When people talk about "AI," they are (mostly) referring to machine learning

The term "AI" is currently a hot topic in society. On the other hand, some people may not be familiar with the term "machine learning." However (as of this writing), when "AI" is discussed in society, in most cases it is actually machine learning that is being discussed. In other words, if you understand machine learning, you can understand what the "AI" people are talking about really is and what it is likely to become.

The "AI" everyone is talking about is essentially "machine learning"
(Once you understand machine learning, you will also understand so-called "AI.")

How is machine learning different from conventional programming?

So what is machine learning and why is it such a hot topic?

In the past, when you wanted a computer to do something, you programmed it: you developed an application. Building your own Excel spreadsheet to get the functions you need is one familiar example. In essence, a person thought through logically how the computer should behave, set that behavior up, and the computer then operated accordingly.

However, machine learning uses computers in a different style. You prepare "data" and have the computer "learn" that data to make it behave in the way you want it to. The big difference is that it doesn't rely on programming.

  • Normal application
    • A programmer writes a program, which defines how the computer will operate, creating an application with the desired functionality.
  • Machine Learning
    • Prepare the data and have the machine learning engine "learn" it to perform the desired action.
    • Humans do not instruct computers on specific behaviors. Computer behavior is generated by learning from data, not by programming.

It is said that we are entering the age of data. Machine learning, which can create IT systems that do something by utilizing prepared data, is a means of extracting value from data and can be said to be an important element in the age of data.

What exactly is machine learning?

So, what exactly does it mean to "prepare data" and "learn" from the data in order to use a computer?

Machine learning can be classified into several types. There are various ways to classify it, but it is often divided roughly into the following three, so I will explain each with examples.

Supervised Learning

This is a type of machine learning in which training data is provided as a set of "sample data (example input)" and "correct answer (example of desired output corresponding to the input)."

For example, if you want to determine whether a fruit in an image is an "apple" or an "orange," you would prepare a set of input data: "image data" and the corresponding output result, "apple or orange." The system would then learn decision rules from this data, allowing it to automatically determine whether an image is an "apple" or an "orange."

  • Example: I want to classify given image data as "apple or orange"
    • Training data
      • Input data: image data
      • Output data (labels): the desired judgment for each input image,
        such as "this is an apple" or "this is an orange"
    • The machine learning engine is fed this data and trained.
    • The result is a trained model that, given image data, can answer whether it shows an apple or an orange.

In this way, we prepare pairs of "input data" and the "output data (labels)" we want produced for that input, and use them for training.

Supervised learning can be used not only for classification tasks, such as distinguishing between oranges and apples, but also for tasks that output numerical values. For example, it can be used for tasks such as inputting real estate property information and outputting the expected rent. In this case, the property information and the actual rent for that property are given as a set for training.

  • Example: I want to "output an estimated rent" from given property information
    • Training data
      • Input data: various property information (floor plan, age of the building, etc.)
      • Output data (labels): the actual rent of each property
    • The machine learning engine is fed this data and trained.
    • The result is a trained model that, given property information, answers with an estimated rent.
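The supervised-learning flow described above can be sketched in a few lines of code. This is a toy illustration, not a real system: the features (redness and diameter) and their values are invented, and the "learning" is a simple nearest-neighbor rule that answers with the label of the most similar training sample.

```python
import math

# Toy training set: each sample is (features, label).
# Features are invented for illustration: (redness 0-1, diameter in cm).
training_data = [
    ((0.9, 8.0), "apple"),
    ((0.8, 7.5), "apple"),
    ((0.3, 9.5), "orange"),
    ((0.2, 10.0), "orange"),
]

def predict(features):
    """Classify by the label of the nearest training sample (1-NN)."""
    def distance(sample):
        return math.dist(sample[0], features)
    nearest = min(training_data, key=distance)
    return nearest[1]

print(predict((0.85, 7.8)))  # near the apple samples -> "apple"
print(predict((0.25, 9.8)))  # near the orange samples -> "orange"
```

The point is the division of labor: humans supply labeled pairs, and the decision rule is derived from them rather than being written by hand.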

An important caveat of supervised learning is that simply having data is often not enough: the "correct answer" outputs must also be prepared, typically by humans, and this labeling work can be laborious.

Unsupervised learning

This is a type of machine learning where the model is given only the data and autonomously learns the properties of the data from the data itself. No desired output is provided. This makes the training data easier to prepare and makes it easier to learn from large amounts of data, but the tasks that can be performed differ from those of supervised learning.

It is used for tasks such as autonomously classifying given data into several groups based on similarities, or autonomously finding trends that can be seen across the data.

  • Example: Classifying customers into groups based on purchasing data
    • Prepare customer purchase data
    • Feed the data to the machine learning engine and let it learn
      • The customers are automatically clustered into, say, three groups.
    • People can then look at the results and interpret them: "this group is students shopping on the way home from school," "this group is probably homemakers from the neighborhood," "the last group seems to be people who work nearby and come in for lunch."
    • New customer data can now be classified automatically into one of these groups.
  • Example: Finding purchasing patterns in customer purchase data
    • Prepare customer purchase data
    • Feed the data to the machine learning engine and let it learn
      • "Of visitors who came to Osaka by Shinkansen, 70% also visited Kyoto."
      • "Of people who visited northern Osaka, 30% returned; of those, 80% also visited southern Osaka."
    • Such patterns can feed a system that predicts a customer's next action and offers appropriate support or automated sales.
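The clustering example above can be sketched with a minimal k-means, one common unsupervised algorithm. Everything here is invented for illustration: the customer features (average basket size, visits per week) and the two obvious groups hidden in them. Note that no labels are given; the grouping emerges from the data alone.

```python
import random
from statistics import mean

# Invented purchase features per customer: (average basket size, visits per week)
customers = [
    (2, 5), (3, 6), (2, 6),      # frequent, small purchases
    (12, 1), (14, 1), (13, 2),   # infrequent, large purchases
]

def kmeans(points, k, steps=10):
    """A minimal k-means: group points around k centroids, no labels needed."""
    random.seed(0)
    centroids = random.sample(points, k)
    for _ in range(steps):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            groups[i].append(p)
        # Move each centroid to the mean of its group.
        centroids = [
            (mean(p[0] for p in g), mean(p[1] for p in g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

for group in kmeans(customers, k=2):
    print(group)
```

As in the text, the algorithm only produces the groups; deciding what each group *means* ("students," "people coming for lunch") is still a human interpretation step.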

Reinforcement learning

This is a type of machine learning that autonomously tries to learn by trial and error, and then learns from the results. Rather than a human giving the correct answer, the computer tries to act autonomously and learns based on whether the results were good or bad.

For example, if you want a computer to learn shogi, you can start by having computers play against each other (however poorly) and decide a winner. Each game yields a specific "series of moves" and the "result" (win or loss) of those moves: in other words, one set of actual trial data.

The key point is that an outcome has, for now, been decided, and this can serve as a basis for judgment. Even if it is unclear exactly which of the winner's moves were good, the system can judge that "the series must have included some moves close to correct," apply positive learning to the winner's entire series of moves, and negative learning to the loser's. By repeating such trials, the system can make learning progress autonomously.

  • Example with shogi
    • Create an environment where computers can play shogi against each other
    • Prepare a bot that can play shogi, even if only randomly, and have the computers play each other
    • The winner's play must have contained some good moves, so the factors that led to it are "reinforced"
    • The loser's play is assumed to contain some bad moves, even if the exact reason is unknown, so its behavior is not reinforced, or is suppressed through negative learning
    • After a huge number of games of self-play, an AI that gradually becomes stronger at shogi emerges
  • Example with a physical robot
    • Create an environment where the robot can act through trial and error (one where it can act repeatedly without breaking down)
    • At first, the robot just moves randomly
    • Eventually some desirable outcome occurs by chance, and the behavior that led to it is "reinforced" in the same way
    • After repeated trial and error, robots that at first could not even move become able to walk, and soccer robots learn teamwork tactics and score goals

Humans do not need to provide correct answers for the system to learn to produce the desired results. Instead, an environment must be created in which the system can learn autonomously through trial and error. Learning can also require a very large number of trials, which can make it inefficient.
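The trial-and-error loop described above can be illustrated with Q-learning, a classic reinforcement learning algorithm, in a toy environment invented for this sketch (a five-cell corridor rather than shogi). No correct answers are supplied; only a reward signal at the goal.

```python
import random

# A toy environment, invented for this sketch: a corridor of 5 cells.
# The agent starts in cell 0; reaching cell 4 gives reward 1.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # step left or step right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# Q-learning: estimate the value of each (state, action) pair purely
# from trial and error, with no human-provided correct answers.
random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore occasionally; otherwise take the best-known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Reinforce actions in proportion to the outcome they led to.
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy: in every cell, walk right toward the goal.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)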

A famous example is AlphaGo, developed by Google DeepMind, which defeated top human Go players: a remarkable achievement of reinforcement learning. By having the AI play against itself an astronomical number of times, they created a Go AI that could beat humans.

There are other types besides the above

There are other types besides those mentioned above. For example, "supervised learning" and "unsupervised learning" are sometimes used in combination.

There are various other types of learning. In "semi-supervised learning," supervised learning is performed with correct answers assigned to only a portion of the data, and the remaining data is used without labels. In "self-supervised learning," training inputs and correct answers are generated automatically from the prepared data itself. The currently popular large language models (LLMs) are built using self-supervised learning. Here is a brief introduction.

  • Example of semi-supervised learning:
    A large amount of image data, of which only a portion is labeled "this is an apple" or "this is an orange"
    • Unsupervised learning is performed on the full set of images, common features are extracted, and each image is tagged with which features it has.
    • Supervised learning is then performed using the labeled images together with the feature tags.
      • The feature tags make the supervised learning more effective.
  • Example of self-supervised learning (as in a large language model):
    Prepare a very large amount of text data
    • By masking part of the text and asking "what word comes next?" or "what word fills this gap?", training data with correct answers (labels) can be generated automatically.
    • Supervised-style learning can then be performed on huge amounts of data without anyone preparing the correct answers.

ChatGPT is often explained as "predicting the next word"; the reason it is built that way is that next-word prediction allows self-supervised learning to be used.
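The "generate labels from the data itself" idea can be shown concretely. This sketch derives next-word training pairs from a throwaway sentence; a real language model does essentially the same thing at vastly larger scale.

```python
# Self-supervised learning builds its own labeled pairs from raw text.
# Here we generate "predict the next word" training pairs with no
# human labeling at all.
text = "machine learning extracts value from data and data drives machine learning"
words = text.split()

# Each training pair is (context words, correct next word); the "label"
# comes for free from the text itself.
context_size = 3
pairs = [
    (tuple(words[i:i + context_size]), words[i + context_size])
    for i in range(len(words) - context_size)
]

for context, target in pairs[:3]:
    print(context, "->", target)
```

This is why sheer quantity of text is so valuable for LLM training: every sentence yields many labeled examples automatically.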

Why not just call it "AI"?

I wrote that what people call AI is machine learning. If so, why bother saying "machine learning" when "AI" would be easier for everyone to understand? There are several reasons.

The definition of the term "AI" is unclear

What is AI? Indeed, before even asking about AI (artificial intelligence), what is intelligence in the first place? This is a difficult question with many competing definitions and ongoing debate, and it has never been settled. The more expert someone is, the more careful they tend to be with words, so experts rarely use the term "AI," whose meaning is unclear.

Because there is "AI" that is not machine learning

There are efforts to create applications that behave intelligently using methods other than machine learning.

  • Logic-based approach:
    For example, given the logical statements "human(x) → mortal(x)" ("all humans are mortal") and "human(Socrates)" ("Socrates is human"), a computer can automatically infer the answer "mortal(Socrates)" ("Socrates is mortal"). Much of the earliest AI research was of this type.
  • Rules-based approach:
    These systems encode experts' know-how as a set of judgment rules (similar to IF statements); what used to be called expert systems are of this type. For example: "if the flame is almost transparent, the temperature inside the furnace is likely above 1800 degrees," or "as long as there is a crackling sound, the material is likely not yet melted." The system compares such rules against observed data to make judgments.

For example, if you can write down business rules and automate business decisions based on them, you are making practical intelligent decisions. Although this is not machine learning, it can also be considered a type of AI.
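The rules-based approach can be sketched directly as code, with no learning involved. The rules below paraphrase the furnace example above; the observation fields and exact wording are invented for illustration.

```python
# A rules-based "AI" needs no training data: experts' knowledge is
# written directly as rules and applied to observed data.

def judge_furnace(observation):
    """Apply hand-written expert rules to observed furnace data."""
    findings = []
    if observation.get("flame") == "almost transparent":
        findings.append("temperature likely above 1800 degrees")
    if observation.get("sound") == "crackling":
        findings.append("material likely not yet melted")
    return findings or ["no rule matched"]

print(judge_furnace({"flame": "almost transparent", "sound": "crackling"}))
```

Contrast with machine learning: here a human wrote every judgment explicitly, so the system's behavior is fully specified by programming, not derived from data.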

For these reasons, I think using the term "machine learning" makes it easier for others to see that you really understand what you are talking about.

Using machine learning in real business is harder than it looks

Next, let's consider how machine learning can actually be used in business.

It is often assumed that once you prepare data, you can simply train on it and use AI (machine learning) in your work. Under that assumption, adoption may not seem very hard. In practice, however, many challenges arise beyond the machine learning itself.

"Preparing the data" is difficult

To use machine learning, you first need to prepare data. Perhaps unexpectedly, this often takes considerable effort. A "data integration platform," discussed later, can sometimes ease the burden.

  • The data required for learning is often scattered throughout the company, making it difficult to gather it.
  • The content and format of the data are often inconsistent, so it is necessary to preprocess and organize the data before learning.
  • Preparing the labels needed for supervised learning is a problem in itself

Training a model is itself difficult

Naturally, training a model on data is also hard: it can require highly skilled engineers and a computing platform with substantial processing power.

It is often impractical to train from scratch the model you need, so you may need to use cloud services that make machine learning easy, or find ways to reuse existing trained models effectively.

  • AutoML
  • Fine Tuning
  • Transfer Learning
  • In-Context Learning
  • RAG (Retrieval-Augmented Generation)

A trained model is useless until it is incorporated into business operations, and that is difficult too

Suppose you have managed to collect the data and build a trained model. A hard part still remains: obvious as it may sound, you will not see results unless the trained model is incorporated into actual work.

  • Making a judgment requires input data
    • A mechanism is needed to retrieve data from business systems and feed it to the model so that machine learning can make business judgments.
    • If the data needed for a judgment does not exist, a mechanism to acquire it must be built.
  • Turning the judgment results (output data) into business outcomes requires a mechanism for using them
    • Even once useful judgments can be made, using them in business requires mechanisms that automatically feed the results into business systems and deliver them to the people doing the work.

In other words, using machine learning (trained models) in business operations requires system development to link them with existing business systems.
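The integration work described above is mostly plumbing, which can be sketched as follows. Every function here is a hypothetical stand-in: fetch_order() for pulling input out of a business system, predict() for a trained model's judgment (replaced here by a fixed rule), and push_decision() for feeding the result back into a workflow.

```python
# Sketch of the "system development" around a trained model:
# input data in, judgment made, result delivered back to the business.

def fetch_order(order_id):
    # Stand-in for retrieving input data from a business system.
    return {"order_id": order_id, "amount": 120000}

def predict(features):
    # Stand-in for a trained model's judgment (here: a fixed rule).
    return "needs manual review" if features["amount"] > 100000 else "auto-approve"

def push_decision(order_id, decision):
    # Stand-in for writing the result back into a business workflow.
    print(f"order {order_id}: {decision}")

# The integration work is wiring the three together.
order = fetch_order("A-001")
push_decision(order["order_id"], predict(order))
```

Each of the three stand-ins hides real effort: connectors to business systems, model hosting, and notification or workflow mechanisms for the people and systems downstream.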

"Iterative improvement" is often necessary

Furthermore, it is often necessary to make an effort to "iteratively improve" what has been created.

  • What you build often fails to produce satisfactory results on the first try.
  • You may have to start over from data collection and training.
  • Where and how machine learning should be applied in a business to produce good results is often impossible to know without using it in real operations and iterating through trial and error.

This means that all the hard work you've done up to this point may have to be redone over and over again.

"Re-collecting data" means the effort of gathering data found to be necessary only after testing, in order to improve performance, and then starting over from training. It includes the hard case where the needed data does not exist at all, so you must first build a system to collect it.

Trial and error in working out how to integrate with business processes means the system that links machine learning to business systems may itself have to be rebuilt repeatedly until results are achieved. This too takes time and effort.

The "connecting" technology necessary to put machine learning to practical use in the data era

Machine learning, which can produce results from data, is a crucial technology in the "age of data." If you are thinking about DX or data utilization, the use of machine learning is a topic you should definitely consider.

However, as mentioned above, utilizing machine learning is quite a time-consuming process. It requires collecting data, learning from it, and combining it with internal and external IT systems, as well as repeated trial and error to determine how to achieve results.

What is needed is the ability to "connect"

For this reason, a head-on attempt to use machine learning may produce no results. For example, I think this is exactly why we so often see a PoC carried out successfully, yet the results never deployed in practice and the project left as is. That is how much effort is involved in everything other than the machine learning itself.

So should you give up because your company lacks highly skilled machine learning engineers? In fact, there is a way to get started with machine learning even so.

First, regarding machine learning itself: cloud services (such as AutoML) that let you use machine learning without specialized knowledge are maturing, and you can consider using them. As for the work other than machine learning itself, much of it can be managed if you have a broadly applicable way to link data and IT systems.

Obtaining training data from within the company is "data integration"; preprocessing the data for training is "data processing"; and linking the trained model with internal systems is "system integration." In other words, if you can "connect" data and systems, you can meet many of these needs.

Please utilize "connecting" technology

There are ways to efficiently develop these various "integration processes" using only a GUI. These are "connecting" technologies known as "EAI," "ETL," and "iPaaS," such as "DataSpider" and "HULFT Square." By utilizing these, new and old systems can be integrated smoothly and efficiently.

Can be used with GUI only

Unlike regular programming, there is no need to write code. By placing and configuring icons on the GUI, you can achieve integration with a wide variety of systems, data, and cloud services.

Being able to develop using a GUI is also an advantage

No-code development using only a GUI may seem like a simple compromise compared to full-scale programming. However, being able to develop using only a GUI allows on-site personnel to proactively work on cloud integration themselves.

The people who understand the business best are the people on the front lines. These "people who know best" can think for themselves about how to use trained models, what data is missing, and steadily develop what needs to be realized. This is far superior to a situation where development can only be carried out by explaining and asking engineers for help every time something needs to be done.

Full-scale processing can be implemented

There are many products that claim to allow development using only a GUI, but some people may have a negative impression of such products as being too simple.

It is true that problems like these tend to occur: "it is easy to build, but can only do simple things"; "when we tried a full-scale process, it could not cope and crashed"; "it lacked the reliability and stable operating capacity to support business operations, causing trouble."

"DataSpider" and "HULFT Square" are easy to use, yet allow you to build processes on a par with full-scale programming. They offer the same high processing power as hand-written code, since processes are internally converted to Java for execution, and they have a long track record supporting corporate IT. They combine the ease of "GUI only" with full-scale capability.

As iPaaS, no in-house operation is needed

DataSpider can be operated securely on a system under your own management. With HULFT Square, a cloud service (iPaaS), this "connecting" technology itself can be used as a cloud service without the need for in-house operation, eliminating the hassle of in-house implementation and system operation.

Related keywords (for further understanding)

Keywords related to data integration and system integration

  • EAI
    • It is a concept of "connecting" systems by data integration, and is a means of freely connecting various data and systems. It is a concept that has been used since long before the cloud era as a way to effectively utilize IT.
  • ETL
    • In the recent trend of actively working on data utilization, the majority of the work is not the data analysis itself, but rather the collection and preprocessing of data scattered around, from on-premise to cloud. This is a means to carry out such processing efficiently.
  • iPaaS
    • A cloud service that "connects" various clouds with external systems and data simply by operating on a GUI is called iPaaS.

Are you interested in "iPaaS" and "connecting" technologies?

Try out our products that allow you to freely connect various data and systems, from on-premise IT systems to cloud services, and make successful use of IT.

The ultimate "connecting" tool: data integration software "DataSpider" and data integration platform "HULFT Square"

"DataSpider," a data integration tool developed and sold by our company, is a "connecting" tool with a long history of success. "HULFT Square," a data integration platform, is a "connecting" cloud service built on DataSpider technology.

Another feature is that development can be done using only the GUI (no code) without writing code like in regular programming, so business staff who have a good understanding of their company's business can take the initiative to use it.

Try out the "connecting" technology of DataSpider / HULFT Square:

There are many simple collaboration tools on the market, but this tool can be used with just a GUI, is easy enough for even non-programmers to use, and has "high development productivity" and "full-fledged performance that can serve as the foundation for business (professional use)."

It can smoothly solve the problem of "connecting disparate systems and data" that hinders successful IT utilization. We regularly hold free trial versions and hands-on sessions where you can try it out for free, so we hope you will give it a try.


Why not try a PoC to see if "HULFT Square" can transform your business?

Why not try verifying how "connecting" can be utilized in your business, the feasibility of solving problems using data integration, and the benefits that can be obtained?

  • I want to automate data integration with SaaS, but first confirm that it is feasible
  • We want to move forward with data utilization, but have issues with system integration
  • I want to consider a data integration platform as part of achieving DX
