ZStandard (lossless data compression algorithm)
This glossary explains various keywords that will help you understand the mindset necessary for data utilization and successful DX.
This time, we will take a look at new data compression algorithms and the significance of data compression in data utilization.
What is ZStandard (a lossless data compression algorithm)?
ZStandard (zstd) is a lossless data compression algorithm known for its high speed in both compression and decompression. It was developed by Yann Collet, an engineer at Facebook (now Meta), and its reference implementation was released in 2016.
It is a technology used as a "means of compressing data" in situations such as compressing data into ZIP files, and is characterized by its significantly faster processing speed compared to the deflate algorithm used by ZIP. As it is an openly available technology, the use of ZStandard is increasing.
ZStandard is the latest technology for "ZIP files"
To put it very simply (at the cost of a little accuracy), ZStandard is a new technology that has recently emerged to make data smaller, much like a ZIP file. To be precise, it is a new counterpart to DEFLATE, the compression algorithm used inside ZIP files.
- The "ZIP file" everyone knows
  - File extension: .zip
  - Data compression algorithm: deflate
  - A widely used de facto standard
- ZStandard, the new technology introduced here
  - File extension: .zst, etc.
  - Data compression algorithm: ZStandard
  - An open technology standardized as an RFC
ZStandard is a new technology whose development began in 2015, with the first reference implementation released in 2016. As mentioned earlier, it was developed by Yann Collet, an engineer at Facebook (now Meta). The reference implementation has been released as freely available open source software, and the format has been standardized as an RFC. It is increasingly being adopted in Linux environments, and ZStandard support is now included in the Linux kernel itself.
The big difference from ZIP (deflate) is that compression and decompression are overwhelmingly faster, often by an order of magnitude.
How is "lossless data compression" possible?
There are two main types of data compression: lossless and lossy.
Lossy compression is the term used to describe most image and video compression algorithms. It is a technique that "does not significantly affect the way the image looks to the human eye, but significantly reduces the data size." While this often results in a significant reduction in size, the original data cannot be restored after compression.
In contrast, lossless compression algorithms are a technology that reduces data size while preserving all of the original data. When you think about it, this technology achieves something strange: reducing data size without losing any data. In fact, when I first learned about compression technology, I thought it was like magic.
Run Length Encoding (RLE)
The need for data compression has existed for a long time. Even in the early days of IT, the amount of data that could be stored in a computer's main memory or on external media was extremely limited, so there was a pressing need to store data compactly and save space.
For example, in the days of the 8-bit Famicom, the original Dragon Quest had to fit the entire game into just 64KB (kilobytes, not megabytes).
In those days, a common method for storing data compactly was to "fold consecutive data to make it smaller." For example, if you have data like "AAAABBCCDDDDD," you can reduce the size by folding consecutive characters of the same type to "A4B2C2D5."
For example, in a game's map data, a large grassland area is stored as the same tile value repeated over and over, and in graphics, a black background becomes an endless run of zeros. Storing such runs verbatim is obviously wasteful, so the idea is to "fold" the repeated data to eliminate the waste. Because the idea is simple, it is also easy for the CPU to process.
Mechanisms that fold consecutive data like this are what you naturally arrive at when first thinking about compression, and RLE has been used for a very long time. Even the lossless image file format "BMP" has a mode that applies RLE compression when the number of colors is small (the "16 colors" or "256 colors" modes, where the same data is likely to repeat).
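As a toy illustration, here is a minimal RLE encoder and decoder in Python. The `<char><count>` text format matches the "A4B2C2D5" example above; real RLE implementations use compact binary formats instead.

```python
import re

def rle_encode(data: str) -> str:
    """Fold each run of identical characters into <char><count>."""
    if not data:
        return ""
    out = []
    prev, count = data[0], 1
    for ch in data[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}{count}")
            prev, count = ch, 1
    out.append(f"{prev}{count}")
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Expand each <char><count> pair back into a run of characters."""
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))

print(rle_encode("AAAABBCCDDDDD"))  # → A4B2C2D5
```

Note that RLE backfires on data without runs: "ABCD" becomes "A1B1C1D1", twice the size, which is why general-purpose algorithms use the more robust techniques described next.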
Dictionary Compression Algorithm
ZStandard (and ZIP's deflate) achieves advanced and versatile data compression based on two main concepts. One of them is a dictionary-based compression method that assumes that similar data patterns appear repeatedly in data.
For example, let's say you are creating a list of companies in Minato Ward, Tokyo. A large number of "same patterns" will appear in the data.
- The string "Kabushiki Kaisha" appears frequently
- The string "Minato Ward, Tokyo" appears frequently
There are probably also various "repeated patterns" in postal codes and telephone numbers.
If recording "Minato Ward, Tokyo" hundreds of times seems like a waste of space, why not record "Minato Ward, Tokyo" only once in the "Dictionary" and refer to the dictionary for each occurrence? This should save space. Alternatively, you could record "Minato Ward, Tokyo" only for the first occurrence, and then indirectly reference the string data with a pointer for subsequent occurrences.
You may have noticed that compressing text data into a ZIP file reduces its size dramatically, while some binary data barely shrinks at all. If you think about the situations in which dictionary-based compression works best, namely data full of repeated patterns, you can understand why.
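The back-reference idea can be sketched as a naive LZ77-style tokenizer. This is a greedy toy version that emits a plain token list; real codecs use sophisticated match-finders and bit-packed output, but the dictionary idea is the same.

```python
def lz_compress(data: bytes, window: int = 4096, min_match: int = 4):
    """Greedily replace repeated byte sequences with (distance, length)
    back-references into the already-seen portion of the data."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz_decompress(tokens) -> bytes:
    """Replay literals and copy back-referenced bytes one at a time
    (byte-by-byte copying lets references overlap the output)."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

data = b"Acme Inc., Minato Ward, Tokyo / Beta Inc., Minato Ward, Tokyo"
tokens = lz_compress(data)
print(len(data), "bytes ->", len(tokens), "tokens")
```

The second occurrence of ", Minato Ward, Tokyo" collapses into a single back-reference token, exactly the "record it once, then point at it" idea described above.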
Entropy Coding
Another mechanism is compression that "varies the code length of each piece of data so as to encode it optimally."
For example, let's say you have 100 characters of data consisting of four characters, A to D. If you were to store the data normally, you would encode it into a binary bit string like this, resulting in 2 bits x 100 characters = 200 bits of data.
- A:00
- B:01
- C:10
- D:11
However, suppose the frequency of occurrence of the data was as follows:
- A: 97 characters
- B: 1 character
- C: 1 character
- D: 1 character
This kind of bias is common in commonly used data. For example, in English text, the letter "e" appears frequently, but letters like "q" appear only rarely. What happens if we turn the above example into a bit string using a variable-length representation like this?
- A:0
- B:100
- C:101
- D:110
When calculated together with the frequency of occurrence, data that was originally 200 bits (2 x 100) can be reduced to 106 bits (1 x 97 + 3 x 1 + 3 x 1 + 3 x 1) while retaining its content.
I won't go into a detailed explanation because it would be too complicated, but this compression method is grounded in Shannon's source coding theorem. By choosing codes cleverly so that each bit carries as much information as possible (the amount you "learn" from seeing whether that bit is 0 or 1), the information entropy per bit is maximized, and the data length approaches the theoretical minimum given by Shannon's theory.
In the example data, A appears almost every time, so A is given the shortest code. The other symbols would be surprising if they appeared, and this method assigns longer codes to such "more surprising" information.
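The classic way to build such a variable-length code is Huffman coding, which DEFLATE uses (ZStandard combines Huffman coding with FSE, a different entropy coder). Here is a minimal sketch. Note that a true Huffman code for the example actually reaches 105 bits, one better than the illustrative 106-bit code above, because D can take a 2-bit code.

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    """Build a Huffman code table: repeatedly merge the two least
    frequent subtrees, prefixing '0'/'1' onto their symbols' codes."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = {"A": 97, "B": 1, "C": 1, "D": 1}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, total_bits)  # the frequent symbol A gets a 1-bit code; total 105
```

ZStandard and DEFLATE apply exactly this step to the literal/back-reference tokens produced by the dictionary stage, which is why the two techniques are combined.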
Why using something other than ZIP doesn't dramatically change the compressed size
Research into more efficient data compression continues to this day. ZIP has been around for a long time, and new formats have been proposed ever since it appeared. For example, there was a time when RAR was popular among some users because it compressed better than ZIP. Microsoft installers use CAB instead of ZIP, and you may have seen 7-Zip (.7z) files more recently.
Although many new methods have been proposed, ZIP is still widely used today, because adopting a new method rarely improves the compressed size dramatically. For example, RAR does compress better than ZIP, but only slightly; there was never a dramatic improvement such as shrinking files to one-tenth of ZIP's size.
Why haven't we seen dramatic improvements? First of all, lossless compression cannot be made infinitely small; there is a theoretical limit. Also, both ZIP (deflate) and ZStandard, as well as subsequent lossless compression algorithms, are basically achieved by combining a dictionary-based compression algorithm and entropy coding, and the technology was already quite mature by the time ZIP was introduced.
ZStandard: A different direction of improvement and new use cases
On the other hand, ZStandard is not a new method that thoroughly pursues size reduction. Instead, it pursues "drastically improving processing speed while maintaining the same compression ability," which has attracted attention.
ZIP (deflate) tends to take a long time to process, especially during compression. ZStandard, by contrast, compresses at high speed, often more than an order of magnitude faster, while producing roughly the same compressed size.
You might be thinking, "Is it really that revolutionary just because it's faster?" However, that may be because we're still thinking about use cases based on the assumption that compression takes time. For example, we know that compression can save storage space, but we don't compress all the data on our computers. Why? Perhaps it's because we think it would be inconvenient because the compression process takes time every time.
ZStandard has the potential to unlock the potential of data compression technology, freeing it from such constraints. It should be able to be used in many applications where traditional methods such as ZIP (deflate) take too long to process and are inconvenient.
It will be possible to use it for a wider range of purposes
What would you think if you heard that the next version of Windows would "zip all data on your PC"? It might save space, but wouldn't you worry that it would be too heavy and difficult to use? Since the compression process seems like a heavy load, you might be concerned about using it for too many purposes. If the processing load is significantly smaller, it will be easier to use for a wider range of purposes than before.
Unreserved use with large data sets
For example, suppose a large number of gigabyte-sized log files are straining your storage capacity. Compressing them reduces them to 1/50th of their original size, saving space in one go, but takes 3 minutes. That is effective, but 3 minutes of high load is a concern, and if another compression job starts during that window, it may even hurt the main workload. Compressing the same files with ZStandard yields roughly the same compressed size, but takes 7 seconds. The only difference is processing speed, yet the hurdle to using compression drops dramatically.
Used for processing that requires real-time performance
Data compression can be used in data transfer processing. Data compression reduces the amount of data that needs to be transferred, potentially shortening the "communication time" until the transfer is complete. However, even if communication time is reduced from 10 seconds to 5 seconds, if the data compression process takes 20 seconds, the overall processing time will increase, which will have the opposite effect. What if it could be compressed to 1 second? It could be used more widely in applications that require real-time performance, such as data communication.
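The break-even reasoning above is just arithmetic; here is a sketch using the illustrative numbers from the paragraph (they are examples, not measurements):

```python
def total_time(compress_s: float, transfer_s: float) -> float:
    """Total elapsed time when compression must finish before transfer starts."""
    return compress_s + transfer_s

no_compression = total_time(0, 10)   # send raw: 10 s
slow_codec     = total_time(20, 5)   # halves transfer time, but 25 s overall: worse
fast_codec     = total_time(1, 5)    # same transfer saving in 6 s overall: a win
print(no_compression, slow_codec, fast_codec)
```

Compression only pays off when the time it saves on the wire exceeds the time it costs on the CPU, which is exactly the margin ZStandard widens.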
ZStandard usage example: file transfer
The use of IT in business is becoming more and more common, and as this happens, data integration is becoming increasingly important. Data integration is necessary not only for processing across multiple internal IT systems and cloud services, but also for inter-company IT use, where data integration with many other companies' systems, which may be completely different from your own, becomes necessary.
When it comes to IT systems that are responsible for business operations, if data integration stops, business operations will stop, and if data integration is delayed, business operations will be delayed. Furthermore, each company's system may have different technologies, and it is common for data to be stored in different ways due to differences in industry. A method is needed to maintain long-term integration with environments with a wide variety of circumstances. System integration via files (file integration), which is well suited to such situations, is still widely used today.
Data transfers between companies often go via slow communication lines, and the communication partner may be on the other side of the world. Even when linking internal systems, it is not uncommon for large volumes of data to be processed continuously. In such situations, if data compression can significantly reduce the size of the files being transferred, a significant improvement can be expected.
ZStandard reduces the negative impact of compression latency
Although compressing files reduces transfer size, ZIP compression itself is heavy: the delay until compression finishes, plus the processing load on the server, can actually reduce transfer capacity (for example, a server that could send 3,600 files within a given time window might manage only 600 after adding ZIP compression).
With ZStandard, you can expect compression to be more than 10 times faster than ZIP (deflate), so there is far less risk of adverse effects from compression, and performance should often improve.
ZStandard is a technology that will continue to be used in the future
ZStandard is an openly standardized technology known as RFC 8878, and is even adopted by the Linux kernel itself. It is expected that this technology will continue to be used around the world, and related software will continue to be provided, making it a technology that can be used safely and easily in the long term.
HULFT, a file transfer middleware that supports ZStandard
HULFT is the de facto standard middleware for building a data integration platform through file integration, with an overwhelming track record in Japan. The product also supports ZStandard.
HULFT is used by all Japanese financial institutions
Naturally, financial institutions' IT systems require a very high level of safety, security, and reliability. HULFT has a proven track record of being used by all Japanese financial institutions (all member companies of the Japan Bankers Association).
HULFT has long been the de facto standard for file integration platforms in Japan's mission-critical and core systems, with an overwhelming track record in fields where safety, security, and reliability are required.
Overwhelming data integration with old systems
HULFT has a long history and an overwhelming track record of supporting older technologies, such as mainframes, legacy hardware environments, and the various commercial UNIX systems that were widely used in the past. It is also adept at converting data specific to older environments, such as fixed-length records, variable-length records, and EBCDIC Japanese data containing external (gaiji) characters, into the data formats commonly used today.
It is usually difficult for new IT environments such as Windows, Linux, and even cloud computing to coexist smoothly with older IT systems. However, it is not uncommon for useful business data to be stored in older environments. With HULFT, it is possible to effectively combine and utilize new and old technologies from different generations in a loosely coupled manner. For example, engineers who only know Linux can achieve data integration with mainframes, and engineers who only know mainframes can achieve data integration with the cloud.
"HULFT 10" supports the latest technologies
Because HULFT is a product with a long history, some people may have the impression that it is an old technology, but the latest HULFT (especially HULFT 10) is steadily adapting to the latest IT trends in line with the evolution of the times.
We continue to evolve the product to work naturally with modern systems: it has been restructured into a file integration platform suited to cloud-native architectures, including file exchange with object storage services such as AWS S3 and with microservice architectures built on container technology.
It can be used as a platform that allows new and old IT systems, even those from very different technological generations, to work together without any technical difficulties.
Furthermore, we are continually working to support new global standard technologies of the new era, such as the ZStandard data compression algorithm introduced here, and AES encryption, which has become the de facto standard in encryption technology. By updating your existing HULFT to the latest HULFT, you can update your data integration platform to support the latest technologies.
Related keywords (for further understanding)
- File Linkage
- It is a means of communication that serves as the foundation for IT systems that support various corporate activities. When it comes to data handled in relation to business, especially when it comes to utilizing IT related to administrative processing and accounting, exchanging data in file format is very common.
- MFT(Managed File Transfer)
- This is a collaboration platform that realizes file-based collaboration processing with a high level of "safety, security, and reliability" that can support business activities. It can be used as a means of realizing IT systems that require a high level of reliability, such as error-free operations and audit responses.
Are you interested in HULFT and file transfer (MFT)?
If you are interested, please try out the product that brings the world of file sharing to life.
The definitive MFT "HULFT"
Please try out HULFT, the pinnacle of domestic MFT products with an overwhelming track record in Japan and the de facto standard for file integration platforms.
It has an overwhelming track record, having been used for many years as the infrastructure for financial institutions that require the highest level of support for their IT systems. A world where all environments are connected by files can be created in an instant.
HULFT is now compatible with the latest IT environments, including integration with cloud services, and is used in situations where high performance is required, such as high-speed transfer of large files and transfer processing of large volumes of files.
"HULFT WebConnect" enables safe and secure file transfer via the Internet
HULFT WebConnect is a cloud service that lets you use HULFT's safe, secure, and reliable file transfer over the Internet.
You can achieve a solid enterprise-class file sharing method not only between your own company's locations, but also between overseas branches and business partners, all using just a regular Internet connection.
- Transfers via HULFT can be made across the Internet.
- Low cost as there is no need for costly dedicated lines or VPNs
- Because it is a cloud service, you can start using it immediately without having to perform any operational work yourself.
- The specifications are designed to be audit-friendly, such as not leaving any information on the transfer path.
- There are functions that are designed for cases where communication partners are multiple companies.
- It has functions that can be used as a foundation for securely exchanging invoices, purchase orders, etc. with multiple business partners (including an easy-to-use dedicated client).
Glossary Column List
Alphanumeric characters and symbols
- The Cliff of 2025
- 5G
- AI
- API [Detailed version]
- API Infrastructure and API Management [Detailed Version]
- BCP
- BI
- BPR
- CCPA (California Consumer Privacy Act) [Detailed Version]
- Chain-of-Thought Prompting [Detailed Version]
- ChatGPT (Chat Generative Pre-trained Transformer) [Detailed version]
- CRM
- CX
- D2C
- DBaaS
- DevOps
- DWH [Detailed version]
- DX certified
- DX stocks
- DX Report
- EAI [Detailed version]
- EDI
- EDINET [Detailed version]
- ERP
- ETL [Detailed version]
- Excel Linkage [Detailed version]
- Few-shot prompting / Few-shot learning [detailed version]
- FIPS140 [Detailed version]
- FTP
- GDPR (EU General Data Protection Regulation) [Detailed version]
- Generated Knowledge Prompting (Detailed Version)
- GIGA School Initiative
- GUI
- IaaS [Detailed version]
- IoT
- iPaaS [Detailed version]
- MaaS
- MDM
- MFT (Managed File Transfer) [Detailed version]
- MJ+ (standard administrative characters) [Detailed version]
- NFT
- NoSQL [Detailed version]
- OCR
- PaaS [Detailed version]
- PCI DSS [Detailed version]
- PoC
- REST API (Representational State Transfer API) [Detailed version]
- RFID
- RPA
- SaaS (Software as a Service) [Detailed version]
- SaaS Integration [Detailed Version]
- SDGs
- Self-translate prompting / "Think in English, then answer in Japanese" [Detailed version]
- SFA
- SOC (System and Organization Controls) [Detailed version]
- Society 5.0
- STEM education
- The Flipped Interaction Pattern (Please ask if you have any questions) [Detailed version]
- UI
- UX
- VUCA
- Web3
- XaaS (SaaS, PaaS, IaaS, etc.) [Detailed version]
- XML
- ZStandard (lossless data compression algorithm) [detailed version]
A row
- Avatar
- Crypto assets
- Ethereum
- Elastic (elasticity/stretchability) [detailed version]
- Autoscale
- Open data (detailed version)
- On-premise [Detailed version]
Ka row
- Carbon Neutral
- Virtualization
- Government Cloud [Detailed Version]
- Availability
- Integrity
- Machine Learning [Detailed Version]
- mission-critical system, core system
- Confidentiality
- Cashless payment
- Symmetric key cryptography / DES / AES (Advanced Encryption Standard) [Detailed version]
- Business automation
- Cloud
- Cloud Migration
- Cloud Native [Detailed Version]
- Cloud First
- Cloud Collaboration [Detailed Version]
- Retrieval Augmented Generation (RAG) [Detailed version]
- In-Context Learning (ICL) [Detailed version]
- Container [Detailed version]
- Container Orchestration [Detailed Version]
Sa row
- Serverless (FaaS) [Detailed version]
- Siloization [Detailed version]
- Subscription
- Supply Chain Management
- Singularity
- Single Sign-On (SSO) [Detailed version]
- Scalable (scale up/scale down) [Detailed version]
- Scale out
- Scale in
- Smart City
- Smart Factory
- Small start (detailed version)
- Generative AI (Detailed version)
- Self-service BI (IT self-service) [Detailed version]
- Loose coupling [detailed version]
Ta row
- Large Language Model (LLM) [Detailed version]
- Deep Learning
- Data Migration
- Data Catalog
- Data Utilization
- Data Governance
- Data Management
- Data Scientist
- Data-driven
- Data analysis
- Database
- Data Mart
- Data Mining
- Data Modeling
- Data Lineage
- Data Lake [Detailed version]
- data integration / data integration platform [Detailed Version]
- Digitization
- Digitalization
- Digital Twin
- Digital Disruption
- Digital Transformation
- Deadlock [Detailed version]
- Telework
- Transfer learning (detailed version)
- Electronic Payment
- Electronic Signature [Detailed Version]
Na row
Ha row
- Hybrid Cloud
- Batch Processing
- Unstructured Data
- Big Data
- File Linkage [Detailed version]
- Fine Tuning [Detailed Version]
- Private Cloud
- Blockchain
- Prompt template [detailed version]
- Vectorization/Embedding [Detailed version]
- Vector database (detailed version)
Ma row
- Marketplace
- migration
- Microservices (Detailed Version)
- Managed Services [Detailed Version]
- Multi-tenant
- Middleware
- Metadata
- Metaverse
Ya row
Ra row
- Leapfrogging (detailed version)
- quantum computer
- Route Optimization Solution
- Legacy System/Legacy Integration [Detailed Version]
- Low-code development (detailed version)
- Role-Play Prompting [Detailed Version]
