ZStandard (lossless data compression algorithm)
This glossary explains various keywords that will help you understand the mindset necessary for data utilization and successful DX.
This time, we will take a look at new data compression algorithms and the significance of data compression in data utilization.
What is ZStandard (a lossless data compression algorithm)?
ZStandard (zstd) is a lossless data compression algorithm known for its high speed in both compression and decompression. It was developed by Yann Collet, an engineer at Facebook (now Meta), and its reference implementation was released in 2016.
It is a technology used as a "means of compressing data" in situations such as compressing data into ZIP files, and is characterized by its significantly faster processing speed compared to the deflate algorithm used by ZIP. As it is an openly available technology, the use of ZStandard is increasing.
ZStandard is the latest technology for "ZIP files"
To put it very simply (at the cost of a little accuracy), ZStandard is a new technology that has recently emerged to make data smaller, much like a ZIP file. To be precise, it is a new counterpart to DEFLATE, the compression algorithm used inside ZIP files.
- The "ZIP file" everyone knows
  - File extension: .zip
  - Data compression algorithm: deflate
  - A widely used de facto standard
- ZStandard, the new technology introduced here
  - File extension: .zst, etc.
  - Data compression algorithm: ZStandard
  - An open technology standardized as an RFC
ZStandard is a new technology whose development began in 2015, with the first reference implementation released in 2016. As mentioned earlier, it was developed by Yann Collet, an engineer at Facebook (now Meta). The reference implementation has been released as freely available open source software, and the format has been standardized as an RFC. It is increasingly being adopted in Linux environments, and ZStandard support is now included in the Linux kernel itself.
The big difference from ZIP (deflate) is that compression and decompression are overwhelmingly faster, often by an order of magnitude.
How is "lossless data compression" possible?
There are two main types of data compression: lossless and lossy.
Lossy compression is the term used to describe most image and video compression algorithms. It is a technique that "does not significantly affect the way the image looks to the human eye, but significantly reduces the data size." While this often results in a significant reduction in size, the original data cannot be restored after compression.
In contrast, lossless compression algorithms are a technology that reduces data size while preserving all of the original data. When you think about it, this technology achieves something strange: reducing data size without losing any data. In fact, when I first learned about compression technology, I thought it was like magic.
Run Length Encoding (RLE)
The need for data compression has existed for a long time. Even in the early days of IT, the amount of data that could be stored in a computer's main memory or on external media was extremely limited, so there was a pressing need to store data compactly and save space.
For example, in the days of the 8-bit Famicom, the original Dragon Quest had to fit the entire game into just 64KB (kilobytes, not megabytes).
In those days, a common method for storing data compactly was to "fold consecutive data to make it smaller." For example, if you have data like "AAAABBCCDDDDD," you can reduce the size by folding consecutive characters of the same type to "A4B2C2D5."
For example, in a game's map data, a large grassland area is stored as the same tile value repeated over and over, and in graphics, a black background becomes an endless run of zeros. Storing such runs verbatim is obviously wasteful, so the idea is to "fold" the repeated data to eliminate the waste. Because the idea is simple, it is also easy for the CPU to process.
Mechanisms that fold consecutive data like this are what you naturally arrive at when first thinking about compression, and RLE has been used for a very long time. Even the lossless image file format "BMP" has a mode that applies RLE compression when the number of colors is small (the "16 colors" or "256 colors" modes, where the same data is likely to repeat).
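As a toy illustration, here is a minimal RLE encoder and decoder in Python. The `<char><count>` text format matches the "A4B2C2D5" example above; real RLE implementations use compact binary formats instead.

```python
import re

def rle_encode(data: str) -> str:
    """Fold each run of identical characters into <char><count>."""
    if not data:
        return ""
    out = []
    prev, count = data[0], 1
    for ch in data[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}{count}")
            prev, count = ch, 1
    out.append(f"{prev}{count}")
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Expand each <char><count> pair back into a run of characters."""
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))

print(rle_encode("AAAABBCCDDDDD"))  # → A4B2C2D5
```

Note that RLE backfires on data without runs: "ABCD" becomes "A1B1C1D1", twice the size, which is why general-purpose algorithms use the more robust techniques described next.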
Dictionary Compression Algorithm
ZStandard (and ZIP's deflate) achieves advanced and versatile data compression based on two main concepts. One of them is a dictionary-based compression method that assumes that similar data patterns appear repeatedly in data.
For example, let's say you are creating a list of companies in Minato Ward, Tokyo. A large number of "same patterns" will appear in the data.
- The string "Kabushiki Kaisha" appears frequently
- The string "Minato Ward, Tokyo" appears frequently
There are probably also various "repeated patterns" in postal codes and telephone numbers.
If recording "Minato Ward, Tokyo" hundreds of times seems like a waste of space, why not record "Minato Ward, Tokyo" only once in the "Dictionary" and refer to the dictionary for each occurrence? This should save space. Alternatively, you could record "Minato Ward, Tokyo" only for the first occurrence, and then indirectly reference the string data with a pointer for subsequent occurrences.
You may have noticed that compressing text data into a ZIP file reduces its size dramatically, while some binary data barely shrinks at all. If you think about the situations in which dictionary-based compression works best, namely data full of repeated patterns, you can understand why.
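The back-reference idea can be sketched as a naive LZ77-style tokenizer. This is a greedy toy version that emits a plain token list; real codecs use sophisticated match-finders and bit-packed output, but the dictionary idea is the same.

```python
def lz_compress(data: bytes, window: int = 4096, min_match: int = 4):
    """Greedily replace repeated byte sequences with (distance, length)
    back-references into the already-seen portion of the data."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz_decompress(tokens) -> bytes:
    """Replay literals and copy back-referenced bytes one at a time
    (byte-by-byte copying lets references overlap the output)."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

data = b"Acme Inc., Minato Ward, Tokyo / Beta Inc., Minato Ward, Tokyo"
tokens = lz_compress(data)
print(len(data), "bytes ->", len(tokens), "tokens")
```

The second occurrence of ", Minato Ward, Tokyo" collapses into a single back-reference token, exactly the "record it once, then point at it" idea described above.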
Entropy Coding
Another mechanism is compression that "varies the code length of each piece of data so as to encode it optimally."
For example, let's say you have 100 characters of data consisting of four characters, A to D. If you were to store the data normally, you would encode it into a binary bit string like this, resulting in 2 bits x 100 characters = 200 bits of data.
- A:00
- B:01
- C:10
- D:11
However, suppose the frequency of occurrence of the data was as follows:
- A: 97 characters
- B: 1 character
- C: 1 character
- D: 1 character
This kind of bias is common in commonly used data. For example, in English text, the letter "e" appears frequently, but letters like "q" appear only rarely. What happens if we turn the above example into a bit string using a variable-length representation like this?
- A:0
- B:100
- C:101
- D:110
When calculated together with the frequency of occurrence, data that was originally 200 bits (2 x 100) can be reduced to 106 bits (1 x 97 + 3 x 1 + 3 x 1 + 3 x 1) while retaining its content.
I won't go into a detailed explanation because it would be too complicated, but this compression method is grounded in Shannon's source coding theorem. By choosing codes cleverly so that each bit carries as much information as possible (the amount you "learn" from seeing whether that bit is 0 or 1), the information entropy per bit is maximized, and the data length approaches the theoretical minimum given by Shannon's theory.
In the example data, A appears almost every time, so A is given the shortest code. The other symbols would be surprising if they appeared, and this method assigns longer codes to such "more surprising" information.
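The classic way to build such a variable-length code is Huffman coding, which DEFLATE uses (ZStandard combines Huffman coding with FSE, a different entropy coder). Here is a minimal sketch. Note that a true Huffman code for the example actually reaches 105 bits, one better than the illustrative 106-bit code above, because D can take a 2-bit code.

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    """Build a Huffman code table: repeatedly merge the two least
    frequent subtrees, prefixing '0'/'1' onto their symbols' codes."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = {"A": 97, "B": 1, "C": 1, "D": 1}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, total_bits)  # the frequent symbol A gets a 1-bit code; total 105
```

ZStandard and DEFLATE apply exactly this step to the literal/back-reference tokens produced by the dictionary stage, which is why the two techniques are combined.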
Why using something other than ZIP doesn't dramatically change the compressed size
Research into more efficient data compression continues to this day. ZIP has been around for a long time, and new formats have been proposed ever since it appeared. For example, there was a time when RAR was popular among some users because it compressed better than ZIP. Microsoft installers use CAB instead of ZIP, and you may have seen 7-Zip (.7z) files more recently.
Although many new methods have been proposed, ZIP is still widely used today, because adopting a new method rarely improves the compressed size dramatically. For example, RAR does compress better than ZIP, but only slightly; there was never a dramatic improvement such as shrinking files to one-tenth of ZIP's size.
Why haven't we seen dramatic improvements? First of all, lossless compression cannot be made infinitely small; there is a theoretical limit. Also, both ZIP (deflate) and ZStandard, as well as subsequent lossless compression algorithms, are basically achieved by combining a dictionary-based compression algorithm and entropy coding, and the technology was already quite mature by the time ZIP was introduced.
ZStandard: A different direction of improvement and new use cases
On the other hand, ZStandard is not a new method that thoroughly pursues size reduction. Instead, it pursues "drastically improving processing speed while maintaining the same compression ability," which has attracted attention.
ZIP (deflate) tends to take a long time to process, especially during compression. ZStandard, by contrast, compresses at high speed, often more than an order of magnitude faster, while producing roughly the same compressed size.
You might be thinking, "Is it really that revolutionary just because it's faster?" However, that may be because we're still thinking about use cases based on the assumption that compression takes time. For example, we know that compression can save storage space, but we don't compress all the data on our computers. Why? Perhaps it's because we think it would be inconvenient because the compression process takes time every time.
ZStandard has the potential to unlock the potential of data compression technology, freeing it from such constraints. It should be able to be used in many applications where traditional methods such as ZIP (deflate) take too long to process and are inconvenient.
It will be possible to use it for a wider range of purposes
What would you think if you heard that the next version of Windows would "zip all data on your PC"? It might save space, but wouldn't you worry that it would be too heavy and difficult to use? Since the compression process seems like a heavy load, you might be concerned about using it for too many purposes. If the processing load is significantly smaller, it will be easier to use for a wider range of purposes than before.
Unreserved use with large data sets
For example, suppose a large number of gigabyte-sized log files are straining your storage capacity. Compressing them reduces them to 1/50th of their original size, saving space in one go, but takes 3 minutes. That is effective, but 3 minutes of high load is a concern, and if another compression job starts during that window, it may even hurt the main workload. Compressing the same files with ZStandard yields roughly the same compressed size, but takes 7 seconds. The only difference is processing speed, yet the hurdle to using compression drops dramatically.
Used for processing that requires real-time performance
Data compression can be used in data transfer processing. Data compression reduces the amount of data that needs to be transferred, potentially shortening the "communication time" until the transfer is complete. However, even if communication time is reduced from 10 seconds to 5 seconds, if the data compression process takes 20 seconds, the overall processing time will increase, which will have the opposite effect. What if it could be compressed to 1 second? It could be used more widely in applications that require real-time performance, such as data communication.
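The break-even reasoning above is just arithmetic; here is a sketch using the illustrative numbers from the paragraph (they are examples, not measurements):

```python
def total_time(compress_s: float, transfer_s: float) -> float:
    """Total elapsed time when compression must finish before transfer starts."""
    return compress_s + transfer_s

no_compression = total_time(0, 10)   # send raw: 10 s
slow_codec     = total_time(20, 5)   # halves transfer time, but 25 s overall: worse
fast_codec     = total_time(1, 5)    # same transfer saving in 6 s overall: a win
print(no_compression, slow_codec, fast_codec)
```

Compression only pays off when the time it saves on the wire exceeds the time it costs on the CPU, which is exactly the margin ZStandard widens.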
ZStandard usage example: file transfer
The use of IT in business is becoming more and more common, and as this happens, data integration is becoming increasingly important. Data integration is necessary not only for processing across multiple internal IT systems and cloud services, but also for inter-company IT use, where data integration with many other companies' systems, which may be completely different from your own, becomes necessary.
When it comes to IT systems that are responsible for business operations, if data integration stops, business operations will stop, and if data integration is delayed, business operations will be delayed. Furthermore, each company's system may have different technologies, and it is common for data to be stored in different ways due to differences in industry. A method is needed to maintain long-term integration with environments with a wide variety of circumstances. System integration via files (file integration), which is well suited to such situations, is still widely used today.
Data transfers between companies often go via slow communication lines, and the communication partner may be on the other side of the world. Even when linking internal systems, it is not uncommon for large volumes of data to be processed continuously. In such situations, if data compression can significantly reduce the size of the files being transferred, a significant improvement can be expected.
ZStandard reduces the negative impact of compression latency
Although compressing files reduces transfer size, ZIP compression itself is heavy: the delay until compression finishes, plus the processing load on the server, can actually reduce transfer capacity (for example, a server that could send 3,600 files within a given time window might manage only 600 after adding ZIP compression).
With ZStandard, you can expect compression to be more than 10 times faster than ZIP (deflate), so there is far less risk of adverse effects from compression, and performance should often improve.
ZStandard is a technology that will continue to be used in the future
ZStandard is an openly standardized technology known as RFC 8878, and is even adopted by the Linux kernel itself. It is expected that this technology will continue to be used around the world, and related software will continue to be provided, making it a technology that can be used safely and easily in the long term.
HULFT, a file transfer middleware that supports ZStandard
HULFT is the de facto standard middleware for building a data integration platform through file integration, with an overwhelming track record in Japan. The product also supports ZStandard.
HULFT is used by all Japanese financial institutions
Naturally, financial institutions' IT systems require a very high level of safety, security, and reliability. HULFT has a proven track record of being used by all Japanese financial institutions (all member companies of the Japan Bankers Association).
HULFT has long been the de facto standard for file integration platforms in Japan's mission-critical and core systems, with an overwhelming track record in fields where safety, security, and reliability are required.
Overwhelming data integration with old systems
HULFT has a long history and an overwhelming track record of supporting older technologies, such as mainframes, legacy hardware environments, and the various commercial UNIX systems that were widely used in the past. It is also adept at converting data specific to older environments, such as fixed-length records, variable-length records, and EBCDIC Japanese data containing external (gaiji) characters, into the data formats commonly used today.
It is usually difficult for new IT environments such as Windows, Linux, and even cloud computing to coexist smoothly with older IT systems. However, it is not uncommon for useful business data to be stored in older environments. With HULFT, it is possible to effectively combine and utilize new and old technologies from different generations in a loosely coupled manner. For example, engineers who only know Linux can achieve data integration with mainframes, and engineers who only know mainframes can achieve data integration with the cloud.
"HULFT 10" supports the latest technologies
Because HULFT is a product with a long history, some people may have the impression that it is an old technology, but the latest HULFT (especially HULFT 10) is steadily adapting to the latest IT trends in line with the evolution of the times.
We continue to evolve the product to work naturally with modern systems: it has been restructured into a file integration platform suited to cloud-native architectures, including file exchange with object storage services such as AWS S3 and with microservice architectures built on container technology.
It can be used as a platform that allows new and old IT systems, even those from very different technological generations, to work together without any technical difficulties.
Furthermore, we are continually working to support new global standard technologies of the new era, such as the ZStandard data compression algorithm introduced here, and AES encryption, which has become the de facto standard in encryption technology. By updating your existing HULFT to the latest HULFT, you can update your data integration platform to support the latest technologies.
Related keywords (for further understanding)
- File Linkage
- It is a means of communication that serves as the foundation for IT systems that support various corporate activities. When it comes to data handled in relation to business, especially when it comes to utilizing IT related to administrative processing and accounting, exchanging data in file format is very common.
- MFT(Managed File Transfer)
- This is a collaboration platform that realizes file-based collaboration processing with a high level of "safety, security, and reliability" that can support business activities. It can be used as a means of realizing IT systems that require a high level of reliability, such as error-free operations and audit responses.
Are you interested in HULFT and file transfer (MFT)?
If you are interested, please try out the product that brings the world of file sharing to life.
The definitive MFT "HULFT"
Please try out HULFT, the pinnacle of domestic MFT products with an overwhelming track record in Japan and the de facto standard for file integration platforms.
It has an overwhelming track record, having been used for many years as the infrastructure for financial institutions that require the highest level of support for their IT systems. A world where all environments are connected by files can be created in an instant.
HULFT is now compatible with the latest IT environments, including integration with cloud services, and is used in situations where high performance is required, such as high-speed transfer of large files and transfer processing of large volumes of files.
"HULFT WebConnect" enables safe and secure file transfer via the Internet
HULFT WebConnect is a cloud service that lets you use HULFT's safe, secure, and reliable file transfer over the Internet.
You can achieve a solid enterprise-class file sharing method not only between your own company's locations, but also between overseas branches and business partners, all using just a regular Internet connection.
- Transfers via HULFT can be made across the Internet.
- Low cost as there is no need for costly dedicated lines or VPNs
- Because it is a cloud service, you can start using it immediately without having to perform any operational work yourself.
- The specifications are designed to be audit-friendly, such as not leaving any information on the transfer path.
- There are functions that are designed for cases where communication partners are multiple companies.
- It has functions that can be used as a foundation for securely exchanging invoices, purchase orders, etc. with multiple business partners (including an easy-to-use dedicated client).
Glossary Column List
Alphanumeric characters and symbols
- The Cliff of 2025
- 5G
- AI
- API [Detailed version]
- API Infrastructure and API Management [Detailed Version]
- BCP
- BI
- BPR
- CCPA (California Consumer Privacy Act) [Detailed Version]
- Chain-of-Thought Prompting [Detailed Version]
- ChatGPT (Chat Generative Pre-trained Transformer) [Detailed version]
- CRM
- CX
- D2C
- DBaaS
- DevOps
- DWH [Detailed version]
- DX certified
- DX stocks
- DX Report
- EAI [Detailed version]
- EDI
- EDINET [Detailed version]
- ERP
- ETL [Detailed version]
- Excel Linkage [Detailed version]
- Few-shot prompting / Few-shot learning [detailed version]
- FIPS140 [Detailed version]
- FTP
- GDPR (EU General Data Protection Regulation) [Detailed version]
- Generated Knowledge Prompting (Detailed Version)
- GIGA School Initiative
- GUI
- IaaS [Detailed version]
- IoT
- iPaaS [Detailed version]
- MaaS
- MDM
- MFT (Managed File Transfer) [Detailed version]
- MJ+ (standard administrative characters) [Detailed version]
- NFT
- NoSQL [Detailed version]
- OCR
- PaaS [Detailed version]
- PCI DSS [Detailed version]
- PoC
- REST API (Representational State Transfer API) [Detailed version]
- RFID
- RPA
- SaaS (Software as a Service) [Detailed version]
- SaaS Integration [Detailed Version]
- SDGs
- Self-translate prompting / "Think in English, then answer in Japanese" [Detailed version]
- SFA
- SOC (System and Organization Controls) [Detailed version]
- Society 5.0
- STEM education
- The Flipped Interaction Pattern (Please ask if you have any questions) [Detailed version]
- UI
- UX
- VUCA
- Web3
- XaaS (SaaS, PaaS, IaaS, etc.) [Detailed version]
- XML
- ZStandard (lossless data compression algorithm) [detailed version]
A row
- Avatar
- Crypto assets
- Ethereum
- Elastic (elasticity/stretchability) [detailed version]
- Autoscale
- Open data (detailed version)
- On-premise [Detailed version]
Ka row
- Carbon Neutral
- Virtualization
- Government Cloud [Detailed Version]
- Availability
- Integrity
- Machine Learning [Detailed Version]
- mission-critical system, core system
- Confidentiality
- Cashless payment
- Symmetric key cryptography / DES / AES (Advanced Encryption Standard) [Detailed version]
- Business automation
- Cloud
- Cloud Migration
- Cloud Native [Detailed Version]
- Cloud First
- Cloud Collaboration [Detailed Version]
- Retrieval Augmented Generation (RAG) [Detailed version]
- In-Context Learning (ICL) [Detailed version]
- Container [Detailed version]
- Container Orchestration [Detailed Version]
Sa row
- Serverless (FaaS) [Detailed version]
- Siloization [Detailed version]
- Subscription
- Supply Chain Management
- Singularity
- Single Sign-On (SSO) [Detailed version]
- Scalable (scale up/scale down) [Detailed version]
- Scale out
- Scale in
- Smart City
- Smart Factory
- Small start (detailed version)
- Generative AI (Detailed version)
- Self-service BI (IT self-service) [Detailed version]
- Loose coupling [detailed version]
Ta row
- Large Language Model (LLM) [Detailed version]
- Deep Learning
- Data Migration
- Data Catalog
- Data Utilization
- Data Governance
- Data Management
- Data Scientist
- Data-driven
- Data analysis
- Database
- Data Mart
- Data Mining
- Data Modeling
- Data Lineage
- Data Lake [Detailed version]
- data integration / data integration platform [Detailed Version]
- Digitization
- Digitalization
- Digital Twin
- Digital Disruption
- Digital Transformation
- Deadlock [Detailed version]
- Telework
- Transfer learning (detailed version)
- Electronic Payment
- Electronic Signature [Detailed Version]
Na row
Ha row
- Hybrid Cloud
- Batch Processing
- Unstructured Data
- Big Data
- File Linkage [Detailed version]
- Fine Tuning [Detailed Version]
- Private Cloud
- Blockchain
- Prompt template [detailed version]
- Vectorization/Embedding [Detailed version]
- Vector database (detailed version)
Ma row
- Marketplace
- migration
- Microservices (Detailed Version)
- Managed Services [Detailed Version]
- Multi-tenant
- Middleware
- Metadata
- Metaverse
Ya row
Ra row
- Leapfrogging (detailed version)
- quantum computer
- Route Optimization Solution
- Legacy System/Legacy Integration [Detailed Version]
- Low-code development (detailed version)
- Role-Play Prompting [Detailed Version]
