Why DWH bills are so high: Reassessing cloud costs with ETL x ELT

As data-driven management advances, rising cloud costs are straining corporate budgets at a growing number of companies. A major driver of these costs is heavy join processing performed inside the data warehouse (DWH).
In this column, we introduce a way to reduce DWH resource consumption and keep costs under control through preprocessing with ETL.


The "hidden costs" behind data utilization

As data-driven management becomes more prevalent, the amount of data handled and the frequency of analysis are increasing explosively, resulting in a rapid increase in the burden on system platforms and infrastructure costs.

In particular, with the advancement of machine learning (AI/ML) and generative AI, the datasets required for model training are becoming increasingly large and complex. Furthermore, the widespread use of unstructured data such as text and images in addition to traditional structured data is also a major factor driving up storage and processing costs.

Some readers of this column may already be watching their cloud bills balloon beyond initial expectations. In a cloud-native environment, slow data processing not only delays decision-making; the longer processing drags on, the more resources it consumes, so the cost impact is significant. Data utilization is supposed to generate profit, yet before you know it, costs keep climbing... How can you avoid this situation? This article digs into the underlying causes and countermeasures.

The culprit behind the increased costs

In DWH and distributed database environments, the majority of processing costs come from "joining data together" rather than simply filtering data.

Distributed data processing platforms such as Snowflake, Google BigQuery, and Amazon Redshift store large volumes of data across multiple servers (nodes). To perform a join, the system must physically gather rows with matching keys in the same location before it can match them. Sending large amounts of data between nodes for this purpose is called "shuffling."
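The mechanics above can be sketched in a few lines of Python. This is a toy model of hash-partition shuffling, not how Snowflake, BigQuery, or Redshift actually implement it; the table contents and node count are invented purely for illustration.

```python
from collections import defaultdict

def shuffle_partition(rows, key, num_nodes):
    """Redistribute rows so that all rows sharing a join key
    land on the same node (hash partitioning). In a real cluster,
    every row routed here may cross the network."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return dict(nodes)

# Hypothetical tables: 6 orders referencing 3 customers.
orders = [{"order_id": i, "customer_id": i % 3} for i in range(6)]
customers = [{"customer_id": c, "name": f"c{c}"} for c in range(3)]

# BOTH tables must be repartitioned on the join key before matching:
order_parts = shuffle_partition(orders, "customer_id", num_nodes=2)
cust_parts = shuffle_partition(customers, "customer_id", num_nodes=2)

# After the shuffle, each node can join its slice locally.
joined = []
for node, node_orders in order_parts.items():
    lookup = {c["customer_id"]: c for c in cust_parts.get(node, [])}
    for o in node_orders:
        if o["customer_id"] in lookup:
            joined.append({**o, **lookup[o["customer_id"]]})
print(len(joined))
```

The point of the sketch is that the cost is in the repartitioning step, not the matching step: both inputs move across the network before a single row is joined.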

This "shuffle" is one of the causes of unexpectedly high cloud costs. According to an official Snowflake case study, an unoptimized join can increase network transfer volume more than tenfold (from roughly 3 GB to roughly 39 GB), and this spike in data movement is a major factor in degrading performance and consuming resources.

Shuffling also affects processing time: as the number of joins grows, the processing load can grow quadratically. In one case, the initial join consumed only about 610 slot-seconds, while the final step, which repeated joins, consumed approximately 37,000 slot-seconds, roughly 60 times more.

In a business where quick decision-making is required, designs that rely on joins, such as "joining because it's necessary for analysis," can unknowingly waste infrastructure costs and undermine the project's ROI (return on investment).

▼I want to know more about DWH
DWH|Glossary

Modern data integration design philosophy

Now that data joins have become a major cost driver, attention is returning to the difference between the two data-processing design philosophies, "ETL" and "ELT."

The traditionally mainstream ETL (Extract/Transform/Load) method processes and formats data extracted from the source "before putting it into the DWH." On the other hand, ELT (Extract/Load/Transform), which has become popular with the advent of the cloud computing era, refers to a method in which raw data is first loaded into the DWH and then transformed "within the DWH" using its powerful computing resources.

In the past, because the computational resources of DWHs were expensive, "ETL" was required to format data before storing it. Later, with the spread of cloud-based DWHs, "ELT," which involves first loading data and then processing it within the DWH, became popular as a convenient method. However, due to the sharp increase in costs caused by joins as mentioned in the previous section, there are now limits to performing all data processing within the DWH (ELT).

What's important here is "pre-processing" by the ETL tool. Before inputting the data into the DWH, the ETL tool reduces unnecessary data and performs pre-aggregation and cleansing to avoid heavy joins. By performing this "pre-processing," it is possible to reduce unnecessary consumption of computing resources (credits and slots) on the DWH side.

In other words, a hybrid approach that combines "speedy ELT" with "controlled and cost-optimized ETL" in the right places is one of the key points in modern data engineering.

▼I want to know more about ETL
ETL|Glossary

"Pre-processing" to improve DWH costs

Now that ELT has become mainstream, the biggest advantage of running an ETL tool in front of the DWH is that it keeps expensive DWH compute from being spent on unnecessary shuffling.

For example, you can reduce the data's row count and cardinality (the number of distinct values in a column) in advance. The biggest cause of slow joins is sheer record volume. By filtering out unneeded rows and aggregating minute-level data to daily granularity in the ETL tool before loading, you minimize the size of the tables that must be joined inside the DWH.
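The minute-to-daily aggregation described above can be sketched as follows. The sensor-reading schema is hypothetical; a real ETL tool would do this with its own transformation steps rather than hand-written Python.

```python
from collections import defaultdict

def pre_aggregate(minute_rows):
    """Collapse minute-level readings to one row per (day, sensor)
    BEFORE loading into the DWH, shrinking the join input."""
    daily = defaultdict(float)
    for r in minute_rows:
        day = r["ts"][:10]  # "2024-05-01T00:07" -> "2024-05-01"
        daily[(day, r["sensor"])] += r["value"]
    return [{"day": d, "sensor": s, "total": v}
            for (d, s), v in daily.items()]

# One hour of minute-level data for a single sensor:
minute_rows = [{"ts": f"2024-05-01T00:{m:02d}", "sensor": "a", "value": 1.0}
               for m in range(60)]
daily = pre_aggregate(minute_rows)
print(len(minute_rows), "->", len(daily))  # 60 -> 1
```

Sixty input rows become one output row; any downstream join in the DWH now touches 1/60th of the data.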

It is also effective to perform master-data cleansing and small joins outside the DWH. Joining while simultaneously correcting notation inconsistencies inside the DWH only complicates the query. If master cleansing is completed in the ETL tool and join keys are normalized in advance, the DWH no longer has to run complex transformations internally, and its optimizer can reliably choose the cheapest execution route.
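As a minimal sketch of what "optimizing join keys in advance" can mean, the hypothetical normalizer below trims whitespace (including full-width spaces) and unifies case so that variant spellings of the same key collapse to one value. A real ETL tool would offer far richer cleansing rules.

```python
def normalize_key(raw):
    """Cleanse a join key in the ETL layer: replace full-width spaces,
    trim, and upper-case so variant spellings join correctly."""
    return raw.replace("\u3000", " ").strip().upper()

# Three spellings of the same master key (hypothetical data):
dirty = ["  abc-001", "ABC-001\u3000", "Abc-001"]
cleaned = {normalize_key(k) for k in dirty}
print(cleaned)  # {'ABC-001'}
```

Done before the load, this normalization costs a single pass over the data; done inside the DWH, the same fix would be re-applied (and re-billed) on every query that joins on the key.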

If the processing amounts to attaching a few master attributes to transaction data, it is better to complete the join on the ETL side during the load and load the result as a wide table (a single table in which the needed fields are already joined horizontally). This physically eliminates shuffling on the DWH side.
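A sketch of building such a wide table during the load, assuming hypothetical transaction and product-master schemas. The small master fits comfortably in memory on the ETL side, so the lookup is a plain dictionary rather than a distributed join.

```python
def build_wide_table(transactions, master):
    """Attach a few master attributes to each fact row during the load,
    so the DWH never has to shuffle for this lookup."""
    lookup = {m["product_id"]: m for m in master}
    wide = []
    for t in transactions:
        attrs = lookup.get(t["product_id"], {})
        wide.append({**t,
                     "product_name": attrs.get("name", "unknown"),
                     "category": attrs.get("category", "unknown")})
    return wide

transactions = [{"tx_id": 1, "product_id": "p1", "amount": 500},
                {"tx_id": 2, "product_id": "p9", "amount": 120}]
master = [{"product_id": "p1", "name": "Widget", "category": "tools"}]
wide = build_wide_table(transactions, master)
```

Note the explicit "unknown" default: deciding how unmatched keys are handled is part of the ETL design, instead of surfacing later as silent row loss from an inner join in the DWH.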

Join strategies that determine cost

As mentioned in the previous section, the key to optimizing the cost of a DWH is to avoid moving data, so it is important to understand the types and differences of joins performed within a DWH.

①Shuffle hash join

This is the method chosen when joining large tables together. Data distributed across the nodes is redistributed (shuffled) over the network based on the join key. Because the volume of data moved is enormous, network bandwidth is consumed and processing time increases dramatically. Here it is essential to design the ETL pipeline to narrow down the data in advance or build intermediate tables, so that large-table joins are not needed in the first place.

②Broadcast join

If one of the tables is relatively small, this method distributes a full copy of that table to all nodes. The larger table does not need to be moved; each node simply checks the master data it has. This minimizes network transfers, making it significantly faster and less expensive than shuffling.

Many DWHs have automatic optimization features, but once a table exceeds a size threshold they fall back to a shuffle. This is where ETL tools help: by cleansing the master data in advance and keeping it lightweight, you can ensure the broadcast join is chosen and minimize DWH charges. Rather than settling for "the SQL works, so it's fine," being aware of whether data is being shuffled or broadcast behind the scenes can make a difference of tens of thousands to hundreds of thousands of yen on your monthly cloud bill.
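The contrast with the shuffle can be sketched in plain Python: the small master table is copied (broadcast) to every node as a lookup dictionary, while the partitioned fact rows never move. This is an illustrative model only; real engines pick the strategy automatically from table statistics.

```python
def broadcast_join(partitioned_facts, small_master):
    """Broadcast join: each 'node' gets a full copy of the small
    master table; the large fact table stays where it is."""
    # The only data sent over the network is this small lookup,
    # replicated once per node.
    lookup = {m["id"]: m["region"] for m in small_master}
    results = []
    for node_rows in partitioned_facts:  # facts are never repartitioned
        for row in node_rows:
            results.append({**row, "region": lookup.get(row["id"])})
    return results

# Two hypothetical nodes, each already holding part of the fact table:
facts = [[{"id": 1, "v": 10}], [{"id": 2, "v": 20}]]
master = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"}]
out = broadcast_join(facts, master)
```

Network cost here scales with the master's size times the node count, instead of with the fact table's size, which is exactly why keeping the master small and clean pays off.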

Summary: Data Engineering is Cost Control

When it comes to data utilization, attention tends to focus on whether data can be accumulated, but how efficiently it can be processed is just as important. By adding ETL preprocessing instead of relying entirely on the convenient ELT approach, you can cut wasteful cloud spend and raise the ROI of data utilization.

The essence of data engineering is not simply to operate a system, but to "optimally control costs and speed" in line with business growth. Why not start by reviewing the cost performance of your company's DWH to see if there is any unnecessary "shuffling" occurring?

Saison Technology Online Consultation

If you would like to hear more about our data utilization platform, we also offer online consultations. Please feel free to contact us!

Make an online consultation

The person who wrote the article

Affiliation: Data Integration Division, Data Engineer

Takayuki Sano

After joining the company, she worked on a data infrastructure construction project for a major manufacturing and construction client, handling everything from requirements definition to maintenance. She now draws on the skills cultivated in that work for her writing. Outside of work, she has many hobbies, including watching comedy, playing the piano, and hula dancing.
(Affiliations are as of the time of publication)
