AWS is built on the premise that "anything can break at any time"

This is Watanabe from the marketing department.
In this column, I write casually about various topics related to data, IT, and more.

A major cloud outage was in the news (April 2025 incident)

Shortly before I wrote this (I am writing in mid-May 2025; the incident occurred in April), a cloud service outage took many services offline and became a hot topic around the world. That incident is today's subject.

Specifically, the outage was caused by a major failure at AWS (Amazon Web Services). Many services were affected, with problems reported around the world: smartphone payment services went down and people were suddenly unable to pay, the system used to check in baggage before boarding a flight became unusable, and many smartphone games went into emergency maintenance and became unplayable.

As the use of IT continues to grow in the world, business is becoming increasingly inseparable from IT. The risk of cloud services failing and shutting down is directly linked to the risk of your own business being shut down. So, I'd like to write a bit about "cloud service outages" and "concepts for safety and security in the cloud era."

Cloud services can sometimes become unavailable

It has been some time since the term "DX" (digital transformation) entered common use, and there is a strong sense that companies must make use of IT. Within those efforts, "introducing the cloud" or "moving existing IT to the cloud" is being pursued enthusiastically.

What seems to be overlooked, however, is that although cloud services are convenient in many ways, in practice they often have to be used on the assumption that they may stop working.

Is a "99.9% Service Level Agreement (SLA)" Realistic?

For example, suppose a cloud service advertises a service level agreement (SLA) of 99.9%. The number gives the impression of highly stable operation. In reality, 99.9% still permits 0.1% downtime, which works out to about 8.8 hours a year (8,760 hours × 0.001). That is equivalent to the service being down for a full 24 hours once every few years. Put another way, an outage that starts at 9:00 AM and lasts until the evening, costing you an entire working day, could happen once a year and the level would still be met.
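
The arithmetic behind that figure can be checked with a few lines of Python. This is only a rough sketch: it assumes availability is measured over a full 365-day year, whereas actual SLAs often measure per billing month and define "downtime" in their own terms, and the function name is made up for this column.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) that an availability target still permits."""
    return period_hours * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        per_year = allowed_downtime_hours(sla)
        years_per_full_day = 24 / per_year  # how often a 24-hour outage would fit
        print(f"{sla}% -> {per_year:.2f} h/year, "
              f"one 24-hour outage every {years_per_full_day:.1f} years")
```

Passing a different `period_hours` (for example, the hours in one month) shows how much smaller the allowance becomes when the measurement window is shorter.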

From personal experience, I think that anyone who has been using cloud services for a long time will have a "day when something goes wrong" about once every few years. It is rare to find a service that has gone through many years without any problems.

"Using the cloud" means that you have to assume that there will be days when the cloud is not working properly and you can't use email, or when Slack is not working properly. This is the reality with many cloud services.

That said, the world is full of other risks and uncertainties. Trains sometimes stop running, and disasters happen. It can be pointless to demand perfection from IT alone, and sometimes we simply have to accept that this is how things are.

AWS is built on the premise that "everything will break"

So far, we have been talking from the perspective of cloud service users, that "the reality is that cloud services sometimes become unavailable." Next, let's talk a little about what it means to "maintain stable operation" in the cloud.

For example, suppose you are thinking about buying a car and are worried that it might break down. What would you be told to reassure you that "No, no, it's safe"? You would probably hear something like "We use high-quality parts that won't break down" or "We strictly control quality before shipping." In other words, you would be told not to worry because "it won't break down."

AWS (the cloud) is not built on the idea that "it will not break"

Take AWS, for example. This cloud service is now used by many companies as the foundation for their own IT systems, so it can fairly be regarded as being "at a level that can be used for business." But does AWS actually claim to "use parts that will never break"?

Werner Vogels, CTO of Amazon, famously said, "Everything fails, all the time." In other words, AWS starts from the premise that "anything can break at any time." Its basic approach is not "let's prevent things from breaking" but "in reality, things will break."

The idea is not "it won't break" but "it will break but it will recover"

Of course, this is not an irresponsible attitude of "It broke and the service stopped. That's just how it is, and nothing can be done." Rather than pouring all effort into "preventing breakdowns," the idea is to accept that "anything can break at any time" and keep the system as a whole running continuously by building "a mechanism that recovers quickly even when something does break." In other words, the cloud embodies a different "way of thinking" from traditional systems.

For example, the goal is not to achieve stable operation by buying expensive, highly reliable hardware and striving to "never break down in the first place." Instead, multiple virtual machines (Amazon EC2) are kept running at all times and continuously checked to confirm they are working (health monitoring). When a virtual machine is judged to have failed, it is detached immediately and a new one is started in its place, so that operation continues.

  • A system and concept that keeps multiple virtual machines (Amazon EC2) running at all times, and if one breaks down, a replacement virtual machine can be used immediately to maintain operation.
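
To make the "detect, detach, replace" cycle concrete, here is a toy simulation in Python. It does not call AWS at all: `VirtualMachine` and `reconcile` are names invented for this column, and the `healthy` flag stands in for whatever a real health check would report. On AWS itself, this job would typically be left to a managed feature such as Amazon EC2 Auto Scaling rather than hand-written code.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple source of unique machine IDs for the simulation


@dataclass
class VirtualMachine:
    """Stand-in for a virtual machine; `healthy` is what a health check would report."""
    vm_id: int = field(default_factory=lambda: next(_ids))
    healthy: bool = True


def reconcile(fleet: list[VirtualMachine], desired: int) -> list[VirtualMachine]:
    """One monitoring pass: drop failed machines, then start replacements."""
    survivors = [vm for vm in fleet if vm.healthy]  # detach anything that failed
    while len(survivors) < desired:                 # top up to the desired count
        survivors.append(VirtualMachine())
    return survivors


if __name__ == "__main__":
    fleet = [VirtualMachine() for _ in range(3)]
    fleet[1].healthy = False                        # simulate one machine failing
    fleet = reconcile(fleet, desired=3)
    print([vm.vm_id for vm in fleet])               # the failed ID is gone, a new one appears
```

The point of the sketch is that nothing here tries to stop a machine from failing. The code only cares that the number of healthy machines returns to the desired count after each pass.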

By adopting this approach, it becomes possible to create systems using inexpensive hardware that can break down over time, making it possible to develop low-cost services.

What happened during the April 2025 outage (whether a multi-AZ configuration was in place)

With that in mind, let's think again about the major outage of April 2025. Now that we understand the concept of "breaking but recovering," you might wonder why an AWS outage brought down services at so many companies.

In fact, not all services that use AWS were uniformly affected; there were some services that were affected by the outage and stopped working, and some that were not affected.

In the previous example, we assumed that a virtual machine would break down and made sure we could switch over immediately if it did. Sometimes, however, that level of preparation is not enough. What if a data center itself fails? (AWS groups one or more data centers into a unit it calls an "Availability Zone," or AZ.) Even with multiple virtual machines running, you can imagine what happens if they all sit in a data center that goes down entirely.

The April 2025 outage took down an entire AWS data center (AZ). The cause was a loss of power: the main power supply failed, the system switched to the backup power supply, and that failed as well, resulting in a "data-center-level outage." Incidentally, a similar outage had occurred in 2019.

  • "Starting multiple virtual machines" prepares you for "a virtual machine breaking down," because you can recover even when one fails. It does not help when "the entire data center breaks down."

In fact, AWS recommends that developers building on AWS "design a system architecture that is prepared for AZ (Availability Zone) failures." In other words, AWS encourages running virtual machines and other components "distributed across multiple AZs (data centers)" at all times, with health monitoring in place (multi-AZ). Services built that way could have stayed up even during the April 2025 failure.

  • When starting multiple virtual machines, start them in different data centers (AZs) rather than all in one. This is a "multi-AZ" configuration.
  • By doing this, even if the data center itself were to break down, it would be possible to prevent all virtual machines from stopping at the same time, and overall service would not be interrupted.
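
The difference between the two layouts can be sketched in a few lines of Python. The zone names `az-a` and `az-b` and both function names are placeholders invented for illustration, and the round-robin placement is just one simple way to spread machines, not a description of how AWS places them.

```python
from collections import Counter
from itertools import cycle


def place_round_robin(vm_count: int, zones: list[str]) -> list[str]:
    """Assign each virtual machine to a zone in turn so no zone holds them all."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(vm_count)]


def survivors_after_zone_failure(placement: list[str], failed_zone: str) -> int:
    """Count the virtual machines still running when one whole zone goes down."""
    return sum(1 for zone in placement if zone != failed_zone)


if __name__ == "__main__":
    single_az = place_round_robin(4, ["az-a"])
    multi_az = place_round_robin(4, ["az-a", "az-b"])
    print(Counter(multi_az))
    print("single-AZ survivors:", survivors_after_zone_failure(single_az, "az-a"))
    print("multi-AZ survivors:", survivors_after_zone_failure(multi_az, "az-a"))
```

With everything in one zone, losing that zone leaves nothing running, no matter how many virtual machines were started. Spread over two zones, half of them survive.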

Conversely, the services that went down in this outage may not have implemented the "multi-AZ configuration" that AWS recommends, or they may have been multi-AZ but unprepared for traffic concentrating on the surviving virtual machines once one AZ was lost.

Social games were noticeably affected. Because they are games, their operators may have decided that "it is acceptable for the service to stop if something happens" (running in a single AZ is cheaper), and deliberately chosen to go without protection.
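
The second possibility, survivors being overwhelmed, comes down to spare capacity. As a rough sketch, the Python below estimates how many virtual machines each AZ would have to run so that the remaining AZs can still carry the peak load after any one AZ is lost. The function name and the load figures are hypothetical, and the model assumes every virtual machine has the same capacity and that load is spread evenly.

```python
import math


def vms_needed_per_zone(peak_load: int, capacity_per_vm: int, zone_count: int) -> int:
    """VMs each zone must run so the other zones can carry the peak load
    even after any single zone is lost."""
    if zone_count < 2:
        raise ValueError("surviving a zone failure needs at least two zones")
    total_needed = math.ceil(peak_load / capacity_per_vm)  # VMs needed to serve the peak
    return math.ceil(total_needed / (zone_count - 1))      # spread over the survivors only


if __name__ == "__main__":
    # Hypothetical numbers: 900 requests/s at peak, 100 requests/s per VM.
    for zones in (2, 3):
        per_zone = vms_needed_per_zone(900, 100, zones)
        print(f"{zones} zones: {per_zone} VMs per zone, {per_zone * zones} in total")
```

Under these assumptions, surviving the loss of a zone means paying for more machines than the peak load alone requires, which is one reason a service might knowingly skip this preparation.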

  • There is also the option of not taking any measures, thinking that "temporary service outages are unavoidable" (accepting the "risk of outages if something happens" without taking any measures is also a business option).

Once you start worrying about safety and security, there's no end to it: "How far should we prepare?"

AWS recommends a multi-AZ configuration, but that doesn't mean you don't need to consider anything more.

This is because it is possible that all of the multiple AZs located in the Tokyo region (near Tokyo) could fail. Specifically, in the event of a major power outage across the entire Tokyo metropolitan area or a major disaster hitting the entire region, it is possible that not just individual data centers but all of the data centers near Tokyo could fail.

  • Using multiple AZs does not eliminate all risk

If we consider such a scenario, we will need to prepare for the situation where the region itself is destroyed. In other words, we will need a configuration that spans multiple regions (multi-region configuration). Specifically, we can consider a geographically distributed system configuration that uses both the "Tokyo Region" and the "Osaka Region."

  • If you're thinking about the possibility of an entire region being wiped out, you need a multi-region configuration.

However, going multi-region does not eliminate every problem either. It is not impossible for AWS as a whole to suffer a major outage, or for AWS to be forced to suspend service suddenly because of social or political circumstances. Risks like these can only be mitigated by spreading systems across other platforms as well, such as Google Cloud, Azure, or on-premises.

  • If you consider the risk of "AWS itself going down completely," you will need a "multi-cloud configuration."

When you think about "what ifs" like this, the number of risks that need to be prevented increases. And the more advanced the measures you take, the more difficult they become.
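
One way to see why each level exists is the textbook formula for redundant copies: if one copy is available a fraction a of the time and n copies fail independently, at least one is up 1 - (1 - a)^n of the time. The sketch below lists the levels discussed above and evaluates that formula. The 0.99 figure is made up, and the independence assumption is exactly what breaks when, say, one disaster hits every AZ near Tokyo at once, which is the reason for stepping up to the next level.

```python
# Each level of redundancy named in this column, with the failure it is meant to survive.
REDUNDANCY_LEVELS = [
    ("multiple VMs in one AZ", "a single virtual machine failing"),
    ("multi-AZ", "one data center (AZ) failing"),
    ("multi-region", "an entire region failing"),
    ("multi-cloud", "one cloud provider failing"),
]


def combined_availability(single: float, copies: int) -> float:
    """Idealized availability of `copies` redundant deployments.

    Assumes the copies fail independently, which real incidents often violate.
    """
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    for level, survives in REDUNDANCY_LEVELS:
        print(f"{level:<24} survives {survives}")
    # 0.99 is a made-up figure for one deployment, used only for illustration.
    for copies in (1, 2, 3):
        print(copies, f"{combined_availability(0.99, copies):.6f}")
```

The numbers improve quickly on paper, but they say nothing about the engineering effort each added level demands.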

The question is how much your company can handle on its own

When companies hear about these issues, they tend to make requests like, "We need a completely multi-cloud system!" However, the more difficult something is, the more technical skills and man-hours it requires, and in some cases, the difficulty level becomes so high that development cannot be completed (development failure is also a major IT risk). It is important to carefully consider how much effort is realistically necessary and possible for your company.

On the other hand, you may feel that your company lacks the technical capability to deal with the issues discussed so far, and that they have nothing to do with you. But if something did happen and your IT systems stopped or your data was lost, and the company itself went under as a result, that would be a real problem. It is worth thinking about how much your company could actually withstand if an accident occurred.

How to protect your company's IT and data

A practical point to consider is that your company's IT is not a single system but a combination of multiple IT systems and cloud services. Not every system needs the same level of protection: some can be allowed to go down for a day a year, and some can be covered temporarily by manual clerical work.

Seen that way, the "areas that really need to be protected," where rigorous preparation is required, will probably turn out to be only part of the whole, which makes it easier to plan realistically.

Furthermore, there may be cases where "losing business data itself" is unacceptable but "a temporary system shutdown" is acceptable. For example, if Tokyo were completely devastated, would your company's systems really need to keep operating as normal? When you think it through, you may find that they would not. In such cases a realistic response may be enough, such as "we will take advanced measures to prevent data loss, but accept that the system may shut down temporarily."
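
That sorting exercise can be written down as a small table of systems and tolerances. The sketch below is only an illustration of the idea: the system names, the 24-hour threshold, and the wording of the suggestions are all assumptions made up for this column, not recommendations from AWS or anyone else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRequirement:
    name: str
    max_downtime_hours: float   # how long the business can tolerate this system being down
    data_loss_acceptable: bool  # whether losing its data would be survivable


def suggest_protection(req: SystemRequirement) -> str:
    """Map one system's tolerance to a rough protection level (illustrative threshold)."""
    if req.max_downtime_hours < 24:
        availability = "multi-AZ"
    else:
        availability = "single AZ, accept occasional downtime"
    backup = "local backups" if req.data_loss_acceptable else "off-region backups"
    return f"{availability} + {backup}"


if __name__ == "__main__":
    systems = [
        SystemRequirement("order processing", 1, False),
        SystemRequirement("accounting ledger", 48, False),
        SystemRequirement("internal wiki", 48, True),
    ]
    for system in systems:
        print(f"{system.name}: {suggest_protection(system)}")
```

Treating downtime tolerance and data-loss tolerance as two separate questions is what lets a system like the ledger above stop for a while yet still have its data protected.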

"Unbreakable IT" in combination with mainframes

Having read this far, some of you may feel uncomfortable with the very idea of "assuming that things will break." Some of you may feel that in IT, too, the right approach is "we use highly reliable parts and apply thorough quality control, so things hardly ever break."

Though AWS-style thinking is mainstream in today's IT (and related businesses), it is also a "concept for the modern era" based on the awareness that there are other important things besides the thorough pursuit of quality and safety (such as rapid service deployment and cost of provision). Although it may be modern, it does not mean that "this concept is appropriate in all situations."

In IT, of course, there are situations where "it must not break down, no matter what," "safety and security come first," and "stable operation at all costs" really are the pressing requirements. This applies to IT systems that affect human lives and to "IT systems whose failure could have an irreparable impact on society."

There are indeed IT systems that are built with safety and security as the top priority. For example, "mainframes." Mainframes are built with a great emphasis on stable operation, and are built on a different level from the hardware we normally use. Not only the OS, but even the CPU is uniquely designed to achieve high reliability, and a high level of consideration is given to everything.

Mainframes are now often dismissed as "old-fashioned IT," labeled "legacy IT," and treated as something to be phased out. But in situations where "stable operation is of the utmost importance," mainframes remain a very dependable option.

  • Our file integration middleware, HULFT, has likewise earned support as a product that delivers "overwhelmingly high quality" at a mainframe level. It thoroughly pursues "peace of mind because it never breaks down in the first place," so if your IT system needs "overwhelming safety and security," be sure to consider it.

▼ Want to know more about HULFT Products?
HULFT 10 | Product Introduction Page

"Legacy integration" with mainframes brings out the best of both worlds

However, it is also true that it will be difficult to create all the functions of the "modern IT system needed in the 2020s" using only a mainframe. For example, if you want to make a system compatible with smartphones, it will often be difficult to tackle development using only a mainframe.

Therefore, the core parts of business that cannot be stopped can be implemented on a mainframe, and the "modern parts" such as smartphone compatibility can be implemented using new IT such as the cloud, and by using technology to "connect" the two, it is possible to combine the best of both the old and new IT (legacy integration).

There are many different ways of thinking about safety and security

So, we talked about a variety of topics, from the major cloud outage that had a widespread impact in Japan to "ensuring stable operation of IT systems."

As I wrote at the beginning, there is a general feeling that "you should just use the cloud," but in reality, cloud services sometimes experience problems. Furthermore, the service itself may be suspended or terminated. I think it's important to carefully consider whether this is okay (or whether some kind of countermeasure is necessary).

The "quality" and "safety and security" of IT systems are often discussed only in vague terms, yet the cloud and the mainframe, for example, take approaches in which "the way of thinking itself is fundamentally different." Even within the cloud approach, there are many options for "how far you should anticipate and prepare." Rather than talking vaguely about safety and security, your company needs to decide which approach to adopt and to what extent.

How to freely combine and utilize cloud services

In many cases, your company's IT will consist of multiple IT systems (such as systems implemented by each department) and cloud services. In that case, you will need to take appropriate measures for each part (extracting important parts into separate systems if necessary), effectively combine systems with different levels of security, and link them together to function as a whole.

For example, for functions that absolutely require stable operation, the mainframe can continue to be used, and parts that need to be developed on the cloud can be integrated (legacy integration) and used in combination. For systems running on the cloud, each can be handled appropriately, taking into account the risks of downtime. In this way, if multiple IT systems and clouds can be freely integrated and "connected," initiatives should proceed smoothly.

This is where our company's "HULFT Square" and "DataSpider" are worth keeping in mind. Using only a GUI, you can freely link a wide variety of IT systems and clouds, and "connect" with many cloud services as well.

"Connecting" technology can also be extremely useful when you need advanced measures for your data alone. For example, you can read data from a system running in the AWS Tokyo region, back it up to the geographically separate Osaka region, and also back it up to Google Cloud for multi-cloud protection, all with just a GUI.

▼ Want to know more about HULFT Square services?
HULFT Square │Service introduction page

▼ Want to know more about DataSpider Servista products?
DataSpider Servista │Product introduction page

"Resilience over strength" - a mindset for the cloud era

The idea of "breaking but recovering" is often adopted in cloud computing, not just AWS, and I think this can be used as a reference for many other things as a "new way of thinking in the cloud era."

Joichi Ito, former director of the MIT Media Lab, calls this new way of thinking "Resilience over strength," and says that it is a generally important way of thinking (one of the nine basic principles of thinking) in today's world, where change and uncertainty are increasing.

As I wrote in the discussion of mainframes, being a new-era way of thinking does not make it a cure-all for every situation (strength still matters). Still, we in Japan tend to rely on individual strengths such as a sense of responsibility and the ability to see tasks through, so I think it is important to recognize that a different approach exists at the level of thinking itself, and that cloud services are built on it.

Furthermore, compared with the strength to endure external forces unaffected, a mindset of responding quickly to whatever happens and recovering each time can be expected to improve the "ability to adapt to unexpected change." It is often said that "we are entering an era of change and uncertainty." Together with the fact that "strength" and "resilience" can be combined, this should offer hints for thinking about our own future as we face what cannot be predicted.

The person who wrote the article

Affiliation: Marketing Department, Digital Marketing Division

Ryo Watanabe

・2017: Transferred from Appresso Co., Ltd.
After majoring in information engineering (artificial intelligence lab) at university, I struggled in the development department of a startup.
・Small and medium-sized enterprise management consultant (as of 2024)
・Image: I took over the "Fukusuke" name that was previously used by our company.
(Affiliations are as of the time of publication)
