The Day of the Blue Screen of Death: The CrowdStrike Outage

Written by Leila Scola | Aug 9, 2024 4:49:09 PM

What could link flights being canceled, salaries not being paid, and surgeries being delayed? Only the CrowdStrike outage.

Airports, hospitals, TV stations, banks, shops, and even Formula One teams were affected by the massive IT outage that hit Microsoft Windows systems after a faulty CrowdStrike update.

We’re going to explore what happened, its worldwide impact, and, most importantly, how to prevent a similar issue from happening again.

The Facts

Millions of workers across different industries around the world logged on to their computers on Friday, July 19th, and found the Blue Screen of Death (BSOD). Confusion reigned for a few hours until it became clear that this wasn’t an isolated problem: Windows machines at organizations everywhere were crashing. The culprit turned out to be a critical flaw in a single Falcon Content Update released by the cybersecurity company CrowdStrike.

Windows computers running CrowdStrike’s Falcon software that received the update went down, an estimated 8.5 million Windows machines in total. In its full Root Cause Analysis, CrowdStrike doesn’t fully explain how the issue slipped past its automated quality checks.

CrowdStrike CEO George Kurtz was quick to reassure people that this was not a security incident but a purely technical one. The disruption was still severe: for a few hours, 911 lines in several states went down, flights were grounded, some stock exchange services were disrupted, and businesses had to stop work.

Even though CrowdStrike reverted the faulty update within hours on the same day, every machine that had already crashed still had to be fixed, often manually, so it took days for companies to return to normal. Fortune 500 companies alone are estimated to have suffered more than $5 billion in direct losses, while CrowdStrike’s share price dropped almost 30%, wiping billions of dollars off the company’s market value.

Long-Term Fix: StratusGrid’s Perspective

Our CEO Chris Hurst sat down with the marketing team to give us his two cents on the CrowdStrike-Windows outage.

CrowdStrike’s software runs in the most privileged part of Windows, the kernel, where software needs certification before it is permitted to be installed. Because it sits so close to the brain of the system, its crash took down the whole machine, which is why the faulty update affected such a huge number of computers.

Chris went on to explain that the outage hit so many businesses because many organizations still run older operating models. This is especially true of enterprises that rely on virtual machines (VMs), where many different (often monolithic) applications frequently run on a single machine.

Running many applications side by side on a single VM increases the risk of one application affecting the others. This is a common deployment model for legacy Windows Server-based applications, but it is an outdated system design, and businesses should be considering how to rebuild on a more modern, decentralized architecture.

“In a cloud-native system, all your operating system functions would be baked into such small packages that you don’t have a broad attack surface. You don’t have a broad set of capabilities that can be exploited and need constant monitoring, like CrowdStrike. You’d never have a situation where your system couldn’t boot and not be able to start the next file. It’s just not possible.”

If we consider cloud-native businesses that use containers, this incident couldn’t happen. Any change to a container means rebuilding its image and redeploying it, so if a new version turns out to be faulty, you can recover simply by redeploying the previous one. The CrowdStrike issue, by contrast, was complex to fix.
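To make the rollback idea concrete, here is a minimal Python sketch. It is not tied to Amazon ECS, Kubernetes, or any real container registry API; the `Deployment` class and the image tags are hypothetical. It only models the principle described above: each release is a new, unchanging image, and recovery means going back to the previous one.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical record of the container image tags a service has run."""
    service: str
    history: list[str] = field(default_factory=list)  # image tags, oldest first

    @property
    def current(self) -> str:
        # The image currently running is the most recent entry.
        return self.history[-1]

    def deploy(self, image_tag: str) -> None:
        # A change ships as a brand-new image; earlier images are left untouched.
        self.history.append(image_tag)

    def rollback(self) -> str:
        # Recovery is simply redeploying the previous known image.
        if len(self.history) < 2:
            raise RuntimeError("no earlier image to roll back to")
        self.history.pop()
        return self.current


svc = Deployment("orders-api", ["orders-api:1.4.0"])
svc.deploy("orders-api:1.5.0")  # suppose this new image turns out to be faulty
print(svc.rollback())           # orders-api:1.4.0
```

Because the previous image was never modified, rolling back does not depend on logging in to each affected machine, which is the contrast with the manual, machine-by-machine CrowdStrike recovery.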

It can be difficult for businesses to move from the Windows servers they know so well to a new cloud system. Re-developing and testing applications to run in containers is time-consuming, and it requires knowing how to adapt your applications to new technologies.

If you were affected by the CrowdStrike outage, now is the time to reconsider the systems you use. By working with a cloud consulting company, you benefit from its expertise in moving to a more secure, scalable system architecture. Learn more about StratusGrid and how we can help you analyze your current infrastructure and migrate to AWS efficiently.

Our recommendations are:

Adjust your development framework to a non-monolithic model
Your framework should leverage autoscaling so that you benefit from elasticity in the cloud
Plan your roadmap before you start migrating your technology
Choose containerized technology that is simple to adopt, use, and monitor, such as Amazon ECS on AWS Fargate
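The autoscaling recommendation comes down to a simple control rule: add tasks when load is above a target and remove them when it is below. The Python sketch that follows shows a simplified proportional version of that rule, desired = ceil(current * metric / target), which is the formula the Kubernetes Horizontal Pod Autoscaler documents. Treat it as an illustrative assumption: Amazon ECS Service Auto Scaling with target tracking pursues the same goal, but AWS manages its exact algorithm, and the function name and parameters here are hypothetical.

```python
import math


def desired_task_count(current_tasks: int, current_cpu_pct: float,
                       target_cpu_pct: float, min_tasks: int, max_tasks: int) -> int:
    """Hypothetical proportional target-tracking rule.

    Scales the task count so that average CPU moves toward the target,
    then clamps the result to the configured minimum and maximum.
    """
    if current_tasks < 1 or target_cpu_pct <= 0:
        raise ValueError("need at least one running task and a positive target")
    # Proportional step: if CPU is 1.5x the target, ask for ~1.5x the tasks.
    raw = math.ceil(current_tasks * current_cpu_pct / target_cpu_pct)
    # Never scale below the floor or above the ceiling you configured.
    return max(min_tasks, min(max_tasks, raw))
```

For example, `desired_task_count(4, 90.0, 60.0, 2, 10)` returns 6: four tasks averaging 90% CPU against a 60% target need about 50% more capacity. The minimum and maximum bounds matter as much as the formula, since they stop a misleading metric from scaling a service to nothing or without limit.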

When you work with StratusGrid, we get to know your business goals and build a full framework and plan based on your objectives. We also work closely with you to ensure that your migration goes smoothly and doesn’t impact your business. Request a cloud maturity assessment today.

StratusGrid Can Help

If you’ve been affected by the outage, or are concerned about your current operating systems and infrastructure, reach out to us. StratusGrid is an AWS Premier Partner with over a decade of experience helping businesses modernize their systems. We can provide a free consultation to analyze your current technology and provide a roadmap for improving it. Book a call with us today.
