Looking beyond the infamous file 291: Drawing the right lessons from global tech outage

Cybersecurity firm CrowdStrike’s faulty update has brought into sharp relief the fragility of the global tech ecosystem.

On 19 July, millions of computer systems worldwide began crashing and displaying the blue screen of death. The outage did not discriminate among user types: from airlines to hospitals, from tech companies to broadcasters, and from banks to retail outlets all were impacted. While the chaotic dust has settled down and most systems have returned to normalcy, this tech outage will fade away in comparison to the ones in future if the right lessons are not learnt.

Anatomy of a tech outage
According to Microsoft’s estimate, 8.5 million or less than 1 percent of Windows devices were affected in the tech outage. The outage itself was caused by a faulty update by CrowdStrike, a cybersecurity firm based out of Texas, United States. More specifically, Windows devices hosting the CrowdStrike’s cybersecurity product ‘Falcon’ sensor that were online between 04:09 UTC and 05:27 UTC (about 78 minutes). Mac and Linux devices were not affected, and the impact was generally more pronounced for enterprise users as compared to individual users. According to an estimate by Parametrix, an insurance company, Fortune 500 companies lost about 5.4 billion dollars due to the global outage. Delta airlines alone lost 500 million according to its CEO Ed Bastian.

This was not a cyberattack or the doing of a malicious actor. What’s being touted as the biggest tech outage in history was caused by a mere 40 KB file (the now infamous channel file 291) that was pushed as an update by CrowdStrike for its Falcon sensor installed on Windows systems. For some systems, iterative rebooting solved the problem; while for some, it took days of manual and intense technical effort to get systems to work normally. Highlighting the deeply interconnected nature of the global tech ecosystem, the outage not just affected the systems receiving the faulty update but also the systems that directly or indirectly relied on the crashed systems. In addition to Microsoft’s cloud service Azure, Amazon Web Services and Google Cloud Platform were impacted as well. It is the cascading impact of a faulty 40 KB file that exposed the fragility of the global tech ecosystem.

Framing the lessons
How could a routine update be released by a reputed cybersecurity firm without fault-testing? It turns out that the update was tested for fault by CrowdStrike’s content validator before being released, but the latter had a bug that allowed the faulty update to pass through. In its Preliminary Post Incident Review, CrowdStrike has resolved to strengthen the testing of updates. That is the first lesson CrowdStrike, cybersecurity firms and in general software providers need to learn: thoroughly test everything before releasing for wider use. Staggered deployment with a limited set of users first toying with newly released updates before wider release is something that should be standardized. It goes without saying that even the testing systems should be tested themselves before deployment.

The second lesson is for Microsoft. The reason CrowdStrike’s faulty update was able to crash Windows systems is because the CrowdStrike driver had kernel level access (that is, the highest-level access to a system). The faulty update caused the CrowdStrike’s driver to fail, and the Windows operating system would not boot with a failed kernel driver — thus causing the blue screen of death. Notwithstanding the anticompetition concerns, Microsoft and other operating system providers need to rethink the level of access they provide to third-party software. It is not abnormal that many initially thought that this was a Microsoft outage.

The third lesson is for the global tech ecosystem. The tech ecosystem doesn’t just include tech companies broadly understood, but also every establishment having a tech element however small. The tech ecosystem is heavily interconnected and interdependent, with numerous single points of failure capable of crashing and severely impacting — in a cascading manner — a major chunk of the actors that make up this ecosystem. All establishments transacting in cyberspace need to diversify their vendors, including cybersecurity providers, invest in both redundancy and resiliency, and empower in-house IT teams rather than over relying on third-party providers. This tech outage has also weakened the arguments of naysayers of the Y2K (year 2000), who decried the problem as a hoax. If history is any indicator, the global tech community should take the Y2K38 (year 2038) and every embedded problem (with potential higher-order impacts) seriously.

The fourth lesson is for India. Barring some sectors like airlines, India did not witness major disruption. The country’s stock exchanges and banks were only minimally impacted. However, with India accelerating its digital push — from governance to finance — more and more systems are getting deeply embedded in the global cyber constellation. India’s National Cybersecurity Strategy, in the works for years now, should factor in the need for building resilience of the country’s core sectors, particularly transportation, medical, finance and trading, military and strategic establishments.

Author

Next
Next

European Union’s Carbon Border Adjustment Policy and Its Ramifications for Trade Dynamics of Developing Countries