Boztek

Why system resilience should mainly be the job of the OS, not just third-party applications

The recent US congressional hearing on the CrowdStrike incident highlighted the potential role of automated recovery systems in preventing future cybersecurity crises. A question raised during the hearing was whether responsibility for such recovery should lie with third-party software vendors or with the operating system (OS) itself, which goes to the core issue of ecosystem resilience.

A central example discussed is the catastrophic boot error, commonly seen as a blue screen of death (BSOD), which can occur when software required at startup fails to load. The CrowdStrike incident involved a corrupted update that led to significant global IT disruptions, illustrating the impact of failures in software running in ‘kernel mode,’ the privileged level at which code has low-level access to the system. Such failures can trap devices in a BSOD loop that requires expert intervention to resolve.

To illustrate the point, the article uses the analogy of a car engine and its spark plugs: just as a mechanic replaces faulty spark plugs to restore the engine, there should be a systematic way to restore software to a working state after a failure. The debate hinges on whether each third-party vendor should provide its own recovery mechanism or whether the OS should manage the reversion to a previous stable state.

The article argues for OS-managed recovery as a potentially more efficient solution than relying on individual vendors to create bespoke recovery functionalities. This would entail an OS registering updates from third-party software and retaining a backup of the previous working state, allowing automatic recovery options if a boot failure arises. In essence, this approach would streamline the recovery process by consolidating responsibility within the OS architecture.
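The register-and-roll-back idea described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real OS implements recovery: the names RecoveryRegistry, register_update, roll_back_latest and boot are all hypothetical, and the loads_ok callback stands in for whatever check a real boot sequence would perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Component:
    """A third-party component loaded at boot (hypothetical model)."""
    name: str
    current: str                    # version that will be loaded at next boot
    previous: Optional[str] = None  # backup of the last version in use


@dataclass
class RecoveryRegistry:
    """Hypothetical OS-side record of third-party updates."""
    components: Dict[str, Component] = field(default_factory=dict)
    update_order: List[str] = field(default_factory=list)  # newest last

    def register_update(self, name: str, new_version: str) -> None:
        """Record an update and keep the version it replaces as a backup."""
        comp = self.components.get(name)
        if comp is None:
            self.components[name] = Component(name, new_version)
        else:
            comp.previous = comp.current
            comp.current = new_version
        self.update_order.append(name)

    def roll_back_latest(self) -> Optional[str]:
        """Revert the most recently updated component that has a backup."""
        while self.update_order:
            name = self.update_order.pop()
            comp = self.components[name]
            if comp.previous is not None:
                comp.current, comp.previous = comp.previous, None
                return name
        return None

    def boot(self, loads_ok: Callable[[str, str], bool],
             max_attempts: int = 3) -> bool:
        """Try to boot; after each failure, undo one update and retry."""
        for _ in range(max_attempts):
            if all(loads_ok(c.name, c.current)
                   for c in self.components.values()):
                self.update_order.clear()  # current state is confirmed good
                return True
            if self.roll_back_latest() is None:
                return False  # nothing left to undo
        return False


# Usage: version "1.1" of a made-up driver fails to load at boot.
registry = RecoveryRegistry()
registry.register_update("sensor_driver", "1.0")
registry.boot(lambda name, version: True)
registry.register_update("sensor_driver", "1.1")
booted = registry.boot(lambda name, version: version != "1.1")
restored_version = registry.components["sensor_driver"].current
```

In this sketch the second boot fails once, the registry reverts sensor_driver to the retained version, and the retry succeeds without any vendor-specific recovery code.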

A precedent for such OS-managed recovery exists with display drivers, which automatically revert to a default state upon a failure. However, the author notes that cybersecurity software poses a harder problem because it has no equivalent standard baseline to fall back on. Nevertheless, retaining the previous functional state after each update could serve as a viable model for recovery mechanisms.

By embedding a recovery option that all third-party software vendors must adhere to, the OS could create a more resilient ecosystem capable of addressing failures autonomously. This collaborative framework between OS developers and third-party vendors would strengthen system integrity while deterring exploitation by malicious actors.
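One way to picture a recovery option that every vendor must adhere to is as an interface the OS checks before it accepts a component. The sketch below is an assumption made for illustration only: RecoverableComponent, OSLoader, apply_update and ContentFile are invented names, and the self-test rule is a placeholder for a real validity check.

```python
from abc import ABC, abstractmethod
from typing import List


class RecoverableComponent(ABC):
    """Hypothetical contract a vendor component must implement."""

    @abstractmethod
    def snapshot(self) -> bytes:
        """Return the current state so the OS can keep a backup."""

    @abstractmethod
    def restore(self, state: bytes) -> None:
        """Replace the current state with the given one."""

    @abstractmethod
    def self_test(self) -> bool:
        """Report whether the component works in its current state."""


class OSLoader:
    """Hypothetical OS loader that only admits conforming components."""

    def __init__(self) -> None:
        self.accepted: List[RecoverableComponent] = []

    def admit(self, component: object) -> bool:
        if not isinstance(component, RecoverableComponent):
            return False  # no recovery contract, so not loaded
        self.accepted.append(component)
        return True


def apply_update(component: RecoverableComponent, new_state: bytes) -> bool:
    """Install new_state; revert to the backup if the self-test fails."""
    backup = component.snapshot()
    component.restore(new_state)
    if component.self_test():
        return True
    component.restore(backup)
    return False


class ContentFile(RecoverableComponent):
    """Made-up vendor component whose state is a single blob."""

    def __init__(self, state: bytes) -> None:
        self.state = state

    def snapshot(self) -> bytes:
        return self.state

    def restore(self, state: bytes) -> None:
        self.state = state

    def self_test(self) -> bool:
        return self.state.startswith(b"OK")  # placeholder validity rule


loader = OSLoader()
comp = ContentFile(b"OK v1")
admitted = loader.admit(comp)
rejected = loader.admit(object())
bad_update_applied = apply_update(comp, b"corrupt")
```

Here the vendor supplies only the three contract methods, while the decision to revert stays with the OS-side apply_update function, which mirrors the consolidation of responsibility the article proposes.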

While the implementation of such a system would require substantial development efforts and regulatory frameworks, the proposed benefits in terms of resilience and systemic reliability are significant. Ultimately, this approach could mitigate the risks of widespread outages similar to those triggered by the CrowdStrike incident, emphasizing a shift towards a proactive recovery strategy in software management.