The High Cost of Ignoring Software Deployment Risks
This week, several long-accepted risks in software development and deployment materialised on a global scale. A faulty update from CrowdStrike brought the #BSOD front and centre, prompting media claims worldwide of an 'internet outage'.
This resonates with points I recently discussed in response to "IT Underperforms Again, Again" by Prof. Bent Flyvbjerg, where I noted: "...risk management is often ineffective, with senior stakeholders willing to embrace substantial risks. The real challenge arises when they are reluctant to accept the consequences if these risks materialise..."
In the development and release chain, numerous risks are accepted on the assumption that they are unlikely to occur. We understand that monitoring and security products operate at critical system levels to prioritise access and protect the system. However, errors at this level can cause severe, unrecoverable failures—yet antivirus and security updates are often treated as low-risk deployments when they are better characterised as low-probability, high-impact.
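To make that distinction concrete, a rough expected-loss comparison shows why 'low probability' does not mean 'low risk'. The figures below are illustrative assumptions, not CrowdStrike data:

```python
# Illustrative expected-loss comparison; all figures are assumptions,
# not CrowdStrike data. Risk is commonly scored as probability x impact,
# so a rare but catastrophic failure can outweigh a frequent, minor one.

risks = {
    "routine app bug":    {"probability": 0.05,  "impact_usd": 50_000},
    "kernel-level fault": {"probability": 0.001, "impact_usd": 1_000_000_000},
}

for name, r in risks.items():
    expected_loss = r["probability"] * r["impact_usd"]
    print(f"{name}: expected loss = ${expected_loss:,.0f}")

# routine app bug: expected loss = $2,500
# kernel-level fault: expected loss = $1,000,000
```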
For such a widespread problem to have occurred, there must have been systemic failures in multiple places.
So how do we mitigate these risks? Vendors often use an early-release, 'eat our own dog food' approach, deploying software to their own teams ahead of the public release. It is better to cause an outage of your internal systems than to wipe $16B off your market cap.
This approach really should be applied throughout the entire development journey, from beta to general availability. Microsoft's ring deployment methodology formalises such a phased rollout and is used across the Windows and Office ecosystems, allowing users to choose their release track. If a similar methodology had been implemented here, the scale of the problem could have been significantly reduced, likely affecting only non-critical systems.
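As a rough sketch of how ring-based promotion works—the ring names, fleet shares, and bake times below are illustrative assumptions, not Microsoft's or CrowdStrike's actual configuration:

```python
from dataclasses import dataclass

# A minimal sketch of ring-based deployment. Ring names, population
# shares, and bake times are illustrative assumptions.

@dataclass
class Ring:
    name: str
    share_of_fleet: float   # fraction of machines in this ring
    bake_time_hours: int    # minimum soak time before promoting further

ROLLOUT_RINGS = [
    Ring("ring0-internal-dogfood",  0.001, 24),  # vendor's own machines first
    Ring("ring1-early-adopters",    0.01,  48),  # opted-in customers
    Ring("ring2-broad-noncritical", 0.25,  72),  # general fleet, non-critical
    Ring("ring3-critical-systems",  1.00,   0),  # everything else, last
]

def deploy(update_id: str, ring: Ring) -> None:
    # Placeholder for the real push mechanism.
    print(f"Deploying {update_id} to {ring.name} "
          f"({ring.share_of_fleet:.1%} of fleet)")

def telemetry_healthy(ring: Ring) -> bool:
    # Placeholder: in practice, check crash rates, boot loops, etc.
    return True

def roll_out(update_id: str) -> None:
    """Advance ring by ring, stopping at the first sign of trouble."""
    for ring in ROLLOUT_RINGS:
        deploy(update_id, ring)
        # A real system would sleep/poll for ring.bake_time_hours here.
        if not telemetry_healthy(ring):
            print(f"Halting rollout of {update_id} at {ring.name}")
            return

roll_out("example-update-001")
```

The key property is that the blast radius grows only after each earlier, smaller ring has soaked without incident—an internal-only failure is an embarrassment; a ring3 failure is a headline.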
Some issues are excusable, for instance where complex race conditions make problems hard to pinpoint. But when a significant number of Windows machines suffer critical failures from a single software update, it is a clear indication of poor risk management.
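A common safeguard here is an automated circuit breaker that halts a staged rollout once crash telemetry crosses a threshold. A minimal sketch, with the threshold chosen purely for illustration:

```python
# A minimal sketch of a crash-rate circuit breaker for staged rollouts.
# The threshold is an illustrative assumption, not any vendor's value.

CRASH_RATE_THRESHOLD = 0.002   # halt if >0.2% of updated machines crash

def should_halt_rollout(machines_updated: int, crash_reports: int) -> bool:
    """Return True if the observed crash rate exceeds the threshold."""
    if machines_updated == 0:
        return False
    return (crash_reports / machines_updated) > CRASH_RATE_THRESHOLD

# Example: 50 crashes out of 10,000 updated machines -> 0.5% -> halt.
print(should_halt_rollout(10_000, 50))   # True
```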
A couple of thoughts: having implemented various ERP systems, including system and IT upgrades, I learnt early in my career why there is a test system, a pre-production system (an exact copy of the production environment) and the production system itself. You do not want bugs entering your production environment. Rigorous testing of both the bright side (are things going the way they should?) and the dark side (what might go wrong?) has to be done in both the test and the pre-production system, including consideration of all knock-on risks (dominoes falling), before even thinking of entering the production environment.

If you are still uncertain, use a pilot system, where any impact can be ring-fenced. Does this give absolute certainty? No, it doesn't, but it clearly increases your chances of success. Planning and testing is EVERYTHING.

Finally, it may be that protocols were followed, but that those protocols were no longer sufficient or did not cover the current state of the system environment. Too many times I have seen outdated protocols applied. Risk management is a continuous process and one should never feel safe. Some positive paranoia and scepticism is always helpful.
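In CI/CD terms, that discipline maps to explicit promotion gates. A minimal sketch—the environment names and check functions are assumptions for illustration, not any specific vendor's pipeline:

```python
# A minimal sketch of staged promotion gates: a build only reaches
# production after passing checks in test and pre-production.
# Environment names and check functions are illustrative assumptions.

def run_functional_tests(env: str, build: str) -> bool:
    """'Bright side' checks: does the build do what it should?"""
    print(f"[{env}] functional tests for {build}")
    return True  # placeholder result

def run_failure_injection(env: str, build: str) -> bool:
    """'Dark side' checks: what happens when things go wrong?"""
    print(f"[{env}] failure-injection tests for {build}")
    return True  # placeholder result

def promote(build: str) -> None:
    for env in ("test", "pre-production"):
        if not (run_functional_tests(env, build)
                and run_failure_injection(env, build)):
            raise RuntimeError(f"{build} failed gates in {env}; stopping")
    print(f"{build} cleared all gates -> deploy to production")

promote("build-1234")
```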
Yes and no 😉 See here: https://www.garudax.id/posts/flyvbjerg_software-development-activity-7220379023149838337-pdIw?utm_source=share&utm_medium=member_ios