The biggest businesses can get major programmes horribly wrong. Here are four famous examples, the fundamental reasons for failure, and how each might have been avoided.

Hershey: Sought to replace its legacy IT systems with a more powerful ERP system. A rushed timeline and inadequate testing led to severe implementation issues: orders worth over $100 million went unfulfilled, quarterly revenue fell by 19%, and the share price dropped 8%.
Key Failures:
❌ Rushed implementation without sufficient testing
❌ Lack of clear goals for the transition
❌ Inadequate attention and resource allocation

Hewlett-Packard: Wanted to consolidate its IT systems into a single ERP. The company planned to migrate to SAP, expecting any issues to be resolved within three weeks. However, configuration gaps between the new ERP and the legacy systems left 20% of customer orders unfulfilled. Insufficient investment in change management and the absence of manual workarounds compounded the problems. The project cost HP an estimated $160 million in lost revenue and delayed orders.
Key Failures:
❌ Failure to address potential migration complications
❌ Lack of interim solutions and supply chain management strategies
❌ Inadequate change management planning

MillerCoors: Spent almost $100 million on an ERP implementation intended to streamline procurement, accounting, and supply chain operations. Significant delays led to the termination of the implementation partner and subsequent legal action. Mistakes included insufficient research on ERP options, choosing an inexperienced implementation partner, and the absence of capable in-house advisers overseeing the project.
Key Failures:
❌ Inadequate research and evaluation of ERP options
❌ Selection of an inexperienced implementation partner
❌ Lack of in-house expertise and oversight

Revlon: Another ERP implementation disaster. Inadequate planning and testing disrupted production and delayed customer orders across 22 countries. The consequences included more than $64 million in unshipped orders, a 6.9% drop in share price, and investor lawsuits for financial damages.
Key Failures:
❌ Insufficient planning and testing of the ERP system
❌ Lack of robust backup solutions
❌ Absence of a comprehensive change management strategy

Lessons to be learned:
✅ Thoroughly test and evaluate new software before deployment.
✅ Establish robust backup solutions to address unforeseen challenges.
✅ Design and implement a comprehensive change management strategy during the transition to new tools and solutions.
✅ Ensure sufficient in-house expertise is available, and consider those people's capacity as well as their expertise.
✅ Plan as much as is practical and sensible.
✅ Don't try to do too much, too quickly, with too few people.
✅ Don't expect ERP implementation to be straightforward; it rarely is.
Common Causes of Major Software Update Failures
Explore top LinkedIn content from expert professionals.
Summary
Major software update failures happen when new versions of programs or system drivers cause widespread problems, often resulting in downtime, data loss, or system crashes. These failures typically stem from preventable issues like insufficient testing, rushed deployments, or overlooked design flaws that can impact businesses and users globally.
- Test thoroughly: Always check new software updates across a variety of real-world scenarios and system configurations to catch hidden bugs before widespread release.
- Plan staged rollouts: Deploy updates to a small group of devices first, monitor for issues, and gradually expand to minimize the impact of mistakes.
- Design for resilience: Build systems that allow for easy rollback of updates and anticipate user or hardware errors to prevent catastrophic failures.
In my time as CIO at Microsoft, VMware, and the Federal Government, the top causes of downtime rarely had anything to do with cybersecurity. We were pretty good at defending against attacks. The real headaches usually came from three internal sources:
- legitimate users making mistakes
- latent, cancer-like software flaws such as memory leaks, or infrastructure monitoring failures
- edge cases our testing missed

If you want to build true resilience, I learned that it's best to stop assuming that users are perfect and to start designing for human error. Here are my favorite ways to catch these issues before they take down production:

1. Add friction to the "fat finger" moments.
The most common cause of downtime in my experience is a legitimate user trying to do a legitimate task but making a mistake, like confusing the test environment for production. Don't just use a generic "Are you sure?" pop-up. Make the UI graphically show exactly what is being changed, e.g., "You are deleting X Profile." For critical changes, require a two-person sign-off and build in a feedback delay. The system should tell you, "This change will execute in 30 seconds," giving the human brain a final moment to catch a potential error.

2. Monitor for leakage and performance drift over time, not just the big events.
The second biggest cause of downtime in my experience is the slow burn: a small memory leak or creeping disk usage that runs fine for months until it hits a tipping point and causes a cascading failure. Set your tools to watch the slope of consumption. Are we using 1% more memory or CPU every week? If you aren't watching the trend lines, you won't see the cliff until you fall off it. (A simple sketch of this appears after this post.)

3. Do the "Tony Test."
My dev teams would often tell me, "We've tested everything, we're ready to deploy." Then I would walk over, sit on the keyboard, and press all the keys at once. Usually, the system would crash. Developers would argue, "Nobody does that." But they do. Keys stick. People spill coffee. Books fall on desks. Expand your testing boundaries to include physical accidents and illogical inputs. If your system can't handle a stuck key, it's not ready for the real world. You have to design for the things you didn't think to test, because eventually someone will spill coffee on the keyboard.
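The "watch the slope of consumption" advice in point 2 is straightforward to automate. Below is a minimal C++ sketch; the sample data, thresholds, and function names are illustrative assumptions rather than any specific monitoring product. It fits a least-squares trend line to weekly memory readings and estimates how many weeks remain before a ceiling is hit.

```cpp
// Minimal trend-line ("slope of consumption") monitor.
// All data, thresholds, and names are illustrative assumptions.
#include <cstddef>
#include <cstdio>
#include <vector>

// Least-squares slope of equally spaced samples: change in value per interval.
double slopePerInterval(const std::vector<double>& y) {
    const double n = static_cast<double>(y.size());
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        sx += i;
        sy += y[i];
        sxy += i * y[i];
        sxx += static_cast<double>(i) * i;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

int main() {
    // Weekly memory usage as a percentage of capacity (made-up numbers).
    std::vector<double> weeklyMemPct = {61.0, 62.1, 62.9, 64.0, 65.2, 66.3};
    const double ceilingPct = 90.0;        // point at which the service degrades
    const double alertHorizonWeeks = 12;   // warn if the cliff is this close

    double growthPerWeek = slopePerInterval(weeklyMemPct);
    if (growthPerWeek <= 0) {
        std::puts("No upward trend; nothing to report.");
        return 0;
    }
    double weeksToCeiling = (ceilingPct - weeklyMemPct.back()) / growthPerWeek;
    std::printf("Memory growing %.2f%% per week; ~%.1f weeks until %.0f%%.\n",
                growthPerWeek, weeksToCeiling, ceilingPct);
    if (weeksToCeiling < alertHorizonWeeks)
        std::puts("ALERT: investigate the leak before it becomes an outage.");
    return 0;
}
```

In practice the same calculation would run inside whatever monitoring tool you already use; the point is to alert on the trend, not only on the current value.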
-
How did a few bits in the wrong place crash millions of computers and bring much of the world to a standstill?

𝗧𝗵𝗲 𝗜𝗻𝗰𝗶𝗱𝗲𝗻𝘁
Recently, a serious issue emerged with CrowdStrike's software on Windows systems, causing widespread system crashes (Blue Screens of Death). This incident highlighted the critical nature of system driver stability and the potential far-reaching consequences of software errors at this level.

𝗥𝗼𝗼𝘁 𝗖𝗮𝘂𝘀𝗲 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀
1. Null Pointer Dereference: The core of the problem was a null pointer dereference, a common issue in memory-unsafe languages like C++.
2. Memory Access Violation: The software attempted to read from memory address 0x9c (156 in decimal), which is an invalid region for program access. Any attempt to read from this area triggers immediate termination by Windows.
3. Programmer Error: The issue stemmed from a failure to properly check for null pointers before accessing object members. In C++, address 0x0 is used to represent "null," i.e., "nothing here."
4. System Driver Context: Because the error occurred in a system driver with privileged access, Windows was forced to crash the entire system rather than just terminating a single program.

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗗𝗲𝘁𝗮𝗶𝗹𝘀
1. The error likely occurred when trying to access a member variable of a null object pointer.
2. The memory address being accessed (0x9c) suggests the code was attempting a read at an offset of 156 bytes from a null pointer (0x0 + 0x9c = 0x9c).
3. This type of error is preventable with proper null checking or with modern tooling that can detect such issues. (An illustrative code sketch follows this post.)

𝗜𝗺𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗙𝘂𝘁𝘂𝗿𝗲 𝗦𝘁𝗲𝗽𝘀
Microsoft's role and CrowdStrike's response:
• Need for improved policies to roll back defective drivers.
• Potential enhancement of code safety measures.
• Potential implementation of automated code sanitization tools.
• Consideration of rewriting the system driver in a memory-safe language like Rust.
• Industry-wide discussion on moving from C++ to safer languages like Rust.

𝗕𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗜𝗺𝗽𝗮𝗰𝘁
• Delivery services such as FedEx, UPS, and DHL face disruptions and delays.
• Supermarkets struggle to accept mobile payments.
• Corporate IT departments worldwide struggle with point-of-sale systems.
• Major hospitals halt surgeries.
• Airports ground and delay flights while engineers recover affected systems.
• Repercussions spread to other platforms, with Amazon Web Services reporting issues.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻
The losses incurred by businesses will easily run into the billions of dollars. This incident serves as a stark reminder of the importance of rigorous testing and safety checks in system-level software, especially in privileged contexts like drivers. It also highlights the ongoing challenges posed by memory-unsafe languages in critical software components.

#crowdstrike #software #tech #microsoft #techtrend
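To make the root cause concrete, here is a small, self-contained C++ sketch. It is an illustrative reconstruction, not CrowdStrike's actual code: the struct, field names, and 0x9c offset layout are assumptions chosen to mirror the failure mode described above, together with the null check that would have prevented it.

```cpp
// Illustrative reconstruction only -- not CrowdStrike's actual code.
// Reading a member through a null pointer becomes an access to a small
// invalid address (0x0 + member offset); in a privileged driver, that
// access violation takes down the whole machine.
#include <cstdint>
#include <cstdio>

struct ThreatConfig {
    char header[0x9c];    // 156 bytes of earlier fields (layout is assumed)
    std::uint32_t flags;  // this member sits at offset 0x9c from the base
};

std::uint32_t readFlagsUnchecked(const ThreatConfig* cfg) {
    // If cfg is null, this reads address 0x0 + 0x9c = 0x9c: undefined
    // behavior in user space, a system-wide crash in kernel/driver context.
    return cfg->flags;
}

std::uint32_t readFlagsChecked(const ThreatConfig* cfg) {
    // The missing guard: validate the pointer before touching its members.
    if (cfg == nullptr) {
        std::fprintf(stderr, "config missing or failed to parse; using default\n");
        return 0;
    }
    return cfg->flags;
}

int main() {
    const ThreatConfig* parsed = nullptr;  // e.g., a content file that failed to parse
    std::printf("flags = %u\n", readFlagsChecked(parsed));  // safe path
    // readFlagsUnchecked(parsed);  // would dereference null and crash
    return 0;
}
```

Static analyzers, sanitizers, or a memory-safe language such as Rust would each catch or rule out this class of bug, which is the direction the post above points to.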
-
Could the colossal failure of CrowdStrike's Falcon platform, which took down over 8 million Windows systems and caused severe damage to computer systems around the world, have been prevented with basic software engineering practices, including better design, good QA, and staged rollouts?

Poor design
The bug was caused by a threat-config file update in CrowdStrike's Falcon platform that caused the Falcon software to access invalid memory. The config file is updated frequently to make the platform aware of new security threats - sometimes multiple times a day. Since the Falcon software runs at the OS level as a driver, the invalid memory access crashed and rebooted the machine. Upon reboot the bad driver code would normally be ignored, but not in this case: the Falcon software was registered as a 'critical driver', so on each restart the OS tried to run it again. Preventing hacker malware is important - but is it more important than running the system at all? A more robust design would allow the system to reboot safely without CrowdStrike in the mix or, at a minimum, give admins the ability to configure this as an option remotely (a toy sketch of this idea appears after this post). Windows' weak OS design in this area, which has been an issue for years, is at the core of the problem. It is worth noting that machines running Linux, macOS, and other platforms were unaffected; in this particular case the update was specific to Windows. More generally, though, Unix-based operating systems have design features that would better protect them from the catastrophic doom loop that affected Windows machines.

Lack of basic QA
Software bugs can be very tricky, appearing only in rare edge cases. Those bugs are difficult to catch and test for. This was not one of them. With over 8 million computers affected, the crash seemed to happen every time. A basic QA process should have caught it.

A staged rollout would have been a better strategy
When deploying critical systems, it's wise to release new code in a smaller, controlled environment first. This helps identify and fix any bugs that could cause catastrophic consequences before a wider rollout. This is a lesson we learned early on at ICE. We adapted by rolling out major updates to certain smaller markets first, before rolling them out to the systems that managed global oil trading, for example. If there was a bug in the initial rollout we could pause, fix it, and try again - and the other markets would never be affected.
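The "reboot safely without CrowdStrike in the mix" idea amounts to a crash-loop breaker: if a component has taken the machine down several boots in a row, stop loading it until an admin intervenes. The C++ sketch below is a toy user-space illustration of that idea only; real drivers would need OS-level support, and the file name, threshold, and other names are invented for the example.

```cpp
// Toy crash-loop breaker: skip an optional component after repeated bad boots.
// Illustrative only; the path, threshold, and names are assumptions.
#include <fstream>
#include <iostream>

constexpr int kMaxBadBoots = 3;
const char* kCounterFile = "boot_failure_count.txt";  // hypothetical state file

int readCount() {
    std::ifstream in(kCounterFile);
    int n = 0;
    in >> n;              // a missing file simply reads as 0
    return n;
}

void writeCount(int n) {
    std::ofstream out(kCounterFile, std::ios::trunc);
    out << n;
}

int main() {
    int badBoots = readCount();
    if (badBoots >= kMaxBadBoots) {
        std::cout << "Component failed " << badBoots
                  << " boots in a row; booting WITHOUT it so the system stays up.\n";
        return 0;  // admins can re-enable it remotely once a fix ships
    }

    // Pessimistically record this boot as failed; only a clean run clears it.
    writeCount(badBoots + 1);
    std::cout << "Loading component (attempt " << badBoots + 1 << ")...\n";

    bool loadedCleanly = true;  // stubbed; a real crash would never reach this line
    if (loadedCleanly) writeCount(0);
    return 0;
}
```

The point of the sketch is simply that the fallback path has to be designed in ahead of time, not improvised in a war room during an outage.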
-
Whether you're a software vendor or an enterprise, how do you avoid a CrowdStrike situation? The primary responsibility here falls squarely on CrowdStrike, but there are lessons we can all keep in mind from this situation. These are problems we're solving every day at Bellwood.

1. Integration Testing - CrowdStrike has claimed this was a "content update" that did not require the full rigor of a normal software update. It is self-evident that was a bad assumption, and justifying reasons that an update doesn't require testing has been the downfall of many a software team. Both CrowdStrike and their customers should have separately deployed this update to a LARGE sample of potential system configurations that mirrored their real-world environment BEFORE the update ever touched real-world production systems. It's self-evident this did not happen.

2. Staged Rollout - software rollouts this large should not go everywhere at once. They should go to 0.1% of systems, then 1%, 5%, etc. (the pattern can vary, but the concept does not), after the rollout succeeds at each smaller scale. CrowdStrike has doubled down on the importance of rolling this sort of update out quickly because of the need to immediately protect from viruses, but this ignores that the cure can be worse than the disease. Vendors should not roll out updates of this scale at this speed, and enterprises need to build processes that let them control these rollouts and require that capability from vendors.

3. Automated Testing, including Fuzz Testing - the actual culprit seems to be a new configuration file that contained all "null" values. While this is certainly not standard process, it is a scenario that can and should be anticipated by automated tests on the core "Falcon" product. EVERY release of a product THIS critical should go through a massive battery of automated tests, including pre-configured error conditions, as well as an ongoing randomized "fuzz" test where random and common error scenarios are automatically generated and run against EVERY release of the software to check for errors. It is 100% necessary that this sort of product reaches a state where it is IMPOSSIBLE for a corrupted configuration file to break the host computer. (A toy fuzz-test sketch follows this post.)

4. Rollback - don't build or install software whose updates can't be rolled back. Much of the travel and business pain of the past few days could have been avoided if CrowdStrike were resilient enough to support remote software rollbacks to reverse the remote "upgrade" they deployed. No software team should tolerate deployments at this scale that can't be reversed, but CrowdStrike isn't built this way.

I'm always glad to talk through this sort of thing in greater detail - our Bellwood team has to think through these issues every day, especially as the products we build reach larger and larger scale. Are there any other lessons you think are relevant?
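Point 3 is worth making concrete. The sketch below is a toy C++ fuzz harness - the parseConfig function, its validity rules, and the iteration count are all invented for illustration - showing the core idea: throw random and degenerate inputs (including an all-null file like the one described above) at the parser and require that it always fails gracefully instead of crashing.

```cpp
// Toy fuzz harness for a config parser. Everything here is illustrative:
// a real product would run a coverage-guided fuzzer against its actual
// parsing code on every release.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

// Stand-in parser: must NEVER crash, whatever bytes it is given.
// Returns true only for inputs it fully understands.
bool parseConfig(const std::vector<std::uint8_t>& data) {
    if (data.size() < 8) return false;                   // too short to be valid
    if (data[0] != 'C' || data[1] != 'F') return false;  // bad magic bytes
    for (std::uint8_t b : data)
        if (b == 0xFF) return false;                     // arbitrary "corrupt" marker
    return true;
}

int main() {
    std::mt19937 rng(42);  // fixed seed so any failure is reproducible
    std::uniform_int_distribution<int> byteDist(0, 255);
    std::uniform_int_distribution<int> lenDist(0, 4096);

    // The case the incident made famous: a file of nothing but zero bytes.
    std::vector<std::uint8_t> allNulls(4096, 0);
    if (parseConfig(allNulls)) {
        std::puts("FAIL: an all-null file must be rejected, never trusted");
        return EXIT_FAILURE;
    }

    // Random inputs: we don't care whether they parse, only that we survive.
    for (int i = 0; i < 10000; ++i) {
        std::vector<std::uint8_t> input(static_cast<std::size_t>(lenDist(rng)));
        for (auto& b : input) b = static_cast<std::uint8_t>(byteDist(rng));
        (void)parseConfig(input);  // any crash here fails the whole test run
    }
    std::puts("fuzz run completed without a crash");
    return EXIT_SUCCESS;
}
```

In a real pipeline this kind of harness would run on every release, alongside the staged-rollout gates the post describes.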
-
It was a normal day at Starbucks—until cash registers stopped working at thousands of stores. Baristas couldn't process payments. Lines piled up. Some locations even gave out free drinks just to keep customers happy.

The cause? A software update gone wrong. A routine system update wasn't properly tested before rollout. When the update was deployed, it crashed point-of-sale (POS) systems nationwide, costing Starbucks millions in lost revenue.

Sound familiar? We see this all the time with software deployments and cloud migrations:
❌ Code is pushed to production without testing
❌ Rollback plans don't exist (see the sketch after this post)
❌ No monitoring in place to catch failures early

This is why DevOps and CI/CD pipelines matter. Companies need:
✅ Automated testing before updates go live
✅ Gradual rollouts instead of pushing to production all at once
✅ Incident response plans for when things break

At NexLink Labs, we help companies deploy with confidence—so they never have to worry about an update turning into a PR nightmare.

Have you ever been impacted by a tech failure like this?
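On the "rollback plans don't exist" point, one common pattern is to keep the previous release on disk and make "current" nothing more than a pointer that can be flipped back in seconds. The C++ sketch below illustrates the idea with a symlink swap; the directory layout, health check, and names are assumptions made for the example, not any particular deployment tool.

```cpp
// Toy "keep the old release, flip a pointer" rollback pattern.
// Layout, names, and the health check are illustrative assumptions.
// Note: creating symlinks may require elevated privileges on Windows.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

void pointCurrentAt(const fs::path& release, const fs::path& current) {
    fs::remove(current);                             // drop the old symlink, if any
    fs::create_directory_symlink(release, current);  // cheap, near-instant switch
}

bool posSystemsHealthy() {
    // Stubbed: a real check would query monitoring or smoke-test the registers.
    return false;  // pretend the new build is crashing tills nationwide
}

int main() {
    const fs::path previous = "releases/v41";   // known-good build (assumed)
    const fs::path candidate = "releases/v42";  // freshly deployed build (assumed)
    const fs::path current = "current";

    fs::create_directories(previous);
    fs::create_directories(candidate);

    pointCurrentAt(candidate, current);
    std::cout << "deployed: current -> " << fs::read_symlink(current) << "\n";

    if (!posSystemsHealthy()) {
        pointCurrentAt(previous, current);  // rollback is the same cheap flip
        std::cout << "rolled back: current -> " << fs::read_symlink(current) << "\n";
    }
    return 0;
}
```

The expensive part is having the previous version and the switch in place before you need them; the flip itself should be trivial.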
-
Three things to learn from CrowdStrike, even if you know nothing about software or security.

1. Avoid Major Deployments on a Friday Evening
Deploying significant updates on a Friday evening is a common mistake. People often assume a fix is simple and rush to get it done before the weekend. However, Friday updates are frequently hurried through testing, leading to issues that can ruin everyone's weekend.

2. Plan a Phased Rollout for Major Changes
When implementing major changes, it's wise to plan a phased rollout. Traditionally, this means deploying updates to one company, city, state, or country at a time. The fact that this rollout was instantly global is telling. Most major services segment their customers to avoid widespread issues. It's also common to have an early-adopter group, representing a small percentage of the customer base, receive patches and fixes anywhere from a few hours to a few weeks early.

3. Ensure Errors Can Self-Recover
Mistakes are inevitable. In any large piece of software or technology, you can expect developers to make errors over time. What's important is to have a plan for when these mistakes occur. Great software anticipates errors and includes mechanisms to recover automatically. For example, Microsoft has "Safe Mode," Apple has "Recovery Mode," and Netflix uses a practice called "Chaos Monkey" to train for, anticipate, and properly recover from the unexpected.

It's likely billions of dollars were lost in the CrowdStrike incident, but hopefully the lasting impact will be the companies you never hear about in the future because they followed some of these basic principles.
-
And we woke up to this today! A major issue with a patch to a security product - software that is supposed to prevent disruptions - ironically caused one of the biggest disruptions ever seen.

While we are in the era of AI, we can't forget some basic disciplines of software development that could have been better applied in this case:
- Change Management: You can't just apply a patch everywhere without proper testing. In a cloud-based computing world, the impact of a wide update is tremendous... and spreads much more quickly.
- Automated Testing: Functional, unit, and integration tests are not optional; they are a must.
- Governance: This problem clearly slipped past the companies' oversight because the update was applied at the core of the cloud without much control by the end users. Companies will have to revisit this. For some computers, automatically applying new patches should not be an option; in this case, the automatic security patch update turned out to be a really bad idea.

So, moving forward with AI, you MUST have your basic software engineering disciplines in order first. The AI Maturity Model specifies these disciplines as a foundation precisely to prevent catastrophic events like this. PROPER SOFTWARE ENGINEERING AND GOVERNANCE are more alive and important than ever. When a failure happens in AI, it will in many cases be MUCH more complicated to figure out the issue. Don't try to jump to AI if you don't have your basics well established; if you do, you will fall into a really bad situation.

I am sure the war rooms everywhere are pushing hard to get things resolved and many competent people are on the case. I hope the disruption is quickly resolved and the world continues to operate relying on technology the way we like it: safe and stable. Zallpy Digital
-
On July 19, 2024, a bug in CrowdStrike's Falcon update allowed faulty data to pass its Content Validator, causing millions of Windows systems to crash. Here's a summary of what happened and how CrowdStrike plans to prevent future incidents.

CrowdStrike's Content Catastrophe:
↳ Buggy update: A flaw in CrowdStrike's content validation system allowed a faulty update to slip through.
↳ Millions affected: Approximately 8.5 million or more Windows systems crashed after installing the update.
↳ The culprit: A configuration update meant to improve threat detection went awry. Trust in past deployments led to insufficient testing.
↳ Although the error was identified and the update was reverted within an hour, the damage was already done.

Measures to Prevent Future Incidents:
→ Enhanced Testing Protocols
↳ Local developer testing, content update and rollback testing, stress testing, fuzzing, and fault injection.
↳ Stability testing and content interface testing.
→ Improved Validation and Error Handling
↳ Additional validation checks will be added to the Content Validator.
↳ Improved error handling in the Content Interpreter to avoid similar issues.
→ Staggered Deployment Strategy
↳ Implementing a small canary deployment before gradually expanding. (A toy sketch of this pattern follows this post.)
↳ Monitoring sensor and system performance during deployments to guide a phased rollout.
→ Increased Customer Control and Communication
↳ Providing customers with more control over the delivery of Rapid Response Content updates.
↳ Offering content update details via release notes, with subscription options for timely information.

CrowdStrike has also promised to publish a more detailed root cause analysis once its internal investigation is complete.

PS: Are your update and deployment processes robust enough to prevent system-wide failures? Strengthen your protocols now to safeguard your organization's operational stability.

#CyberSecurity #DataProtection #CrowdStrike #StaySecure #CyberAware https://bit.ly/3WBh9U7
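The "small canary deployment before gradually expanding" measure is the same staged-rollout pattern several of these posts call for. Below is a minimal C++ sketch of the control loop; the stage percentages, health threshold, and stubbed telemetry are assumptions for illustration, not CrowdStrike's actual remediation plan.

```cpp
// Minimal staged (canary) rollout controller.
// Percentages, thresholds, and the telemetry stub are illustrative assumptions.
#include <iostream>
#include <vector>

// Stand-in for real telemetry: fraction of updated hosts reporting healthy.
double healthyFraction(double stagePercent) {
    (void)stagePercent;
    return 0.999;  // stubbed; a real controller queries monitoring here
}

int main() {
    const std::vector<double> stages = {0.1, 1.0, 5.0, 25.0, 100.0};  // % of fleet
    const double minHealthy = 0.995;  // below this, halt and roll back

    for (double pct : stages) {
        std::cout << "Deploying update to " << pct << "% of systems...\n";
        // deployToPercent(pct);  // hypothetical call into the deployment system

        double healthy = healthyFraction(pct);
        if (healthy < minHealthy) {
            std::cout << "Healthy fraction " << healthy
                      << " is below threshold; halting rollout and rolling back this stage.\n";
            // rollBackPercent(pct);  // hypothetical rollback call
            return 1;
        }
        std::cout << "Stage at " << pct << "% looks healthy; expanding.\n";
    }
    std::cout << "Rollout complete across the full fleet.\n";
    return 0;
}
```

The key property is that the blast radius of a bad update is bounded by the current stage, not by the size of the entire installed base.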
-
This week, the IT community has been reflecting on one of the largest outages in history, caused by a CrowdStrike software update. While there have been many posts attributing blame to an intern or to the company itself, it's crucial to focus on the lessons learned from this situation. Here are my key takeaways:

1. The Importance of Testing: Rigorous testing remains a cornerstone of the development process. Any code change pushed to production can potentially disrupt the system, highlighting the need for thorough testing.

2. Adherence to Processes: Well-defined processes are essential, and everyone must adhere to them. These processes should be tailored to the specific context of the organization, rather than being generic solutions copied from the internet.

3. Guidance for Interns: Trust in your engineers is vital, but interns require guidance and mentorship. They need opportunities to learn and practice their skills in a supportive environment.

4. Impact of Individuals: A single individual has the power to effect significant change, whether positive or negative. This underscores the importance of responsibility and vigilance in every role.

Let's use this incident as a learning opportunity to strengthen our practices and prevent future mishaps.

#CrowdStrike #update #testing #bug #outage #lessonslearned