Find that root cause!
Software has given us fantastic opportunities. We have products and services at our fingertips that few could even have imagined not too long ago. Features implemented in software are a defining element of our modern world. So are software bugs.
How hard can it be to fix this? Can't we just fire the sloppy programmers who wrote the buggy code and let experienced people build the software instead? Yes, experienced people make fewer mistakes, but there will still be bugs. The very property that lets software implement virtually anything you want also has a side effect: under specific conditions nobody foresaw, it can do many things you never wanted or expected. Bugs also do not come only from coding errors, but from inconsistent specifications, poorly understood interactions, procured tools, or company management pushing development work 24/7 to meet a deadline.
If it is hard to get rid of bugs, then why not just accept them? One time out of a hundred the dishwasher doesn't start when you start it, and you have to pull the plug and reset it. Maybe not a big problem, but then add the television, the internet router, the microwave oven and, even worse, your car, central heating or home alarm system. Life can get quite stressful and dangerous with lots of sporadic glitches and failures.
The industry has gathered a lot of knowledge and methods for minimizing the risk of bugs. The basic thinking, in my mind, is to apply "filters" on top of each other to catch these bugs, in a similar fashion to the "Swiss cheese" approach to maintaining security in software (figure courtesy of LDRA).
Source https://ldra.com/wp-content/uploads/2018/11/figure1.jpg
By doing proper architecture work at the system and software level, by maintaining traceability from requirements via implemented code to test cases, by following coding guidelines such as MISRA, by checking the code with automated analysis tools as well as manual peer review, and by having independent testers really try to break your product, you can get rid of most bugs and unintended functionality. Each of these is one more "filter". However, there will always be some bugs left that reach the customer. This leads us to the world of finding the root cause of that annoying or dangerous bug.
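As a rough back-of-the-envelope illustration of why stacking filters helps but never reaches zero, here is a minimal sketch in Python. The catch rates are invented for the example, and the calculation assumes the filters catch bugs independently of each other, which real filters rarely do.

```python
from math import prod


def escape_fraction(catch_rates):
    """Fraction of bugs that slip past every filter.

    Assumes each filter catches bugs independently of the others,
    which is a simplification made only for this illustration.
    """
    return prod(1.0 - rate for rate in catch_rates)


# Hypothetical catch rates, invented for the example (not measured data).
filters = {
    "architecture work": 0.30,
    "static analysis": 0.40,
    "peer review": 0.50,
    "independent testing": 0.60,
}

remaining = escape_fraction(filters.values())
print(f"Bugs reaching the customer: {remaining:.1%}")
```

With these made-up numbers, 0.7 x 0.6 x 0.5 x 0.4 = 8.4% of the bugs still get through. Adding another filter shrinks the figure further, but as long as no single filter is perfect the product never becomes zero.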
In my mind, what really distinguishes a great software company from an average one is the way these remaining bugs are handled. Data and observations from the end customer are worth their weight in gold. They can help identify bugs before they become too annoying or cause expected functionality to stop working. Maybe it is a built-in error that today only causes small glitches, but in a subsequent upgrade might cause a critical failure. A great company understands this and has built-in data-collection methods and/or easy ways for the customer to report bugs. Customers are also kept informed about the status through an automated ticket system. A great user understands that the more relevant information is added, the easier it will be to solve the problem. Data showing the circumstances under which the bug occurs, whether it is repeatable, any side events observed, or a video showing the behaviour will be of great help to the software developers in understanding what really happens and fixing the code.
In regular development of features, overloading the team and working overtime is a bad habit that burns people out and causes bugs. But when fixing critical bugs, overtime is often a must. All the experts needed should be reprioritized, and development of new features sacrificed if needed. Getting stability in the existing code is a must before adding more content. No one wants to end up in such a critical situation. One way to help avoid it is to treat customer-reported bugs, including those that seem less important, as an additional quality "filter" and use them to increase the quality of the product.
From my own experience, I know that finding the real root cause can be a five-minute thing. But more often it takes days or even weeks. Of course bugs need to be prioritized, but some are so critical that you need to give them the time and resources needed to find them. In my mind, it is a big mistake to send out fixes based on guessing what the root cause could be. This trial-and-error approach will likely not fix the issue. It will rather just annoy the user more or, worse, temporarily obscure the bug, which could then blow up later with worse consequences.
As a newly graduated engineer, I remember our entire team hunting a vicious bug causing unpredictable behaviour in a radio base station about to be launched to the customer. Weekend work was needed, and we had lots of overtime pizza. The issue was finally traced to a bug in the embedded operating system that came from a supplier. I also remember working with an infotainment system for a brand new car model almost 20 years ago. We got reports from an experienced user at the customer that each time he drove out of a tunnel on the Riviera, the infotainment system crashed. Trying to reproduce this in a tunnel in grey Swedish weather gave no results. But close to the Mediterranean, going at high speed on the autostrada out of a tunnel into the bright sunlight, the exterior light sensor jumped through many values in a fraction of a second. The root cause? Changes in the exterior light sensor data were one of the data sources set up for notification to other ECUs in the car. At each change, a big matrix was run through to check which notifications to send. Too many changes in too short a time consumed all the CPU time and subsequently led to a crash. When root causes like these are found, what a relief for the entire team!
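The mechanism behind the tunnel crash can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the matrix layout, the signal names and the burst size are all hypothetical, and the real system was embedded code in an ECU. The article does not say how the bug was eventually fixed; coalescing changes and scanning once per cycle is shown here simply as one common mitigation for this kind of notification storm.

```python
class NaiveNotifier:
    """Scans the whole subscription matrix on every single change."""

    def __init__(self, matrix):
        self.matrix = matrix      # rows of (signal_name, subscriber)
        self.rows_scanned = 0     # stand-in for CPU time spent
        self.sent = []            # notifications sent so far

    def on_change(self, signal, value):
        for row_signal, subscriber in self.matrix:
            self.rows_scanned += 1
            if row_signal == signal:
                self.sent.append((subscriber, signal, value))


class CoalescingNotifier(NaiveNotifier):
    """Remembers only the latest value per signal, scans once per cycle."""

    def __init__(self, matrix):
        super().__init__(matrix)
        self.pending = {}

    def on_change(self, signal, value):
        # Constant-time work per change: a newer value overwrites an older one.
        self.pending[signal] = value

    def flush(self):
        """Meant to be called once per scheduler cycle."""
        if not self.pending:
            return
        for row_signal, subscriber in self.matrix:
            self.rows_scanned += 1
            if row_signal in self.pending:
                value = self.pending[row_signal]
                self.sent.append((subscriber, row_signal, value))
        self.pending.clear()


# Hypothetical matrix: 499 unrelated rows plus one for the light sensor.
matrix = [(f"signal_{i}", f"ecu_{i % 8}") for i in range(499)]
matrix.append(("exterior_light", "ecu_lighting"))

naive = NaiveNotifier(matrix)
coalescing = CoalescingNotifier(matrix)

# Leaving the tunnel: the sensor value changes 1000 times within one cycle.
for value in range(1000):
    naive.on_change("exterior_light", value)
    coalescing.on_change("exterior_light", value)
coalescing.flush()

print("naive rows scanned:     ", naive.rows_scanned)
print("coalescing rows scanned:", coalescing.rows_scanned)
```

In this toy setup the naive version scans 500,000 rows for the burst, while the coalescing version scans 500 and sends a single notification carrying the latest value. The trade-off is that subscribers no longer see every intermediate value, which is acceptable for a light level but might not be for other signals.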
(While writing this article the internet connection was suddenly lost in a television set causing the family members watching that streamed transmission to wonder what XXX happened. A software bug perhaps!)
The wonder with software these days is that it really is everywhere, with no exaggeration, yet is set only to become more ubiquitous. As a consequence, it is critical to things that we already take for granted, and bugs can be critical and highly visible simply because "users" are so used to things just working as they should. I agree that transparency from software and whole-product owners is key here, because half the battle can often depend on users being open with their data too. And they are discouraged from doing that by Amazon Echos listening in their bedrooms (perhaps). How often do we trace a "bug" to an unintended use or exterior factor? Early ECUs were plagued by being knocked out by ground radar. A rich seam of humour there! The story of the TV reminds me of a recent case in the UK involving an old TV knocking out a village's broadband. https://www.bbc.com/news/uk-wales-54239180