Why do complex software systems often fail?
The rise of safety-critical systems from simple enterprise software
In just a few decades, we have seen many simple enterprise web apps evolve into business-critical systems. Some of them quickly became safety-critical systems. So any software we build for the enterprise today has the prospect of evolving toward some form of safety criticality. Looking at the Internet of Things and the Big Data paradigm, this becomes even more evident.
If you don’t have safety criticality in your business software now, you will have it in the future.
I will quote Dietrich Dörner’s words from The Logic of Failure here.
Complexity is the label we will give to the existence of many interdependent variables in a given system
Failures and complex software systems
These systems are hazardous by their very nature. The frequency of failures can be reduced, but the complexity of enterprise processes itself keeps giving rise to new kinds of failure.
Complex systems are heavily, and for the most part successfully, defended against failures. To defend them, enterprises surround them with backups and other safety systems.
Further defenses are built against the human element through training and education of enterprise users. To add another layer of security, various organizational, institutional and regulatory defenses are built up in the form of policies, procedures, certifications and team training. A wall of defense is built up.
Often, these defenses are designed around single failures, which, considered individually, are easy to safeguard against. These layers of fail-safes work really well for organizations with simple software requirements, and operations are carried out without any hassle.
Great complexity places high demands on a planner's capacity to gather information, integrate findings, and design effective actions
A single point of failure isn’t the big deal...
But in complex systems, safeguarding against single-point failures is not enough. Small, independent failures often combine in a complex environment to create catastrophic system failures. This is what separates complex software systems from the rest: it is almost impossible to run a complex software solution without multiple flaws present.
The nature of these flaws changes dynamically, driven by new technology being continuously integrated, a constantly changing workplace, and the ongoing efforts to eradicate the failures themselves.
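The combinatorics behind this are worth making concrete. The Monte Carlo sketch below (with illustrative numbers I have assumed, not measured from any real system) models a system whose individual flaws are rare and individually defended, yet whose chance of several flaws coinciding over a year is high:

```python
import random

# Illustrative sketch (all numbers are assumptions, not measurements):
# a system carries many small, independent latent flaws. Each flaw is
# individually rare and individually defended, but a catastrophic
# outage happens only when several flaws fire together.

N_FLAWS = 50        # latent flaws present in the system
P_FLAW = 0.01       # chance a single flaw activates on a given day
CATASTROPHE_AT = 3  # outage when this many flaws coincide
DAYS = 365
TRIALS = 200        # simulated years

def year_has_outage(rng: random.Random) -> bool:
    """Simulate one year; return True if enough flaws ever coincide."""
    for _ in range(DAYS):
        active = sum(rng.random() < P_FLAW for _ in range(N_FLAWS))
        if active >= CATASTROPHE_AT:
            return True
    return False

rng = random.Random(42)
rate = sum(year_has_outage(rng) for _ in range(TRIALS)) / TRIALS
print(f"Estimated chance of a multi-flaw outage within a year: {rate:.0%}")
```

On any single day a three-flaw coincidence is unlikely, but over a year it becomes nearly certain under these assumed numbers. Hardening each flaw in isolation does not remove this coincidence risk; only reducing the number of latent flaws or decoupling them does.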
These systems still work
Despite the flaws, these complex systems work, and people make them function properly – albeit in a somewhat broken manner. Anyone who thinks these flaws should all be pre-identified and diagnosed has clearly never been part of building complex software processes. Software operation and usage have become extremely dynamic, with even the basic components (organizational, technological and human) failing and changing continuously.
Root-cause analysis becomes fundamentally flawed here, mainly because it tries to isolate a single “cause” of an event when there are multiple contributors. Attributing a failure to one root cause is not technically correct, and it discards a fuller understanding of how the software system actually failed.
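A toy model makes the problem with a single “root cause” visible. In the hypothetical sketch below (the contributor names are invented for illustration), an outage requires three conditions to hold at once; each is necessary, none is sufficient, so no single factor can honestly be labeled the root cause:

```python
# Hypothetical sketch: an outage that requires three contributing
# conditions at once. The names are invented for illustration.

contributors = {
    "stale_config": True,
    "retry_storm": True,
    "failover_disabled": True,
}

def outage(conditions: dict) -> bool:
    # The failure occurs only when every contributor lines up.
    return all(conditions.values())

print("Outage with all contributors present:", outage(contributors))

# Removing any single contributor prevents the outage, so each one
# is "the cause" by the counterfactual test - which means none is.
for name in contributors:
    patched = {**contributors, name: False}
    print(f"Outage if only {name!r} is fixed:", outage(patched))
```

Every contributor passes the counterfactual test (“would the outage have happened without it?”), so picking one of them as the root cause is an arbitrary choice, not an analytical conclusion.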
Complexity is not an objective factor but a subjective one. Super Signals reduce complexity, collapsing a number of features into one. Consequently, complexity must be understood in terms of a specific individual and his or her supply of super signals. We learn super signals from experience, and our supply can differ greatly from another individual's. Therefore there can be no objective measure of complexity
While this post talks about software system failures, I will soon cover building complex adaptive systems that can deal with multiple points of failure and with dynamic change in general.
Have any thoughts to share? Drop a comment!