Reliability in software: managed systems vs. engineered systems
Information technology (IT) systems built for business over the last several decades have generally followed the path of build, maintain & support, and evolve. But a close look at the money spent on applications after their initial rollout shows that the bulk of it has gone to "maintain & support" and not to "evolve". It is a well-accepted norm that an extended maintenance and support team is needed to ensure that the applications operate properly and meet their functional and operational SLAs. Essentially, this is a #ManagedSystem approach.
However, when you look at the other type of software that has been in existence for more or less the same amount of time, namely operational technology (OT), mainly the embedded software that goes into plant automation, car electronics, and building automation, the lifecycle is significantly different. These systems were built and deployed, and they typically work for the lifetime of the system with no or very minimal maintenance & support activity. Further investment went into evolution, which typically shows up only in the next generation of the product. (There was no easy way to deliver patches or updates to these systems anyway, so everything needed to ensure smooth & safe operation had to be engineered upfront.) So we can say that they take an #EngineeredSystem approach.
How is it that these two kinds of systems have taken completely different paths, even though they advanced in parallel? (Note: The points below are my own hypotheses and experience and do not draw on any structured research.)
Managed System - Built with a focus on features, designed around the lifecycle of a single request being handled, and tested to confirm that the features work, primarily along the happy path of execution.
Engineered System - Built with a focus on safety and reliability, designed to run for its entire lifetime with good observability, and tested starting with boundary conditions & failure points.
The technology stack, the tooling, and the development lifecycle that go into building the systems, along with the people involved and their educational backgrounds, are all significantly different as well.
- Technology stack - Embedded systems run on a smaller infrastructure footprint, because products may be manufactured in the millions (so unit cost has to be low) or installed where space is at a premium (so the installation footprint has to be small). Often the compute power in large plant machinery is several times lower than that of a normal home computer. (But the key difference is special-purpose hardware & software vs. general-purpose.)
- Tooling - Development tools ultimately deliver highly optimized machine code (micro codes) and have good simulation capability. (Simulate input signals on all I/O channels, run the micro codes, and evaluate the output signals generated on the I/O channels.) Observability tools are built into the design itself. (Just open the control panel and look at the flickering LEDs to locate issues.)
- Development cycle - Quite obviously, since updates are impossible or very limited, the one production release has to be safe, reliable, feature-rich & performant. (In typical business applications, these characteristics are delivered across releases.) In OT systems, a malfunction or defect in the application can put human life at risk, so safety takes top priority.
- People - Very closely connected with the physical world, obsessed with seeing things work in the real world, and inclined to map anything in software back to real physical actions. They have a strong understanding of the science & engineering of the systems being built, complemented by skills in algorithms.
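To make the simulation idea from the tooling point concrete, here is a minimal sketch in Python. It is only an illustration: the tank-level controller, its thresholds, and the names `control_step` and `simulate` are my own assumptions, not taken from any real OT toolchain. The pattern is the one described above: feed simulated input signals, run the control logic, and evaluate the outputs, starting with boundaries and a failed sensor.

```python
from dataclasses import dataclass

# Hypothetical limits for an illustrative tank-level controller (assumptions).
LOW_LEVEL = 20.0    # percent: start filling below this
HIGH_LEVEL = 80.0   # percent: stop filling above this
SENSOR_MIN, SENSOR_MAX = 0.0, 100.0


@dataclass(frozen=True)
class Outputs:
    pump_on: bool
    alarm: bool


def control_step(level: float, pump_was_on: bool) -> Outputs:
    """One scan cycle: read the level input, decide the pump and alarm outputs."""
    # Failure point first: a reading outside the sensor range (or NaN) cannot be
    # trusted, so go to the safe state (pump off, alarm on).
    if not (SENSOR_MIN <= level <= SENSOR_MAX):
        return Outputs(pump_on=False, alarm=True)
    if level < LOW_LEVEL:
        return Outputs(pump_on=True, alarm=False)
    if level > HIGH_LEVEL:
        return Outputs(pump_on=False, alarm=False)
    # Between the thresholds, keep the previous pump state (hysteresis).
    return Outputs(pump_on=pump_was_on, alarm=False)


def simulate(levels):
    """Feed a sequence of simulated input signals and record every output."""
    pump_on = False
    trace = []
    for level in levels:
        out = control_step(level, pump_on)
        pump_on = out.pump_on
        trace.append(out)
    return trace


# Simulated input signal: boundaries and a broken sensor first, then normal values.
signal = [0.0, 20.0, 80.0, 100.0, float("nan"), -5.0, 150.0, 10.0, 50.0, 90.0]
trace = simulate(signal)
for level, out in zip(signal, trace):
    print(f"level={level:>6} pump_on={out.pump_on} alarm={out.alarm}")
```

The printed trace plays the role of the flickering LEDs: every input has a visible, defined output, including the inputs that should never occur.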
While a lot of conversation and development is focused on taking advancements in information technology, namely big data, analytics, AI & machine learning, to improve efficiency in operational technology, it might also be good to look at how to bring the best of the OT "EngineeredSystem" approach to IT systems.
- Focus on boundary conditions and exceptions throughout the design to eliminate or reduce failures and the need for maintenance & support activities.
- Reduce the footprint of the stack to the specific purpose of the application, even when using general-purpose systems.
- Get better tooling for simulation in the development process (IT systems put a huge focus on testing, but do not approach it with a simulation mindset) and for observability into the runtime environment.
- A huge change is required in people's mindset and skill levels: they should take pride in demonstrating that the system (my code) will not break.
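As a rough illustration of the first point applied to a business function, here is a small Python sketch. The `allocate` function, the `MAX_ORDER` limit, and the `Result` type are hypothetical names I made up for the example. The idea is that every input, valid or not, maps to a defined result, and the checks are written boundary-first with the happy path last.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Result:
    status: Status
    allocated: int
    reason: str = ""


MAX_ORDER = 1_000  # hypothetical business limit (assumption)


def allocate(stock: int, requested) -> Result:
    """Allocate stock for an order; every input yields a defined Result."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(requested, int) or isinstance(requested, bool):
        return Result(Status.REJECTED, 0, "quantity must be an integer")
    if requested <= 0:
        return Result(Status.REJECTED, 0, "quantity must be positive")
    if requested > MAX_ORDER:
        return Result(Status.REJECTED, 0, "quantity exceeds order limit")
    if requested > stock:
        return Result(Status.REJECTED, 0, "insufficient stock")
    return Result(Status.OK, requested)


# Checks are listed boundary-first; the happy path comes last.
checks = [
    (allocate(10, 0), Status.REJECTED),             # lower boundary
    (allocate(10, -1), Status.REJECTED),            # below lower boundary
    (allocate(5_000, MAX_ORDER + 1), Status.REJECTED),  # above upper boundary
    (allocate(10, 11), Status.REJECTED),            # one more than stock
    (allocate(10, "3"), Status.REJECTED),           # wrong type
    (allocate(10, 10), Status.OK),                  # exactly the stock
    (allocate(10, 3), Status.OK),                   # happy path
]
all_passed = all(result.status is expected for result, expected in checks)
print("all checks passed:", all_passed)
```

Returning a result instead of raising is one design choice among several; what matters here is that no input is left without a defined behaviour.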
These changes would be required across all layers of the stack used to realize IT applications, namely the operating system, platform, middleware, applications, and monitoring & operational tools. In today's world, IT systems operations & management incurs huge effort and cost just in security patching and upgrades (being general-purpose exposes a large surface area), monitoring (as the system cannot manage its own boundary conditions), and handling failed processes (insufficient exception handling and poor or no self-recoverability).
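One possible shape for self-recoverability in application code is sketched below in Python. It assumes a task that can fail transiently and a known safe fallback; `run_with_recovery` and its parameters are hypothetical names for illustration, not an API from any library. The task is retried with exponential backoff, and if it still fails, the code drops to the fallback instead of leaving a failed process for the support team.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_recovery(
    task: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry task with exponential backoff; use fallback if every attempt fails."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            # A real system would log the error here and catch narrower types.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: degrade to the known safe state.
    return fallback()


# Usage: a task that fails twice and then succeeds. The sleep function is
# injected so the example records the delays instead of actually waiting.
delays = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "fresh data"


def broken():
    raise TimeoutError("still down")


recovered = run_with_recovery(flaky, fallback=lambda: "cached data", sleep=delays.append)
degraded = run_with_recovery(broken, fallback=lambda: "cached data", sleep=lambda _: None)
print(recovered, degraded, delays)
```

Injecting `sleep` keeps the recovery logic testable in the same simulation spirit: the failure sequence is scripted and the behaviour is checked without waiting on real time.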
A friendly note for people who are taking IT advancements into the OT world: please do not dilute the spirit & zeal with which OT systems have been built for decades. (Do not rely on over-the-air updates to ensure that systems operate safely and effectively.)
Note: Personally, I worked in the OT space more than 25 years ago and did some modelling & simulation work, but for the better part of the last 20+ years I have been on the IT side. I still often miss the charm of working on OT systems.