How do we maintain an ideal IT environment in the cloud?
Introduction
Now that we have designed and built our system, we need to keep it running. Unfortunately, operations is often neglected during the design and build phases, leaving operators to "figure it out". Having your operations and development teams work together during design and build, a growing practice known as DevOps, can alleviate many of these issues.
Part 3: Ideal System Operations
The system is built so that it can be monitored
Obvious candidates for monitoring include CPU, memory, and disk utilization, but monitoring should also cover metrics specific to the system or service. Metrics like transaction time and connection duration show how your system is handling real-world load. Monitoring also allows many small issues to be detected and resolved automatically.
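As a rough illustration of service-level monitoring, the sketch below times each transaction and flags the service as degraded when the 95th percentile exceeds a threshold. The class name TransactionMonitor, its methods, and the threshold value are hypothetical, chosen only for this example; a real deployment would more likely send these numbers to an existing monitoring system.

```python
import time
from contextlib import contextmanager
from statistics import quantiles


class TransactionMonitor:
    """Hypothetical helper that records transaction durations in memory."""

    def __init__(self, threshold_seconds):
        # Assumed alerting threshold for the 95th percentile, in seconds.
        self.threshold_seconds = threshold_seconds
        self.durations = []

    @contextmanager
    def track(self):
        # Time the wrapped block and record it even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)

    def p95(self):
        # quantiles() needs at least two samples, so handle small inputs first.
        if not self.durations:
            return 0.0
        if len(self.durations) == 1:
            return self.durations[0]
        # With n=20 the last cut point is the 95th percentile.
        return quantiles(self.durations, n=20)[-1]

    def is_degraded(self):
        return self.p95() > self.threshold_seconds
```

A caller would wrap each unit of work in `with monitor.track():` and check `monitor.is_degraded()` periodically to decide whether to raise an alert.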
There is documentation to resolve every known issue
Documentation should cover not only error codes and exceptions, but also the responses to alerts generated by the monitoring system. If an error occurs frequently, it should be investigated so that future occurrences can be mitigated.
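One way to tie alerts to documentation is a simple lookup from alert name to documented response, with a counter that flags frequently recurring alerts for investigation. This is only a sketch: the alert names, the response text, and the INVESTIGATE_AFTER threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical runbook: alert name -> documented response.
RUNBOOK = {
    "disk_usage_high": "Rotate logs, then expand the volume if usage stays high.",
    "db_connection_timeout": "Check the connection pool size and restart the pooler.",
}

# Assumed threshold: investigate once an alert has fired this many times.
INVESTIGATE_AFTER = 3

alert_counts = Counter()


def handle_alert(alert_name):
    """Return the documented response and whether the alert needs a deeper look."""
    alert_counts[alert_name] += 1
    response = RUNBOOK.get(alert_name)
    if response is None:
        # A missing entry is itself a finding: the documentation has a gap.
        response = "UNDOCUMENTED: write a runbook entry"
    return {
        "alert": alert_name,
        "response": response,
        "needs_investigation": alert_counts[alert_name] >= INVESTIGATE_AFTER,
    }
```

Surfacing undocumented alerts explicitly keeps the "every known issue" goal honest, since any alert without an entry shows up immediately.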
Service features can be individually enabled or disabled
Being able to toggle features individually allows for testing, vulnerability mitigation, and feature roll-out to select customers. It also allows a newly upgraded feature to be rolled back without affecting the entire system. On a global scale, there is no acceptable downtime window in which to resolve issues related to new features.
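A minimal sketch of per-feature switches with an optional customer allow-list might look like the following. The FeatureFlags class and its method names are hypothetical, and the flags live in memory here; a production system would usually read them from shared configuration so they can change without a deploy.

```python
class FeatureFlags:
    """Hypothetical in-memory feature flag store."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled, customers=None):
        # customers=None means the flag applies to everyone when enabled.
        self._flags[name] = {
            "enabled": enabled,
            "customers": set(customers) if customers is not None else None,
        }

    def is_enabled(self, name, customer_id=None):
        flag = self._flags.get(name)
        # Unknown or disabled flags are off, which makes rollback a one-line change.
        if flag is None or not flag["enabled"]:
            return False
        if flag["customers"] is None:
            return True
        return customer_id in flag["customers"]
```

For example, `flags.set_flag("new_checkout", True, customers=["acme"])` would expose the feature to one customer, and `flags.set_flag("new_checkout", False)` would switch it off for everyone without touching the rest of the system.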
The DevOps teams schedule failures to test the system
Working together through failure builds confidence in the system and the teams involved. If the system was built correctly, it should be able to handle failures without impacting staff or customers. Having the development team involved during these exercises allows for changes to be made where required.
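A scheduled failure exercise can be sketched in miniature as follows: take one backend out of service on purpose, confirm that requests still succeed through failover, then restore it. Everything here is a simulation under assumed names (Backend, call_with_failover, run_failure_drill); a real drill would act on actual infrastructure rather than in-process objects.

```python
class Backend:
    """Simulated service instance that can be marked unhealthy."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(backends, request):
    # Try each backend in order and return the first successful response.
    for backend in backends:
        try:
            return backend.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all backends failed")


def run_failure_drill(backends, victim_index, request="ping"):
    """Disable one backend, check the service still answers, then restore it."""
    victim = backends[victim_index]
    victim.healthy = False
    try:
        call_with_failover(backends, request)
        return True
    except RuntimeError:
        return False
    finally:
        # Always restore the victim so the drill leaves no lasting damage.
        victim.healthy = True
```

A drill that returns False points to a single point of failure, which is exactly the kind of finding the development team can act on during the exercise.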
Thank you
Thank you for reading my view of an ideal System Operations environment. This series has received way more views than I expected and has been a fun exercise. Hopefully, it inspires others in the tech field to look at their processes and make changes that benefit their environment.