DevOps and Cloud Native Architecture Guide
The importance of proper software architecture and design principles for being "Agile", adopting "DevOps practices", or being "Cloud Native". Focus on the Why and How instead of the Where.
Too often, traditional enterprises, big or small, that are adopting Agile, DevOps, or the cloud focus too much on tooling without addressing some of the key ingredients for success, such as culture and the necessary software architecture. Another way to put this: instead of focusing on the "Why", "How", or "What", we tend to focus on the "Where". Take an example: Company X wants to adopt DevOps and Agile, or wants to modernize its applications. It may already have applications running in its data center that it would like to migrate to some form of cloud (public, private, or hybrid). But quite often the discussion and planning start with, and stay focused on, which platform to run these applications on, which CI/CD tools to buy or implement, which container orchestration strategy is the coolest, or which technology stack is the latest. These are important decisions to formalize, but they should not be the sole factors; ideally they should be derived naturally and organically from the desired software architecture, which answers "Why are we doing this?" and "How would we like to do this?".
So, here is a short summary of how I like to think about this, without focusing on any particular technology choice. It forms the foundation for a Cloud Native architecture that is Agile and DevOps friendly, by answering a series of questions that lead us to a set of architectural choices and principles, which in turn decide what technology choices need to be made to arrive at the desired state.
WHAT does a Cloud Native Architecture mean for us?
Are we ready to answer "yes" to most of the requirements that today's modern application architectures demand, and to focus on outcomes rather than short-term, easily measured output?
Driving Business Value through Agility & Technical Excellence
- Are we building a testable application architecture that makes writing automated tests trivial?
- Can multiple versions of these applications be tested independently as needed?
- Are we building an architecture that promotes shorter delivery times, flow, and a constant feedback loop?
- Are we giving the business the agility and flexibility to run frequent experiments and understand the business value of its changes?
- Do we have an organizational culture where Continuous Integration and Continuous Delivery are a core focus of the delivery chain?
- Is our application code agile enough to change quickly and flexibly with business demands?
§ Think:
- Two of the four key DevOps (DORA) metrics: Deployment Frequency and Change Lead Time.
- Small batch sizes and frequent deployments with microservice-like architectures.
- A/B experimentation, feature toggles, value stream mapping.
- Repeatable, orchestrated environments made available on demand through self-service APIs.
- Improvements to developer productivity.
- Very low technical debt.
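Deployment Frequency and Change Lead Time can be computed directly from delivery data. Here is a minimal sketch; the deployment records and field names are illustrative assumptions, not a standard schema:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records: when a change was committed
# and when it reached production.
deploys = [
    {"committed": datetime(2024, 5, 1, 9), "deployed": datetime(2024, 5, 1, 15)},
    {"committed": datetime(2024, 5, 2, 10), "deployed": datetime(2024, 5, 3, 10)},
    {"committed": datetime(2024, 5, 3, 11), "deployed": datetime(2024, 5, 3, 17)},
]

def deployment_frequency(deploys, days):
    """Deployments per day over an observation window."""
    return len(deploys) / days

def change_lead_time(deploys):
    """Median time from commit to production."""
    return median(d["deployed"] - d["committed"] for d in deploys)

print(deployment_frequency(deploys, days=7))  # deployments per day
print(change_lead_time(deploys))              # median commit-to-deploy time
```

Tracking these two numbers over time shows whether smaller batch sizes are actually shortening the path from commit to production.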
§ Design patterns:
- Loosely coupled architectures, instead of a tightly coupled monolith, give us the flexibility to swap out pieces of the architecture as things change.
- Adopt a polyglot approach to find the best fit for each technology requirement, instead of settling for a suboptimal choice.
- An API-driven design that follows the SOLID principles.
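Feature toggles, mentioned above, are one concrete way to decouple deployment from release and enable A/B experimentation. A minimal sketch follows; the toggle store, flag names, and rollout percentages are illustrative assumptions:

```python
import hashlib

# Hypothetical toggle configuration: flag name -> rollout percentage (0-100).
FLAGS = {"new-checkout": 25, "dark-mode": 100}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing flag + user keeps the decision stable for a given user,
    so A/B cohorts do not flip between requests.
    """
    pct = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct

# A given user consistently sees or does not see the feature,
# at the configured rollout percentage.
print(is_enabled("dark-mode", "alice"))  # True: 100% rollout
```

Because the code path ships dark behind the flag, deployment frequency can rise without forcing every change on every customer at once.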
Driving Business Value through Operational Excellence
- Are we building a loosely coupled, distributed application architecture that can survive performance degradation gracefully without impacting customers?
- Are we ready to build on the premise that "the only thing constant is change", that production instability or failure will always occur, and that we need to build the architecture to adapt to it proactively instead of reactively?
- Are we building to reduce our operational overhead and toil, automating failure detection, remediation, and recovery instead of relying on manual intervention, or at a minimum using automation to speed up resolution?
- Are we building applications that can be deployed safely and repeatably, without requiring outages, even during business hours?
- Are we ready to automate every possible aspect of configuration and operational management, without downtime or manual labor?
- Are we building telemetry into every deployment so it can provide quick feedback and prevent avoidable failures or customer dissatisfaction?
- Is my application a single point of failure, and how can I minimize its blast radius?
§ Think:
- The other two key DevOps (DORA) metrics: Change Failure Rate and Mean Time to Recovery.
- Optimizing telemetry for applications in production so that it provides quick feedback.
- Leveraging managed services or software that provide continuous log ingestion and failure analysis.
- Stateless applications
- Intelligent routing and load balancing
- Blue/green deployments and canary releases
§ Design Patterns:
- Create a stateless application architecture that can survive failures and scale horizontally as needed (hint: store state in a distributed, scalable NoSQL database).
- Create event-driven applications instead of traditional request/response applications where it is necessary to avoid upstream bottlenecks (hint: queues).
- Create a distributed, resilient application caching layer that can store frequently used data and survive failures.
- Implement throttling for certain upstream APIs to reduce downstream impact, e.g. to avoid flooding a legacy database or ERP system that is slow to respond.
- Implement the Circuit Breaker pattern for when the back end fails to respond and retries also fail.
- Microservice architectures
- Service mesh
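The Circuit Breaker pattern above can be sketched in a few lines. This is a simplified illustration, with arbitrary failure thresholds and reset timeout, not a production implementation:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures; after `reset_timeout`
    seconds, allow one trial call (half-open) to probe the back end."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

While the breaker is open, callers fail immediately instead of piling retries onto an already struggling back end, which protects both sides and shrinks the blast radius of a failure.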
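Throttling calls to a slow legacy system, as suggested above, is often done with a token bucket. A minimal sketch follows; the rate and capacity values are illustrative assumptions:

```python
import time

class TokenBucket:
    """Allow at most `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should back off or queue the request

# e.g. protect a legacy ERP endpoint that tolerates ~5 calls/second.
bucket = TokenBucket(rate=5, capacity=10)
```

Rejected calls can be queued or retried later, so a burst of traffic never floods the slow downstream system.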
Protecting Business Value & Brand Value through Excellence in Security, Reliability, and Cost Optimization
- Do our application and infrastructure changes pass through automated security and governance checks?
- Do all of our environments have the required compliance boundaries (whether for security or for resource cost), with fine-tuned security, without needing manual intervention and approvals every time a change is built?
- Are our application and infrastructure changes repeatable, free of configuration drift, and able to create identical immutable environments automatically that adhere to all policy requirements?
- Are our applications and infrastructure highly available, or able to fail over to a disaster recovery site with minimal effort?
§ Think:
- Infrastructure as Code, automated to create or tear down environments as part of a pipeline.
- Programmable, self-documenting, automated, testable, and repeatable playbooks that resurrect the application at the DR site, instead of pages and pages of manuals that require manual, error-prone work to stand it up.
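The idea of an executable, testable playbook can be illustrated with a minimal sketch. The step names and verification checks here are hypothetical placeholders for real provisioning and health-check calls:

```python
# Each runbook step pairs an action with a verification, so the playbook is
# both executable and testable -- unlike a manual document.
def runbook(steps):
    completed = []
    for name, action, verify in steps:
        action()
        if not verify():
            raise RuntimeError(f"DR step failed verification: {name}")
        completed.append(name)
    return completed

# Hypothetical DR steps; real versions would call IaC and platform APIs.
state = {}
steps = [
    ("provision-db", lambda: state.update(db="up"),   lambda: state.get("db") == "up"),
    ("restore-data", lambda: state.update(data="ok"), lambda: state.get("data") == "ok"),
    ("start-app",    lambda: state.update(app="up"),  lambda: state.get("app") == "up"),
]

print(runbook(steps))  # every step ran and verified, in order
```

Because each step verifies itself, the playbook can be rehearsed in a pipeline long before a real disaster, instead of being discovered broken during one.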
§ Design Patterns:
- A CI/CD pipeline that builds containers and includes automated security scanning.
- Scaling out or in as required to handle production and non-production workloads, avoiding cost overruns when there is no demand.
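The scale-out/scale-in rule above reduces to a small control loop. Here is a sketch with illustrative thresholds (real autoscalers add cooldowns and smoothing on top of this):

```python
import math

def desired_replicas(current: int, cpu_pct: int,
                     target_pct: int = 60, min_r: int = 1, max_r: int = 20) -> int:
    """Proportional scaling: keep average CPU near `target_pct` utilization,
    clamped between a minimum and maximum replica count."""
    if current == 0:
        return min_r
    wanted = math.ceil(current * cpu_pct / target_pct)
    return max(min_r, min(max_r, wanted))

print(desired_replicas(4, 90))  # -> 6: scale out under load
print(desired_replicas(4, 15))  # -> 1: scale in when idle, saving cost
```

The clamp bounds both the cost (upper limit) and the availability floor (lower limit), so the system tracks demand without running idle capacity.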
Conclusion
I hope you enjoyed this article and are able to apply it in your real-life projects. Please let me know your thoughts or advice in the comments so that we can learn from each other.
- Article originally published here on Medium.
Some references and further reading:
- The Phoenix Project
- The Unicorn Project
- Accelerate
- Google DORA State of the DevOps Report 2019
- Cloud Native Patterns
- Cloud Native DevOps with Kubernetes
- Architecting Cloud Native Applications
- Cloud Native Foundation Landscape
- 12 Factor App