Cloud Native and Software Architecture: Navigating a New Era
Co-authored with Tim Lüecke
Key takeaways
Software architecture has a decades-long history of established best practices and methods. However, with the advent of cloud computing in the last decade, many things in software development have changed, including, of course, software architecture. Most of us see the effects in our daily business, but to our knowledge there is no systematic summary of what actually changed conceptually with cloud native. Which best practices and outcomes are still valid, and which have changed substantially? In this article we want to shed some light on this by systematically looking into the changes to software architecture in the cloud native era.
Interestingly, there seems to be no established definition of what exactly cloud native architecture is. Based on the CNCF definition of cloud native, from our point of view it is software architecture for solutions that leverage cloud native technologies. Such architectures are well suited to fulfilling quality requirements such as scalability, maintainability (through loose coupling and modularity), resiliency and operability (through automation). They allow "to make high-impact changes frequently and predictably with minimal toil".
Impact of cloud native on software architecture
To better understand how cloud native architecture differs from a “classical” software architecture, we must first establish a reference point for what software architecture is. As is well known, there are dozens of definitions of software architecture. The iSAQB, an initiative for standardizing professional software architecture terms, bases its definition on the IEEE definition (see the overview of definitions in its glossary). Aligned with the iSAQB, arc42 is a broadly established standard for documenting software architecture. It is a lightweight structure with some advice and content directions, well compatible with an agile development approach, and consists of 12 sections.
The following table uses the arc42 structure as a guideline and provides our assessment of the impact of a cloud native approach on each architecture documentation area of arc42. Please note that the important findings here are not tied to arc42 itself. It simply serves as a framework for systematically reflecting on the changes implied by cloud native. A different standard or your own custom template for architecture documentation would serve the same purpose equally well[1].
As mentioned above, we see the major impact of cloud native architecture in how a cloud native solution is decomposed and how it is operated. In the following sections we put a spotlight on how cloud native affects this decomposition in different architectural views and on its implications for architectural trade-offs.
Decomposition in cloud native architecture
The building block view shows the inner structure of the system, broken down into building blocks (or components) and the dependencies between them. Each building block can in turn be decomposed into finer-grained building blocks, described in its own lower-level building block view. In our experience, it is best practice to further differentiate between a domain building block view and a technical building block view. This allows us to tackle the technical and the domain challenges independently of each other, following the well-known separation of concerns principle.
The domain building block view tries to handle the complexity of the business by decomposition into functional building blocks with dedicated, unique responsibilities, e.g. “payment”, “vouchers”, “customer data”, and so on. This view remains unchanged in cloud native architecture. It remains equally important and the way the functionality is cut into different building blocks follows the same well-established patterns and principles.
We do see a risk that, due to the focus on the new technical possibilities of the cloud, the work on the domain architecture is neglected. Often the focus is on high-level architecture views with shiny cloud-icon-based diagrams that demonstrate technical capabilities. However, this is not sufficient to guide the implementation of functionality, which still carries high complexity and requires many design decisions.
You might argue that with the cloud the domain building block view has to be split into even smaller components (aka microservices). But microservices are first of all a technical concept, which is not reflected 1:1 in the business architecture. Of course, the two are interrelated: if you need to distribute your technical implementation across several decoupled services, your domain architecture has to support this by defining corresponding components. Moreover, cloud does not necessarily mean microservices. But we assume that for a system with a “cloud architecture” you build some kind of distributed system, so you need distinct domain building blocks that you can distribute.
The technical building block view complements the domain building block view. It handles the following concerns:
In cloud environments the domain building blocks still need a technical representation, so the first concern still needs to be addressed. Looking at existing cloud architecture blueprints, e.g. from the hyperscalers, it might seem that these already define the technical architecture. But this is not entirely true. When it comes to bespoke software, you still need to design and describe how the system should work. You may adopt existing patterns for that, e.g. hexagonal architecture, event-driven architecture and so on. The cloud may give you a broader choice of technology, but the design work remains the same.
Many cross-cutting concerns, on the other hand, are already well taken care of by the cloud providers' service portfolios. Logging, for example, is usually achieved by writing logs to stdout; a collector belonging to the runtime service picks up those logs and presents them in a managed logging service. Similarly, for telemetry data and security features, cloud providers typically offer managed services for many cross-cutting concerns that integrate well with the runtime services in the cloud.
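To make the logging example concrete, here is a minimal sketch of an application that writes one JSON object per line to stdout and leaves collection to the platform. It uses only the Python standard library; the names JsonFormatter and build_logger, the logger name "payment" and the chosen JSON fields are our own assumptions, not a format required by any particular provider.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep output on the single configured stream
    return logger


if __name__ == "__main__":
    log = build_logger("payment")
    log.info("order %s accepted", 42)
```

The application knows nothing about where the logs end up. That decision stays with the platform's collector, which is what moves this concern out of the application code.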
Cloud specific vs. agnostic: the new trade-off
For these concerns, a new architectural trade-off space opens up between a cloud agnostic and a cloud specific architecture. The basic question is to what degree you want to depend on services provided by a specific cloud provider and thereby accept vendor lock-in. A cloud specific architecture deliberately makes heavy use of provider-specific services without any abstraction to decouple from them. This allows leveraging the full potential of the platform and avoids the overhead of abstraction layers. Examples of such an architecture would be using a cloud specific data store solely through its native SDK without any abstraction in the application code, or a FaaS architecture that takes no care to separate the logic from the technical function context.
In contrast, a cloud agnostic architecture introduces abstractions for all services of the cloud provider, allowing the software to be redeployed to another cloud with minimal effort.
This limits the services used to those that are standardized rather than vendor specific, or that behave the same on all major platforms. Most cloud agnostic systems use Kubernetes as the abstraction from the underlying platform, because it is available as a managed service on all major cloud platforms.
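In application code, such an abstraction can look like the following minimal sketch. The application depends only on a small storage interface, and a provider-specific adapter would implement that interface in one place. All names (DocumentStore, InMemoryStore, CustomerProfiles) are hypothetical. The in-memory adapter stands in for a real provider adapter, which would wrap the vendor SDK behind the same two methods.

```python
from typing import Protocol


class DocumentStore(Protocol):
    """The only storage interface the application code is allowed to see."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter; a provider adapter would wrap the vendor SDK here."""

    def __init__(self) -> None:
        self._items = {}  # maps key (str) to value (bytes)

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> bytes:
        return self._items[key]


class CustomerProfiles:
    """Application logic, written against the abstraction only."""

    def __init__(self, store: DocumentStore) -> None:
        self._store = store

    def save_email(self, customer_id: str, email: str) -> None:
        self._store.put(f"customers/{customer_id}/email", email.encode("utf-8"))

    def load_email(self, customer_id: str) -> str:
        return self._store.get(f"customers/{customer_id}/email").decode("utf-8")


if __name__ == "__main__":
    profiles = CustomerProfiles(InMemoryStore())
    profiles.save_email("c-1", "jane@example.com")
    print(profiles.load_email("c-1"))
```

The price of this portability is visible even in the sketch: the interface exposes only what every backend can offer, so provider-specific features stay out of reach unless the interface is widened.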
It should be noted that this is not an either/or decision. Rather, it opens a range of possible trade-off decisions, with the definitions above marking the two ends of the range. Combining both approaches offers huge potential for a best-of-breed solution for each aspect of the software. One common approach, for instance, is to apply a cloud agnostic architecture to the core application logic, while using cloud native solutions for specialized use cases such as data stream processing.
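One way to picture such a combination is core logic that is free of any platform types, wrapped by a thin function-style adapter. The sketch below is an illustration under assumptions: the handler(event, context) signature mimics a common FaaS convention rather than a specific provider's API, and VoucherRequest, apply_voucher and the TEN_OFF rule are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoucherRequest:
    code: str
    order_total_cents: int


def apply_voucher(request: VoucherRequest) -> int:
    """Core logic: knows nothing about events, contexts or SDKs."""
    if request.code == "TEN_OFF" and request.order_total_cents >= 1000:
        return request.order_total_cents - 1000
    return request.order_total_cents


def handler(event: dict, context: object) -> dict:
    """Thin platform adapter: translate the event, delegate, translate back."""
    request = VoucherRequest(
        code=event["code"],
        order_total_cents=int(event["order_total_cents"]),
    )
    return {"total_cents": apply_voucher(request)}


if __name__ == "__main__":
    print(handler({"code": "TEN_OFF", "order_total_cents": 2500}, None))
```

Moving to another platform then means rewriting only the handler, while apply_voucher and its tests stay untouched.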
In our company, we use a lightweight shortcut version of ATAM to support finding the right decision. We simply list the quality requirements on one dimension of a table and the architecture decisions on the other. We call it the ATAM matrix. The following ATAM matrix outlines the trade-off between cloud specific and cloud agnostic with respect to typical requirements. We do not claim that the matrix is complete or universally applicable. It serves only to highlight some of the trade-offs to be considered.
In summary, for the technical building block view a decision must be made on how cloud agnostic or cloud specific the solution should be. This needs to be done on a case-by-case basis for all cross-cutting concerns. For some concerns open-source standards may be available, so using a cloud service does not really endanger portability. In other cases, the equivalent of a cloud specific service might be very hard to provision on an agnostic platform in time, or very costly to operate and secure. The technical architecture should therefore keep a decision log for each of these options.
If services of the cloud provider are used, the technical building blocks become part of the technical infrastructure and hence shift from the technical building block view to the deployment view. The shift becomes more significant the more cloud specific services are used. Conversely, the more agnostic the cloud architecture is, the more relevant the technical architecture remains.
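As an illustration of the matrix idea, the sketch below holds requirements on one axis and decisions on the other and renders them as plain text. This is not our actual matrix. It encodes only the three trade-offs named in the text above (portability, use of platform capabilities, abstraction overhead), and the "+" and "-" ratings and all names are illustrative assumptions.

```python
DECISIONS = ("cloud specific", "cloud agnostic")

# "+" means the decision supports the requirement, "-" means it works against it.
MATRIX = {
    "portability": {"cloud specific": "-", "cloud agnostic": "+"},
    "use of platform capabilities": {"cloud specific": "+", "cloud agnostic": "-"},
    "low abstraction overhead": {"cloud specific": "+", "cloud agnostic": "-"},
}


def render(matrix: dict, decisions: tuple) -> str:
    """Render requirements as rows and architecture decisions as columns."""
    width = max(len(requirement) for requirement in matrix)
    lines = [" " * width + " | " + " | ".join(decisions)]
    for requirement, ratings in matrix.items():
        cells = " | ".join(ratings[d].center(len(d)) for d in decisions)
        lines.append(requirement.ljust(width) + " | " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(MATRIX, DECISIONS))
```

In practice the rows would come from the project's own quality requirements, and a cell often needs a short rationale rather than a single symbol.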
Examples of technical architectures in the cloud:
The new kid on the block: IaC architecture
The deployment view shows the technical infrastructure the system is being built on, i.e. the database it uses, the application server machines as well as operating systems, runtime environments and so on. It is growing, because more and more concerns are shifting to existing cloud services that are part of the infrastructure.
One striking difference of cloud native architecture is that the technical infrastructure is nowadays typically also implemented in code (Infrastructure as Code, IaC). In the pre-cloud era, only the other architecture views were really implemented.
Not only is the number of topics in the deployment view increasing; the frequency of changes also increases drastically. In traditional environments, dedicated teams took care of the infrastructure, and it took a long time until new servers and products were introduced into it. With a growing DevOps mindset across teams, the deployment view evolves as quickly as the rest of the software. This leads to the same problems in the deployment view as in the rest of the system: it can erode and should therefore be put under quality control and frequent review.
IaC is one tool to put the infrastructure under version control, which allows changes to be reviewed, quality goals to be defined, and an accurate deployment view to be generated from it. But IaC also introduces a new area that architecture must take care of. The deployment view therefore becomes much more important for ensuring maintainability and other quality requirements.
Like the traditional code of an application, IaC needs versioning, testing and a maintainable structure (micro architecture):
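As one hedged illustration of what testing and a maintainable structure can mean for IaC, the sketch below models an infrastructure module as a plain function with explicit input and output types, so that dependencies between modules are formalized and the module can be unit tested. It is not tied to any real IaC tool: the resource dictionary, the type "managed_database" and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DatabaseInputs:
    """Interface of the module: everything it needs from outside."""
    environment: str
    storage_gb: int


@dataclass(frozen=True)
class DatabaseOutputs:
    """Interface of the module: everything other modules may depend on."""
    resource_name: str
    connection_secret: str


def database_module(inputs: DatabaseInputs) -> Tuple[Dict[str, object], DatabaseOutputs]:
    """Return a declarative resource description plus the module's outputs."""
    if inputs.storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    name = f"orders-db-{inputs.environment}"
    resource = {
        "type": "managed_database",  # hypothetical resource type
        "name": name,
        "storage_gb": inputs.storage_gb,
        "backups_enabled": inputs.environment == "prod",
    }
    return resource, DatabaseOutputs(name, f"{name}-credentials")


if __name__ == "__main__":
    resource, outputs = database_module(DatabaseInputs("prod", 50))
    # A quality rule that can be checked in review or in CI.
    assert resource["backups_enabled"], "production databases must have backups"
    print(resource, outputs)
```

Because the module is an ordinary function, rules such as "production databases must have backups" can run as tests on every change, before anything is provisioned.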
Another major impact on the deployment view and the corresponding architecture work is cost management. With cloud computing and its consumption-based pricing model, the costs of architecture decisions become immediately transparent and need to be considered as well as monitored. Sizing was always part of the deployment view. In the past, however, it was often based on a worst-case scenario and solved by provisioning bigger infrastructure to buffer usage bursts. For cloud native deployments, scaling and adaptation must be much more automated and require much more fine-grained decision making. This can in turn affect the technical architecture, when costs influence the choice of cloud specific services for cross-cutting technical services.
The changes in the deployment view reflect the impact on the involved teams with regard to DevOps. Operations can no longer work on its own, but must move closer to development. Infrastructure is no longer managed purely by governance or manual activities, but becomes software defined. This points to one of the challenges currently seen with emerging DevOps teams, where people with an operational focus need software engineering techniques to implement infrastructure management via code.
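The difference between worst-case sizing and consumption-based scaling can be shown with a small calculation. The sketch below compares the cost of provisioning for the peak for every hour with the cost of scaling the instance count to the load in each hour. The load profile, the capacity per instance and the price are made-up numbers, and the model ignores scaling delays, minimum billing periods and reserved-capacity discounts.

```python
import math
from typing import Sequence


def instances_needed(load: float, capacity_per_instance: float) -> int:
    """Smallest instance count that covers the load (at least one)."""
    return max(1, math.ceil(load / capacity_per_instance))


def static_cost(hourly_load: Sequence[float], capacity: float, price_cents: int) -> int:
    """Worst-case sizing: pay for the peak instance count in every hour."""
    peak = instances_needed(max(hourly_load), capacity)
    return peak * price_cents * len(hourly_load)


def autoscaled_cost(hourly_load: Sequence[float], capacity: float, price_cents: int) -> int:
    """Consumption-based sizing: pay only for what each hour requires."""
    return sum(instances_needed(load, capacity) * price_cents for load in hourly_load)


if __name__ == "__main__":
    load = [100, 100, 900, 300]  # requests per second in four sample hours
    print(static_cost(load, capacity=100, price_cents=50))      # 1800
    print(autoscaled_cost(load, capacity=100, price_cents=50))  # 700
```

With these assumed numbers the burst in the third hour drives the static cost for all four hours, while the autoscaled variant pays for nine instances only in that single hour.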
Conclusion
Cloud native does not fundamentally change the way architecture is designed. We see little to no change in the building block view when it comes to business aspects. But in the technical and infrastructure architecture we see larger changes. While the architecture tasks of the pre-cloud era remain the same, the cloud of course introduces a plethora of new options to be considered, multiplying the number of trade-off decisions to be made. So the world will become more complex for architects. In this article we tried to give orientation by providing a frame of reference for making these decisions consciously and by raising awareness of what actually changed. We see the biggest change in the rise of IaC architecture and the need to identify patterns and methods for managing it.
[1] For instance, the C4 model by Simon Brown could be an alternative and is even compatible with the core architectural views of arc42.
[2] Please note that declarative approaches to IaC can pose challenges for establishing a maintainable structure. Those approaches often lack clear interface constructs, which would be required to formalize dependencies between modules.
I'm always a little confused by some of the terms and how they are used. One camp uses "cloud native" to mean the use of cross-cutting (software) components that were developed specifically for distributed, scalable computing: containers, distributed XYZ, etc., plus the design and architecture patterns that build specifically on them. And then I sometimes read posts that interpret "cloud native" as the use of (cloud) vendor-specific services, e.g. Kinesis, Cosmos DB, etc. The counterpart to this interpretation is then "cloud agnostic" in the sense of vendor-neutral / portable. We ourselves take "cloud native" to mean a focus on CNCF-hosted technologies and projects AND "cloud agnostic" in the sense of no vendor lock-in ;)