Relevance of performance engineering in a cloud environment

The art of performance engineering

I have been fortunate in my career to work as a web performance engineer/architect for more than a decade. From early on I was deeply fascinated by computers, hardware, software and distributed systems architectures. Along the way, as a Java geek, I took on performance engineering as a side gig. Soon I loved the job so much that it became an integral part of what I do every day: making websites faster. There is never enough learning, and every day brings a different problem to solve, which makes it so interesting.

Performance engineering is a systematic, quantitative approach to the cost-effective development of software systems that meet stringent non-functional requirements (performance, capacity, scalability, availability, reliability, and so on). It is a software-oriented approach, focused on the optimal selection of application architecture, design, and implementation choices with the objective of meeting those requirements. Software performance engineering can also be defined functionally as the set of tasks or activities that need to be performed across the software development life cycle (SDLC) to meet the documented non-functional requirements. It is often viewed as the art of building systems that meet non-functional requirements within the allocated time frame and budget constraints. This engineering can be applied to systems either in the cloud or on-premise.

Application performance improvements can dramatically impact an organization's bottom line. A crash of even a few minutes can cause a loss of thousands or millions of dollars and break customer trust in the website, while finding the source of the error in an increasingly complex system takes time as well. This means user experience and the managed performance of your application must be incorporated throughout the app's lifecycle, not just when it is first launched.

Cloud washing performance engineering

Visualizing complex cloud systems

With companies adopting public cloud infrastructure (AWS, Azure, GCP) and moving application workloads from on-premise to the cloud, there is a tendency to think that performance engineering and capacity optimization are unnecessary in the cloud, since we have near-infinite capacity to scale in/out and system components are automatically fault tolerant. In reality, performance engineering is even more important and complex on cloud infrastructure, because we are dealing with many unknowns that companies do not control. World-class companies (pictured above) that adopt micro-services as a core philosophy have very strong performance engineering teams and supporting processes to identify and mitigate performance issues before they impact customers. Using chaos engineering and SRE principles to minimize customer impact and gracefully degrade the experience has been central to that philosophy.

Cloud technology has started producing new challenges due to the large scale of systems and the much larger volume of data these systems generate. Real-time analytics is in high demand and expected to grow in the coming years, which creates the challenge of analyzing millions of events per second under real-time constraints. In the near future, real-time analytics can emerge as excellent support for performance monitoring, and it will be another rich area of research. Real-time analytics under tight time constraints is expected to improve performance management of cloud-based systems.
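To make the real-time constraint concrete, here is a minimal sketch of one building block such analytics pipelines rely on: a sliding-window event-rate counter that answers "how many events per second are we seeing right now?" in constant time per query. The class name and window size are illustrative, not from any particular product.

```python
from collections import deque
import time


class SlidingWindowRate:
    """Track events per second over a fixed trailing window (toy real-time metric)."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # timestamps of observed events, oldest first

    def record(self, ts=None):
        """Record one event at timestamp ts (defaults to now)."""
        ts = time.monotonic() if ts is None else ts
        self.events.append(ts)
        self._evict(ts)

    def rate(self, now=None):
        """Events per second over the trailing window ending at `now`."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.events) / self.window

    def _evict(self, now):
        # Drop timestamps that have fallen out of the trailing window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
```

A production system would shard this per metric and per host and feed the rates into dashboards and alerting, but the core idea of bounded, incremental state is the same.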

How to build effective performance load testing models

Types of load testing profiles

A performance engineer should understand the application at its core: the capabilities it offers, its intended use, and the conditions in which the application is supposed to operate. The engineer should also know the limitations of the application under different operating conditions.

  1. List out the common factors that affect the performance of the application and consider these parameters while testing.
  2. Have a deep understanding of the environment under test and its capacity compared to production environments, so you can scale the traffic models up or down accordingly.
  3. Set realistic, measurable non-functional SLAs, or in SRE terms SLIs and SLOs, so you can validate metrics during testing.
  4. Design different load profiles, such as stress, endurance and breakpoint tests, so you can test and document the application's behavior.
  5. Consider organic traffic (users) as well as non-organic traffic (bots, anonymous requests) when modeling tests. One might be tempted to reserve server capacity only for genuine users, but in reality roughly 20% of e-commerce traffic is bots (good and bad).
  6. Most popular e-commerce websites use content delivery networks like Akamai, Cloudflare etc. for caching static and dynamic content. These edge servers offload traffic from origin to edge and move server capacity to edge regions. If the cache needs to be purged during busy times of the day, the application needs to scale elastically to handle the traffic. Traffic models should therefore cover both cached and uncached experiences.
  7. Understand traffic patterns and promotional events. Find out the times of day when traffic peaks and how high those peaks are. Consider YoY traffic growth, marketing demands and SEO patterns so you can predict traffic patterns.
  8. Consider the percentage of registered users vs guests, different fulfillment methods, different payment methods, the percentage of browse vs purchase traffic, and cached vs uncached content on the website to build effective traffic models.
  9. Consider customers' geo-location when modeling traffic patterns so you can validate network, ISP and DNS chokepoints or latencies in the critical path. When replaying traffic, spread it across several regions just like real user traffic.
  10. In a micro-service or micro front-end ecosystem there are a lot of interconnected services and dependencies, so ensure all of them are covered in the traffic model.
  11. Use the right tools, such as RUM (real user measurement) and a good APM (application performance monitoring) product, for forecasting website traffic patterns.
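Several of the factors above (YoY growth, bot share, CDN offload, headroom for cache purges) can be folded into a simple back-of-the-envelope sizing formula for the peak load a test should reproduce at origin. The function below is a sketch; the default percentages are illustrative assumptions, not measured values, and real models should use your own RUM/APM data.

```python
def target_peak_rps(last_year_peak_rps, yoy_growth=0.20, bot_share=0.20,
                    cache_offload=0.70, safety_factor=1.25):
    """Estimate the origin peak RPS a load test should reproduce.

    last_year_peak_rps -- measured human peak traffic last year (edge RPS)
    yoy_growth         -- expected year-over-year traffic growth (fraction)
    bot_share          -- fraction of total traffic that is bots
    cache_offload      -- fraction of requests served from the CDN edge
    safety_factor      -- headroom for cache purges and marketing spikes

    All defaults are illustrative placeholders.
    """
    projected_edge = last_year_peak_rps * (1 + yoy_growth)
    total_with_bots = projected_edge / (1 - bot_share)  # bots ride on top of human traffic
    origin = total_with_bots * (1 - cache_offload)      # only cache misses reach origin
    return origin * safety_factor
```

For example, a 1,000 RPS human peak last year with the defaults above works out to a target of about 560 RPS at origin: growth and bots push edge traffic up, CDN offload pulls origin traffic down, and the safety factor covers the uncached (purged) scenario from point 6.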

Difference between performance testing & engineering

Performance Engineering Phases

Performance testing entails certain processes and steps to determine faults, but performance engineering observes the entire system to identify where and how different pieces can be optimized. While performance testing is usually done after the website/application is developed, or in the production environment, performance engineering is deep-rooted in the software lifecycle to make sure the system is built to high standards and optimized for performance from the early phases. Performance testing uncovers bugs and bottlenecks and provides analysis reports to developers for resolution. Performance engineering takes performance concerns to the next level by helping developers meet business requirements and industry standards for speed, scalability, and sustainability.

Do we really have infinite capacity in the cloud?

This is definitely a myth: the cloud does not have infinite capacity. Public cloud providers like AWS, Azure and GCP have reserved huge volumes of compute, storage and network resources that may be shared across many enterprise customers, but they are still finite. Around peak holidays, some customers choose to reserve or purchase dedicated capacity so that they don't run into capacity issues. This is the proactive approach; other customers choose to elastically scale compute capacity based on traffic demand.

Enterprises choose a subscription model (pay as you use) where they pay cloud providers only for the resources used. This reduces the initial capital expense needed to set up infrastructure. The performance engineering team plays a critical part by optimizing workloads to run on minimal infrastructure, thereby saving cost. Every provider publishes its cost structure for all resources, and customers can pick the options that best meet their business needs.
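The reserve-versus-pay-as-you-go decision from the previous two paragraphs is at heart a break-even calculation. The sketch below compares the two options for a single instance over a month; the rates are illustrative placeholders, not real provider prices, and real comparisons should use the provider's published pricing.

```python
def cheaper_option(hours_used, on_demand_hourly_rate, reserved_monthly_fee):
    """Compare pay-as-you-go vs reserved capacity for one instance for one month.

    Returns (option_name, monthly_cost) for the cheaper choice.
    Rates here are illustrative placeholders, not real cloud prices.
    """
    on_demand_cost = hours_used * on_demand_hourly_rate
    if reserved_monthly_fee < on_demand_cost:
        return ("reserved", reserved_monthly_fee)
    return ("on-demand", on_demand_cost)
```

The intuition matches the article: a workload that runs near-continuously (e.g. 730 hours a month) favors reserved capacity, while a bursty workload that only runs during traffic peaks favors elastic pay-as-you-go, which is exactly where performance engineering's workload optimization pays off.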

Evolution of performance engineering: AIOps

When digital transformation outpaces IT performance management and hybrid infrastructure adds complexity, it comes with a hefty price tag. That's where AIOps (artificial intelligence for IT operations) comes in. AIOps, a term first coined by Gartner, is the application of advanced analytics, in the form of machine learning (ML) and artificial intelligence (AI), toward automating operations so that Ops/SRE teams can move at the speed the business expects today.

AIOps marries big data with ML to create predictive outcomes that help drive faster root-cause analysis (RCA) and reduce mean time to repair (MTTR). By providing intelligent, actionable insights that drive a higher level of automation and collaboration, teams can continuously improve, saving the organization time and resources in the process.

AIOps builds real-time systems in the form of context-rich data lakes that traverse the full application stack, in order to reduce noise in modern performance and fault management systems and drive automation, with the ultimate goal of improving time to resolution.
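A tiny taste of what "ML-driven noise reduction" means in practice: instead of alerting on a fixed threshold, flag metric points that deviate sharply from their own recent history. The rolling z-score below is a deliberately simple stand-in for the far more sophisticated models AIOps platforms use; the function name, window, and threshold are illustrative assumptions.

```python
from statistics import mean, stdev


def anomalies(series, window=10, threshold=3.0):
    """Return indices of points whose z-score vs the trailing window exceeds threshold.

    A toy stand-in for the ML-based anomaly detection AIOps platforms perform;
    window and threshold values here are illustrative.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]          # the trailing window
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

Fed with, say, per-minute latency readings hovering near 100 ms, a sudden 500 ms reading is flagged while normal jitter is not, which is the noise-reduction behavior that lets on-call engineers focus RCA on genuine faults.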
