Performance & Reliability Engineering - Monolithic vs Microservices Applications
Back in the early 2000s, when Java-based n-tier (usually n <= 5) monolithic applications ruled the world, Performance & Reliability Engineering (PRE) was the holy grail of system architects. As a system engineer in a large bank and later a reliability test lead for Java-based enterprise applications, I had my fair share of playing the black magician, identifying and fixing performance and reliability issues.
The art of identifying performance bottlenecks (out-of-memory errors, slow response times) and reliability issues such as slow memory leaks, database query response time degradation, or operating system file cache fill-up required not only a good understanding of the underlying technologies but also hands-on experience with tools like LoadRunner, Rational Performance Tester (RPT), and JMeter.
Test automation in any of these tools required careful design of test scenarios that simulated the real-life workload, along with a test dataset that reflected enterprise scale and could be reused and re-created on the fly.
For that reason, Performance & Reliability Testing (PRT) was called Performance & Reliability Engineering (PRE) to distinguish it from the regular testing approach. Terms like TPS (transactions per second), concurrent users, test duration, and test data were not only the subject of hot debates, particularly when discussing performance issues with development teams, but were also critical in creating the right scenarios for reproducing real-life issues.
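To make the relationship between these terms concrete, a back-of-the-envelope sizing based on Little's Law (concurrent users ≈ TPS × (response time + think time)) is often the starting point of the debate. The sketch below uses purely illustrative numbers to show how a target TPS and think time translate into the number of virtual users a test needs.

```java
// Illustrative sizing using Little's Law: concurrent users ≈ throughput × (response time + think time).
// All numbers here are hypothetical, purely to show how the terms relate to each other.
public class LoadSizing {
    public static void main(String[] args) {
        double targetTps = 200.0;      // target transactions per second
        double avgResponseSec = 0.8;   // expected average response time per transaction
        double thinkTimeSec = 4.2;     // simulated user think time between transactions
        double virtualUsers = targetTps * (avgResponseSec + thinkTimeSec);
        System.out.printf("Virtual users needed for %.0f TPS: %.0f%n", targetTps, virtualUsers);
    }
}
```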
Over the years, though, PRT of Java-based monolithic applications became much simpler: advances in tools like RPT, LoadRunner, and JMeter made it easier to write automated tests that closely reflect the real-life workload.
Objectives of PRE
Performance Engineering
The purpose of performance engineering is to identify performance issues across the whole application stack. For this purpose, tests are constructed to simulate real-life load in the test environment. Performance tests can be broadly categorized into two kinds.
Benchmark Test
A performance benchmark is conducted against the minimal recommended platform for the application. This establishes the baseline concurrency and TPS for the application.
Load Test
A load test is usually a 1-4 hour test that applies the expected peak load to the application on a given infrastructure.
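As a rough illustration of what a load test driver does underneath, here is a minimal closed-loop sketch in plain Java. The endpoint, user count, and duration are assumptions, and a real test would of course use LoadRunner, RPT, JMeter, or Gatling with proper pacing, think times, and reporting.

```java
// Minimal closed-loop load generator sketch: N virtual users looping for a fixed duration.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class SimpleLoadTest {
    public static void main(String[] args) throws Exception {
        int virtualUsers = 50;                                      // assumed concurrency
        Duration testDuration = Duration.ofMinutes(60);             // 1-hour load test
        URI target = URI.create("http://app.example.com/login");    // hypothetical endpoint
        HttpClient client = HttpClient.newHttpClient();
        LongAdder requests = new LongAdder();
        LongAdder errors = new LongAdder();
        Instant end = Instant.now().plus(testDuration);

        ExecutorService pool = Executors.newFixedThreadPool(virtualUsers);
        for (int u = 0; u < virtualUsers; u++) {
            pool.submit(() -> {
                HttpRequest req = HttpRequest.newBuilder(target).GET().build();
                while (Instant.now().isBefore(end)) {
                    try {
                        HttpResponse<Void> resp = client.send(req, HttpResponse.BodyHandlers.discarding());
                        if (resp.statusCode() >= 400) errors.increment();
                    } catch (Exception e) {
                        errors.increment();
                    }
                    requests.increment();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(testDuration.toMinutes() + 5, TimeUnit.MINUTES);
        double tps = requests.sum() / (double) testDuration.toSeconds();
        System.out.printf("Requests: %d, errors: %d, approx TPS: %.1f%n", requests.sum(), errors.sum(), tps);
    }
}
```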
The ultimate objective of performance engineering is to recommend the right underlying infrastructure, and the tuning for it, that will result in optimal performance of the application. To achieve this, the performance engineer must work with the development team to fix any code-related performance issues identified during the tests.
Reliability Engineering
Reliability engineering identifies reliability issues (those that arise when the application runs over very long periods, usually weeks, months, or years) and assesses the overall availability of the application over a given period. Terms like mean time between failures (MTBF) and mean time to failure (MTTF) become hotly debated in this context.
Reliability testing becomes very challenging when it comes to simulating a long real-life period, due to time and resource constraints. In my six years as a reliability engineer, only once could I compress a 12-month real-life workload into a simulated workload run over a 30-day period. Long before we hit a real application failure, we would typically hit an infrastructure failure with the potential to nullify the whole test effort.
Reliability engineering is especially important for Java applications running inside web application servers like IBM WebSphere, JBoss, Tomcat, and WebLogic. One of the most common reliability issues is a slow memory leak that, over a long period, results in an out-of-memory (OOM) error in the application server. Similarly, RDBMS-related reliability issues arise when database queries degrade in performance over time due to inappropriate indexing and database tuning, and cache servers and distributed data structure servers (like Hazelcast) are also prone to reliability issues.
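As an illustration of the kind of defect a long-running test is designed to surface, the following hedged sketch shows a classic slow-leak pattern: entries are added to an unbounded static collection on every request but never evicted, so heap usage creeps up for days until the server eventually throws OutOfMemoryError. The class and field names are hypothetical.

```java
// Classic slow-leak pattern often found during soak tests: entries are added to a
// static, unbounded map on every request but never removed.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class SessionAuditCache {
    // Unbounded static map: nothing ever evicts old entries, so retained heap grows forever.
    private static final Map<String, byte[]> AUDIT_TRAIL = new ConcurrentHashMap<>();

    public void onRequest(String sessionId) {
        // Each request adds ~16 KB that is retained for the lifetime of the JVM.
        AUDIT_TRAIL.put(sessionId + ":" + UUID.randomUUID(), new byte[16 * 1024]);
    }
    // A fix would bound the map, e.g. an LRU cache with a maximum size or TTL-based eviction.
}
```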
Reproducing these issues requires intelligent scenario design and efficient automation that allows the required dataset to be re-created. A typical soak (long-running) test runs for 72 to 120 hours under sustained load with the right amount of enterprise-scale data. The virtual users and the test scenario for a reliability test should cover both regular use and intermittent peak use of the application.
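The load profile sketched below, with illustrative numbers only, captures that idea: a steady baseline of virtual users with an intermittent peak every few hours across a 120-hour soak test.

```java
// Sketch of a soak-test load profile: a steady baseline of virtual users with an
// intermittent peak at the end of every 6-hour block. Numbers are illustrative.
public class SoakLoadProfile {
    static final int BASELINE_USERS = 200;
    static final int PEAK_USERS = 800;
    static final int PEAK_EVERY_HOURS = 6;   // one peak window per 6-hour block
    static final int PEAK_LENGTH_HOURS = 1;  // each peak lasts 1 hour

    /** Target concurrent virtual users for a given hour into the soak test. */
    static int targetUsers(int hourIntoTest) {
        boolean inPeak = (hourIntoTest % PEAK_EVERY_HOURS) >= PEAK_EVERY_HOURS - PEAK_LENGTH_HOURS;
        return inPeak ? PEAK_USERS : BASELINE_USERS;
    }

    public static void main(String[] args) {
        for (int hour = 0; hour < 120; hour++) {   // 120-hour soak test
            System.out.printf("hour %3d -> %d users%n", hour, targetUsers(hour));
        }
    }
}
```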
Approach for PRE (Java Monolithic Application)
Let me take the example of a simple Java-based 4-tier application and outline how we can approach PRE for it.
Test Automation
A safe assumption for a Java-based monolithic application is that we need to simulate the HTTP calls initiated from the browser during use of the application. Performance testing tools such as LoadRunner, RPT, JMeter, and Gatling all provide a mechanism to record the browser's HTTP calls via a proxy.
However, the recorded scripts require correlation, data pooling, and logic to simulate the real-life workload. Good test automators need to be good programmers, as developing a test scenario that closely simulates the real-life workload requires logic and control flow. JMeter and RPT require good knowledge of Java, LoadRunner of C/C++, and Gatling of Scala.
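To show what correlation and data pooling mean in practice, here is a hedged sketch in plain Java: a session token is extracted from a login response and reused in the next request, and users are drawn from a small data pool. The endpoints, parameter names, and token format are assumptions, not those of any particular tool.

```java
// Correlation sketch: capture a dynamic value (here, a session token) from one
// response and feed it into subsequent requests, plus a simple data pool of users.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CorrelatedScenario {
    private static final List<String> USER_POOL = List.of("user01", "user02", "user03"); // data pool

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String user = USER_POOL.get(0);

        // Step 1: login and correlate the token out of the response body.
        HttpRequest login = HttpRequest.newBuilder(URI.create("http://app.example.com/login"))
                .POST(HttpRequest.BodyPublishers.ofString("username=" + user + "&password=secret"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .build();
        HttpResponse<String> loginResp = client.send(login, HttpResponse.BodyHandlers.ofString());
        Matcher m = Pattern.compile("\"token\"\\s*:\\s*\"([^\"]+)\"").matcher(loginResp.body());
        if (!m.find()) throw new IllegalStateException("token not found; correlation failed");
        String token = m.group(1);

        // Step 2: reuse the correlated token in the next business transaction.
        HttpRequest viewAccount = HttpRequest.newBuilder(URI.create("http://app.example.com/account/summary"))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        HttpResponse<String> summary = client.send(viewAccount, HttpResponse.BodyHandlers.ofString());
        System.out.println("Account summary status: " + summary.statusCode());
    }
}
```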
Monitoring
Monitoring the underlying infrastructure elements (both software and hardware) during the execution of performance and reliability tests is an extremely important, and tough, task.
As the amount of data and logs generated during test execution is enormous, the monitoring infrastructure for the test environment often becomes a bottleneck in itself.
Everything from small tools like nmon, SNMP, and JMX to full-scale enterprise monitoring applications like Tivoli and AppDynamics is used to collect and correlate metrics and logs. Open source tools like Nagios, Zabbix, and several others are also very useful.
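At the JMX end of that spectrum, the data being collected can be as simple as periodic heap samples. A minimal sketch using only the standard MemoryMXBean is shown below; in a real test the samples would be shipped to the central monitoring stack rather than printed.

```java
// Minimal JMX-based sampler: periodically record heap usage of the local JVM as CSV.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSampler {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println("timestamp_ms,heap_used_mb,heap_committed_mb");
        while (true) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("%d,%d,%d%n",
                    System.currentTimeMillis(),
                    heap.getUsed() / (1024 * 1024),
                    heap.getCommitted() / (1024 * 1024));
            Thread.sleep(10_000);  // sample every 10 seconds
        }
    }
}
```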
Test Dataset
Having an enterprise-scale dataset for reliability tests is vital. However, the effort required to acquire and/or create such a dataset is the greatest challenge in the reliability engineering approach. In my career I have used several approaches, from obfuscating real customer data to writing scripts that create random but realistic data. Developing backup and restore scripts for the data, and then ensuring its health and sanity, are some of the real challenges.
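A hedged sketch of the second approach, generating random but realistic data, is shown below. The field names and value ranges are hypothetical; the fixed random seed is what makes it possible to re-create the same dataset on the fly.

```java
// Sketch of a "random but realistic" data generator: emits CSV rows of synthetic
// customers that can be bulk-loaded before a reliability test.
import java.util.Random;

public class CustomerDataGenerator {
    private static final String[] FIRST = {"Asif", "Maria", "Chen", "Olga", "David"};
    private static final String[] LAST  = {"Khan", "Garcia", "Li", "Petrova", "Smith"};

    public static void main(String[] args) {
        int rows = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
        Random rnd = new Random(42);  // fixed seed so the same dataset can be re-created on demand
        System.out.println("customer_id,first_name,last_name,account_balance,postcode");
        for (int i = 0; i < rows; i++) {
            System.out.printf("%d,%s,%s,%.2f,%05d%n",
                    1_000_000L + i,
                    FIRST[rnd.nextInt(FIRST.length)],
                    LAST[rnd.nextInt(LAST.length)],
                    rnd.nextDouble() * 50_000,
                    rnd.nextInt(100_000));
        }
    }
}
```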
Data & Result Analysis
The real art of PRE is performing robust analysis of the results and the data collected by monitoring the system under test. This requires an understanding not only of the infrastructure pieces, such as the RDBMS, the JVM, the network, load balancers, web servers, and web application servers, but also of the application under test.
Though both performance and reliability testing are considered black-box testing, a good understanding of the application architecture, such as the Spring framework, cache engines (Hazelcast etc.), the ORM layer, and distributed data structures, is vital in identifying performance and reliability issues.
In the past decade and a half, performance testing of Java-based monolithic applications has evolved and become much more standardized, if not simpler.
Enter the Microservices
PRE of microservices-based enterprise applications now faces the same challenges that Java-based monolithic applications faced in the early 2000s.
To get an idea of the scale of the challenge, think of a microservices-based enterprise application as an n-tier application where n > 100. Now apply the above approach to testing this application and you will realize how gigantic the challenge is.
In the remainder of this article I discuss the approach and tools required for PRE of a microservices-based enterprise application.
Let us consider a modern single page application (SPA) built on:
1. AngularJS
2. Spring Boot
3. Kafka
4. Cassandra
5. MySQL
6. ActiveMQ
7. Apigee
8. ZooKeeper
9. Consul
And infrastructure:
1. Cloud (AWS, GCP, or an on-premise cloud, e.g. OpenStack)
2. Docker
3. Tomcat
4. Nginx
5. Load Balancers
Let's say the application consists of 100+ microservices, meshed in a very complex topology with inter-dependencies and a service discovery mechanism.
Given the inherently asynchronous nature of the architecture and its support for resiliency and high availability, the services are not required to be running all the time during the test. The infrastructure elements and the services will come up and go down during the test, and performance and reliability need to be assessed under these conditions.
The good news is that since we are running the application in the cloud, despite the large number of infrastructure elements we are not expected to uncover infrastructure-related performance issues, unless we stumble upon a cloud performance issue.
Approach for PRE (Microservices based enterprise application)
So let us see how we can approach the PRE for this application.
It is important to remind ourselves that our objectives for performance testing are still the same: to identify performance issues in the code and the infrastructure under a real-life peak load, and to suggest fixes and tuning for those issues.
Monitoring
The good news is that monitoring of a microservices-based application is not only important in the context of PRE but essential for operations. Everything from the ELK monitoring stack to commercial products like AppDynamics and CA Wily provides excellent monitoring and data analysis capabilities for tracking the performance and availability of the microservices.
If the monitoring stack runs on the same cloud infrastructure, it is important to make sure the monitoring infrastructure does not add extra performance overhead. For the example in this chapter I will use the ELK stack to monitor the infrastructure and the microservices during performance and reliability test execution.
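For the ELK pipeline to stay cheap, the services should emit structured, one-line-JSON logs that Filebeat/Logstash can ship without heavy parsing. The sketch below shows the idea; the field names are assumptions rather than a fixed schema, and a real service would normally use its logging framework's JSON encoder instead of hand-built strings.

```java
// Minimal sketch of structured, one-line-JSON request logging that an ELK pipeline
// can index without extra parsing. Field names here are assumptions, not a standard.
import java.time.Instant;

public class RequestLogger {
    static void logRequest(String service, String endpoint, int status, long durationMs) {
        // Keep keys consistent across services so Kibana dashboards can aggregate them.
        System.out.printf(
            "{\"@timestamp\":\"%s\",\"service\":\"%s\",\"endpoint\":\"%s\",\"status\":%d,\"duration_ms\":%d}%n",
            Instant.now(), service, endpoint, status, durationMs);
    }

    public static void main(String[] args) {
        logRequest("account-service", "/account/summary", 200, 87);
        logRequest("payment-service", "/payments", 500, 1342);
    }
}
```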
Test Automation
The test automation approach for performance and reliability testing of a microservices-based application is completely different from that for monolithic applications. As the microservices expose RESTful APIs, performance benchmarking of those API calls is usually trivial. Since modern cloud-based enterprise applications are developed using an agile/lean approach, testing the microservices as part of the CI/CD pipeline is vital for a two-week delivery cycle.
In this chapter, I will discuss a Cucumber/Ruby-based BDD performance benchmarking framework for testing the API calls.
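That framework is Ruby-based; purely as an illustration of the same idea, and to stay consistent with the Java sketches in this chapter, here is a hedged cucumber-jvm equivalent. The step wording, endpoint, and thresholds are placeholders, not part of the actual framework.

```java
// Hedged sketch of a BDD performance benchmark step in cucumber-jvm (Java).
// Illustrative feature steps:
//   When I call "http://app.example.com/account/summary" 100 times
//   Then the 95th percentile response time should be below 300 ms
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ApiBenchmarkSteps {
    private final HttpClient client = HttpClient.newHttpClient();
    private final List<Long> latenciesMs = new ArrayList<>();

    @When("I call {string} {int} times")
    public void iCallTheEndpoint(String url, int times) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        for (int i = 0; i < times; i++) {
            long start = System.nanoTime();
            client.send(req, HttpResponse.BodyHandlers.discarding());
            latenciesMs.add((System.nanoTime() - start) / 1_000_000);
        }
    }

    @Then("the 95th percentile response time should be below {int} ms")
    public void p95ShouldBeBelow(int thresholdMs) {
        Collections.sort(latenciesMs);
        long p95 = latenciesMs.get((int) Math.ceil(latenciesMs.size() * 0.95) - 1);
        if (p95 > thresholdMs) {
            throw new AssertionError("p95 " + p95 + " ms exceeds threshold " + thresholdMs + " ms");
        }
    }
}
```

Because such a scenario fails the build when the percentile threshold is breached, it slots naturally into the nightly CI/CD pipeline mentioned above.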
For the end-to-end performance test of the whole enterprise application, the test automation approach is based on the same principles as for a monolithic application. In a monolithic application, however, the user usually interacts with the application via web pages, and the round-trip time for a whole page is a good estimate of application performance. One of the hotly debated topics there was the acceptable page response time: anything from a 2-second page load to several minutes was considered acceptable, depending on the nature of the user action.
In the case of a microservices-based SPA, the notion of page load time no longer applies, as the page is inherently responsive and usually always visible to the user. It is the actual transaction or user-action response time that matters and needs to be measured.
Test scenarios for a microservices-based application can be designed on a per-business-function basis. The end-to-end performance tests can be divided according to business areas and should be run nightly as part of the CI process. It is important to note that the traditional approach of recording test scripts via the browser, using tools like LoadRunner, RPT, JMeter, or Gatling, is still applicable and recommended. But since the recording essentially results in RESTful API calls, the existing automation for those calls can easily be reused.
In a modern SPA, quite a bit of business logic is implemented in the JavaScript-based front end that executes in the browser, so it is important to test the performance of the front-end code as well. For this, Selenium-based end-to-end tests for performance benchmarking should be executed while the system is under load. In a subsequent chapter I will discuss how to run the UI end-to-end test as part of the performance test while the system is under load from the API stress test.
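A minimal sketch of such a timed check is shown below: it measures a single user action (click to visible result) with Selenium while the API stress test keeps the system under load. The URL and element locators are hypothetical, and a Selenium 4 client is assumed.

```java
// Hedged sketch: measure a user-action response time (not page load time) in an SPA
// with Selenium, while the system is under API load. Locators and URL are placeholders.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

public class SpaActionTiming {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(30));
        try {
            driver.get("http://app.example.com");  // hypothetical SPA URL

            long start = System.nanoTime();
            driver.findElement(By.id("search-button")).click();        // the user action
            wait.until(ExpectedConditions.visibilityOfElementLocated(
                    By.id("search-results")));                          // action considered complete
            long actionMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("Search action response time: " + actionMs + " ms");
        } finally {
            driver.quit();
        }
    }
}
```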
Finally, while automating the performance tests, it is important to ensure that the test scenarios closely match real-life use cases. These use cases can be strikingly different from those of a monolithic application due to intrinsic characteristics of the cloud, for example availability zones and auto-scaling. It is important to identify the common patterns of microservices performance and reliability issues; I will discuss those in detail based on our reference application.
(To be continued: This is part of my book - "Automation that works")