That's a Load...

What if I turn it up to eleven?

This Is Spinal Tap is a classic in many ways, but one thing the fictional band knew was how to turn up the volume.

Have you cranked up your load testing? In my last large project, several critical components had to be load tested individually and as part of the overall system. We all know it takes just one bottleneck to spoil an otherwise perfectly good transaction. In our case, there were so many components and handoffs inside a single synchronous transaction that one bottleneck could easily kill the entire system. But where were they?

To ensure we tackled each one, we first had to model all the components of the user transaction from top to bottom. Each sub-transaction needed to be instrumented and measured to help identify those bottlenecks. And there were plenty. The transaction volume (i.e., "load" per time interval for this discussion) is estimated for a "busy hour" from projections or actual production data. Then all the different scenarios need to be played out, using "what if" brainstorming.
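The instrumentation step above can be sketched with a simple timing helper. This is a minimal illustration, not the project's actual tooling; the transaction and sub-transaction names are hypothetical:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock timings per sub-transaction name.
timings = defaultdict(list)

@contextmanager
def measured(name):
    """Record elapsed time for one (sub-)transaction."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)

# One end-to-end transaction built from measured handoffs
# (sleeps stand in for the real service calls).
with measured("checkout"):
    with measured("auth"):
        time.sleep(0.01)
    with measured("inventory"):
        time.sleep(0.02)

# The slowest accumulated name points at a candidate bottleneck.
slowest = max(timings, key=lambda n: sum(timings[n]))
print(slowest)
```

Wrapping every handoff this way is what makes the later "where is the bottleneck?" question answerable with data instead of guesses.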

What if a busy hour hits while a heavy batch process is still running? What if one or more of the servers in the cluster fail? What if a database query is constraining the SAN? What if everyone tries to log in before the data reorg is complete? You know the drill: the scenario brainstorm is the fun part. Remember, it's not necessarily the obvious set of transactions that are the killers; the mixing and matching of transactions is critical too.

Then it comes down to building the scripts that let you pull the levers to match the scenarios. Don't skimp here. Your test environment should match production as closely as possible; otherwise, while you'll certainly discover bottlenecks, they won't be the same ones you hit in production. I've seen the negative impact on a transaction be severe when the production system performed faster than the test system: the bottleneck had moved, and a transaction component ended up waiting in line for something completely different, an outcome that didn't exist in the test environment.
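A bare-bones version of such a script is a pool of concurrent workers firing transactions for a fixed duration, where the concurrency level and duration are the levers you pull. This is a sketch only; `send_transaction` stands in for whatever the real scripts would call:

```python
import random
import statistics
import threading
import time

def send_transaction():
    """Stand-in for the real service call."""
    time.sleep(random.uniform(0.001, 0.005))

def worker(duration_s, latencies):
    """Fire transactions until the deadline, recording each latency."""
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        send_transaction()
        latencies.append(time.perf_counter() - start)

def run_load(concurrency=8, duration_s=1.0):
    latencies = []  # list.append is atomic in CPython, safe to share here
    threads = [threading.Thread(target=worker, args=(duration_s, latencies))
               for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

lats = run_load(concurrency=4, duration_s=0.2)
print(f"{len(lats)} transactions, mean {statistics.mean(lats) * 1000:.1f} ms")
```

Real tools (JMeter, Locust, and the like) add pacing, ramp-up, and reporting on top, but the lever-pulling core is the same.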

After the tests are documented, the analysis starts. In those scenarios, does the system perform within the SLA? Sometimes you have to ask whether it needs to at all times: does your SLA dictate a busy-hour response, or an overall average? Smart negotiators will agree on SLAs that include volume estimates as well as performance metrics. No one can guarantee response time for an unknown quantity.
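That volume-plus-metric coupling can be made explicit in the analysis itself. In this sketch both numbers are illustrative, and the verdict is deliberately inconclusive if the test never reached busy-hour volume:

```python
import math

# Illustrative SLA: 95th-percentile response under 500 ms,
# guaranteed only at (or above) busy-hour volume.
SLA_P95_MS = 500
BUSY_HOUR_TXNS = 10_000

def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_sla(latencies_ms, observed_txns):
    if observed_txns < BUSY_HOUR_TXNS:
        return None  # inconclusive: the test didn't drive enough load
    return percentile(latencies_ms, 95) <= SLA_P95_MS

print(meets_sla([120, 340, 480, 610], observed_txns=12_000))  # False: p95 is 610 ms
print(meets_sla([120, 340, 480, 610], observed_txns=500))     # None: under-loaded run
```

Returning "inconclusive" rather than "pass" for an under-loaded run is the code-level version of refusing to guarantee response time for an unknown quantity.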

So, while functional testing is important, you also need to know how your system will perform under load scenarios. A properly constructed load test framework will pay dividends by flagging the impact of any change to the system, especially upgrades. Crank it up to eleven!

 

Nicely put. I would also add to the notion of 'crank it up' a 'mess it up' mentality. I've found that the test data may be just as important to the effort. We don't typically baby our test volume, so why baby the data? Get as realistic as possible - study logs and observe long-running transactions where you can - and in the end drop some 'poison pills' into your testing. The effort will most certainly be worth it in the end.
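The 'poison pill' idea above could be sketched as salting realistic test data with a few deliberately bad records. The field names and corruptions here are hypothetical examples, not anything from the project:

```python
import random

def make_test_rows(n_good=1000, n_poison=10, rng=random):
    """Generate realistic-looking rows plus a sprinkle of bad ones."""
    rows = [{"id": i, "amount": f"{rng.uniform(1, 500):.2f}", "status": "OK"}
            for i in range(n_good)]
    poisons = [
        {"id": -1, "amount": "NaN",   "status": "OK"},       # bad numeric
        {"id": 42, "amount": "",      "status": "UNKNOWN"},  # empty field
        {"id": 43, "amount": "10,00", "status": "OK"},       # locale comma
    ]
    # Cycle through the poisons deterministically, then shuffle them in.
    rows.extend(poisons[i % len(poisons)] for i in range(n_poison))
    rng.shuffle(rows)
    return rows

rows = make_test_rows(n_good=100, n_poison=6)
print(len(rows))
```

Under load, a pipeline that quietly chokes on one of these rows is exactly the failure mode you want to find before production does.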

John, what you wrote is very good project planning. Find the fail points, fix them before going live. All stakeholders, including customers, benefit from thorough failure mode and effect analysis.
