Visualizing Performance Benchmarks (3) - Start Small and Predict
So far in this series we've seen some nice visualizations of elapsed-time data for loading a large number of 5GB files onto a clustered data platform under varying degrees of load parallelism (number of concurrent, independent load pipelines).
You may wonder: why 5GB files? You may also ask: why start with 5 parallel loads? These are very good questions, and the answer lies in the theme of this article, which is to START SMALL AND PREDICT.
In this particular benchmarking effort the goal was to investigate, quantify, and understand the maximum data ingest capacity of a certain clustered database platform. The platform was composed of a master node, a loader node, and eight so-called worker nodes, which is where the data ultimately landed in sharded Postgres database instances. Without going into detail, all machines had 20 compute cores and identically configured I/O subsystems. The load pipeline is a client-server application: data fed into the client program is parsed, and valid tuples are forwarded to the server program. The load server hashes the distribution key and forwards each tuple to its correct shard (based on the hash); a minimal sketch of this routing step appears after the component list below. The load server is multi-threaded, with numerous hashing threads and tuple buffers for batching data to send to the shards. So the data ingest pipeline looks like this:
The red annotations indicate the primary compute functions performed by each of the three components of the ingest pipeline:
- Client: reads and parses data (read I/O, some CPU)
- Server: hashes tuple keys (heavy CPU) and sends tuples (network I/O)
- Database shards: write tuples and replicate (write I/O, some CPU)
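To make the server's routing step concrete, here is a minimal sketch in R of hashing a distribution key to a shard. The toy hash function and the shard count are illustrative stand-ins only; the platform's actual hashing implementation is not shown here.

```r
# Minimal sketch of hash-based shard routing (illustrative only).
shard_for_key <- function(key, n_shards = 8) {
  # Toy hash: sum the character codes of the key. A real loader would use a
  # proper hash function to get an even distribution across shards.
  h <- sum(utf8ToInt(as.character(key)))
  (h %% n_shards) + 1          # shard numbers 1..n_shards
}

# Route a few example keys to the eight worker-node shards
sapply(c("key-1001", "key-1002", "key-1003"), shard_for_key)
```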
START SMALL in this case means understanding a single load pipeline very well before scaling up to multiple concurrent pipelines, which will presumably be necessary to measure the total ingest capacity of the cluster. The pipeline starts with the client reading individual files sequentially, and the first question to answer is: what file size should we use? If we have 5TB of data, is it better to load it as a single file (probably not) or as multiple files (probably), and in the latter case is there an optimal file size to achieve maximum throughput?
So this benchmarking exercise began with setting up a single load pipeline and measuring load times with different file sizes. Using files ranging from 1MB to 100GB in size and loading 100 files of each size (except 100GB, for which 50 files were loaded), I collected data from five runs at each size. Measured elapsed times were converted to ingest rates, as that is the target metric of interest for the study. The data were plotted using box-whisker plots as below.
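As an illustration of that conversion (not the actual script used), elapsed times can be turned into ingest rates with a one-line transform in R; the column names and values here are made-up placeholders:

```r
# Hypothetical raw measurements: one row per loaded file (values are made up)
runs <- data.frame(
  file_size_mb = c(1, 1, 10, 10, 1000, 1000),
  elapsed_sec  = c(0.9, 1.1, 2.1, 2.3, 95, 102)
)

# Elapsed time -> ingest rate, the metric of interest for the study
runs$rate_mb_per_sec <- runs$file_size_mb / runs$elapsed_sec
runs
```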
Box-whisker plots (drawn here with the R ggplot2 package) present a statistical summary of the data graphically. The "box" is drawn between the first and third quartiles with the median shown as the horizontal line inside, so 50% of the data lies "inside the box." The "whisker" lines extend above and below the box to the furthest points within 1.5 times the interquartile range (the distance between the first and third quartiles). Outlier points are plotted individually. Note that the boxes are very flat, meaning the distribution of measurements has little spread, which is good (repeatability). It is interesting to see several significant outliers at both the 100MB and 1GB file sizes, but the distributions are still quite compact, so this is not of great concern.
The red line connects the median ingest rates for each size and gives a clear picture of how ingest rate varies with file size. The severe penalty for smaller file sizes is explained by the fact that each file entering the pipeline spawns a new load server process, so process creation time dominates the time actually spent hashing and distributing tuples for small files. The throughput gains between 10MB and 1GB are substantial, but fall off considerably above 10GB. So the sweet spot seems to be between 1GB and 10GB. I selected 5GB to get most of the gain at half of whatever overhead (e.g. memory buffers) a 10GB file size would induce. Note that understanding the mechanics of the processing (a spawned server process per file) was useful in interpreting these results. Black-box benchmarking with little operational understanding of the software being studied is more anecdotal than explanatory, and therefore less trustworthy.
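A plot of this kind can be reproduced in outline with ggplot2 along these lines; this is a sketch using made-up placeholder values, not the code or data behind the figures above:

```r
library(ggplot2)

# Placeholder data standing in for the real measurements: one row per loaded
# file, with a size label and the computed ingest rate (values are made up).
sizes <- c("1MB", "10MB", "100MB", "1GB", "5GB", "10GB", "100GB")
rates <- data.frame(
  file_size       = factor(rep(sizes, each = 20), levels = sizes),
  rate_mb_per_sec = runif(7 * 20, 50, 500)
)

ggplot(rates, aes(x = file_size, y = rate_mb_per_sec)) +
  geom_boxplot() +                                   # box = IQR, whiskers within 1.5 * IQR
  stat_summary(mapping = aes(group = 1), fun = median,
               geom = "line", colour = "red") +      # red line through the medians
  labs(x = "File size", y = "Ingest rate (MB/s)")
```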
Linux resource usage measurements for disk and CPU were collected during the runs using the atop performance monitor. This tool is excellent for watching system performance interactively as well as for collecting and storing many important performance metrics for offline analysis. The plots below show CPU and disk usage for the load client, load server, and database shard machines. I should have mentioned earlier that these single-pipeline tests used the cluster master to host the load client and input files, and the loader node to host the load server processes. The database shards of course occupied their own hosts as well, so all components of the ingest pipeline were isolated on separate machines. This allowed the resource footprint of each component to be analyzed with minimal interaction effects, enabling a much better understanding of the process and potential bottlenecks.
The plots below show average CPU and disk resources used at each of the tested file sizes, per load pipeline component. In these plots the red line is the load client host, the green line the load server host, and the blue line the DB shard hosts.
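A chart of this shape can be sketched in ggplot2 as follows, assuming the atop measurements have already been summarised into per-host averages; the data frame, column names, and values below are placeholders for illustration:

```r
library(ggplot2)

# Placeholder data standing in for the per-host averages extracted from atop:
# one row per (file size, host role); the values here are made up.
sizes <- c("1MB", "10MB", "100MB", "1GB", "5GB", "10GB", "100GB")
hosts <- c("load client", "load server", "DB shards")
cpu <- data.frame(
  file_size          = factor(rep(sizes, times = length(hosts)), levels = sizes),
  host               = rep(hosts, each = length(sizes)),
  avg_active_threads = runif(length(sizes) * length(hosts), 0, 3)
)

ggplot(cpu, aes(x = file_size, y = avg_active_threads,
                colour = host, group = host)) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = c("load client" = "red",
                                 "load server" = "green",
                                 "DB shards"   = "blue")) +
  labs(x = "File size", y = "Average active threads", colour = "Host")
```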
The CPU plot charts "Average Active Threads," which is analogous to the "Average Active Sessions" metric we use in Oracle performance analysis (see my earlier presentation post). You can think of this as how many CPU cores' worth of work a process is doing on average. Note how the lines cross between the 10MB and 100MB file sizes. This is where the per-file overhead becomes dominated by the actual work being performed, and most of the CPU used is spent hashing tuples (green line), as we expect. At the selected 5GB file size, the load server host uses almost 2.5 cores of CPU for this single pipeline. Projecting ahead to multiple pipelines, and keeping in mind the load server has 20 cores, we might expect to see evidence of CPU saturation at 8-10 ingest pipelines (although the machines are hyper-threaded, so 20 cores is not a hard limit).
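For reference, here is the arithmetic behind that metric as I am using it (a rough working definition, not atop's own terminology): CPU time consumed divided by wall-clock time.

```r
# Average Active Threads: CPU-seconds consumed by the processes of interest
# divided by the wall-clock length of the measurement interval.
avg_active_threads <- function(cpu_seconds_used, wall_clock_seconds) {
  cpu_seconds_used / wall_clock_seconds
}

# Illustrative numbers only: 1200 CPU-seconds burned over an 8-minute load
avg_active_threads(1200, 480)    # 2.5 "cores' worth" of CPU
```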
On the disk side, the charts plot "average disk busy percent," which is conveniently measured and recorded by atop. Here we see that the file input host (red line), at almost 15% busy, is nearly three times busier than the database shard hosts (blue line) at roughly 5%. This suggests that multiple concurrent pipelines may bottleneck on file input capacity at approximately 6 or 7 pipelines.
By understanding a single ingest pipeline to this level of detail, we can make some ballpark predictions about what to expect in the multi-pipeline case (the short calculation after this list works them out):
- Using file inputs from a single machine, I/O capacity will bottleneck at 6 or 7 concurrent ingest pipelines (100% / 14%).
- On this cluster, a dedicated load server host should support at least 8-10 concurrent ingest pipelines, and possibly significantly more (20 cores / 2.5 cores).
- The total write capacity of the clustered database machines should support up to 20 concurrent ingest pipelines (100% / 5%).
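The back-of-the-envelope arithmetic behind these predictions, using the approximate single-pipeline figures quoted above:

```r
# Ballpark pipeline counts: available capacity / single-pipeline footprint
max_pipelines <- function(capacity, per_pipeline) capacity / per_pipeline

max_pipelines(100, 14)    # file-input disk busy %  -> ~7 pipelines
max_pipelines(20, 2.5)    # load server CPU cores   -> 8 pipelines
max_pipelines(100, 5)     # DB shard disk busy %    -> 20 pipelines
```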
This segment was not about fancy visualization so much as benchmarking fundamentals. The ultimate objective was to benchmark the total ingest capacity of the clustered database, and we could have begun by slamming in as much data as possible with multiple ingest pipelines from the start. However, STARTING SMALL by examining and understanding a single pipeline reasonably well equipped us to better understand and interpret the multi-pipeline results. We can also PREDICT bottlenecks in advance and then test those predictions, confirming or refuting the validity of our understanding.
We did make good use of data visualization to arrive at a reasoned decision about what file size to load, as well as to understand the CPU and disk resource footprints of a single ingest pipeline at various file sizes. The same conclusions could have been drawn from the data in tabular form, but the plots expose interesting features in the data more clearly and provide more confidence in our understanding.
Thanks for reading. As always likes and comments are appreciated and welcomed.
Prior articles in this series:
Visualizing Performance Benchmarks - Repeatability
Visualizing Performance Benchmarks (2) - First see ALL the Data