Common anti-pattern in code


I came across a performance issue while we were migrating some code, previously packaged as an RPM, into our own container. What we encountered were sporadic delays in flushing data consumed from Kafka to the database, delays that had never happened before. We analyzed the issue in depth but believed it was caused by container overhead, since we believed the code to be identical.
Running a throughput test revealed we could handle 10X the traffic without any issues or delays. Looking at sar metrics, the OS was not experiencing anything out of the ordinary at those times (no page faults, no CPU spikes, nothing of the sort). The number of rows written to the DB formed a smooth line instead of the spikiness shown by the older app.
What we did see was that at the times when latency peaked, we were occasionally seeing very high GC times, unlike in the old RPM version of the code. This led us to suspect that something was wrong with the application itself. That is how I stumbled upon this piece of code:

List<Event> events = new ArrayList<>();

while (true) {
    Event event = iter.next();   // blocks until the next message arrives
    events.add(event);
    if (events.size() > batchSize || (System.currentTimeMillis() - lastFlushTime.get()) > threshold)
        flush();
}

void flush() {
    writeToDB();
    events.clear();
    lastFlushTime.set(System.currentTimeMillis());
}

At first glance this seems like very simple code. It ensures that events do not buffer longer than threshold milliseconds or grow larger than the batch size. However, look at this code a little longer and one thing becomes clear: the upper bound of threshold milliseconds is not honored in all circumstances. It works when the rate at which messages are pushed to the topic is steady. But because iter.next() blocks until a message arrives, the time check only runs after an event shows up; if traffic stops, the flush might not occur for more than a day even if the threshold was, for example, only 30 seconds.
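To make the failure concrete, here is a minimal, hypothetical simulation (the class, method, and variable names are my own, not from the original code). Arrival timestamps stand in for a real clock, and the flush check only runs when an event arrives, just as it does after iter.next() returns in the loop above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the buggy loop. Arrival times play the role of
// the clock; the flush condition is only evaluated when an event "arrives",
// mirroring the fact that iter.next() blocks between events.
public class FlushLagDemo {
    static final long THRESHOLD_MS = 50;
    static final int BATCH_SIZE = 1000;

    // Returns the time at which the first flush happens.
    static long simulate(long[] arrivalTimes) {
        List<Long> events = new ArrayList<>();
        long lastFlushTime = 0;
        long firstFlushAt = -1;
        for (long now : arrivalTimes) {
            events.add(now);
            if (events.size() > BATCH_SIZE || (now - lastFlushTime) > THRESHOLD_MS) {
                if (firstFlushAt == -1) firstFlushAt = now;
                events.clear();
                lastFlushTime = now;
            }
        }
        return firstFlushAt;
    }

    public static void main(String[] args) {
        // Threshold is 50 ms, but with a 300 ms gap in traffic the first
        // event is only flushed at t = 300 ms, when the next event arrives.
        System.out.println("first flush at t=" + simulate(new long[]{0, 300}) + " ms");
    }
}
```

The flush latency tracks the gap between events, not the threshold: the quieter the topic, the later the flush.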

What truly achieves what this code is trying to do, with appropriate synchronization, is the following:
while (true) {
    Event event = iter.next();
    events.add(event);
    if (events.size() > batchSize) {
        flush();
    }
}

class Flusher extends Thread {
    public void run() {
        while (true) {
            flush();   // call flush on the buffering object above;
                       // flush() must now be synchronized, as two threads call it
            try {
                Thread.sleep(threshold);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
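A fuller sketch of this two-thread design might look like the following. The names here (EventBuffer, buffered, writeToDB) are my own illustrative choices, not from the original code; the point is that flush() is synchronized because both the consumer thread and the flusher thread call it:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a buffer shared between a consumer thread (add) and a
// periodic flusher thread (flush). Names are hypothetical, not from the post.
public class EventBuffer {
    private final List<String> events = new ArrayList<>();
    private final int batchSize;

    public EventBuffer(int batchSize) { this.batchSize = batchSize; }

    // Called from the consumer thread for every event.
    public synchronized void add(String event) {
        events.add(event);
        if (events.size() >= batchSize) flush();
    }

    // Called from both the consumer thread and the flusher thread, so it is
    // synchronized: writeToDB and clear must see a consistent buffer.
    public synchronized void flush() {
        if (events.isEmpty()) return;   // nothing buffered; skip the DB round trip
        writeToDB(events);
        events.clear();
    }

    public synchronized int buffered() { return events.size(); }

    void writeToDB(List<String> batch) {
        System.out.println("wrote " + batch.size() + " rows");  // stand-in for the real DB write
    }

    public static void main(String[] args) throws InterruptedException {
        EventBuffer buffer = new EventBuffer(1000);
        long threshold = 100;  // ms

        Thread flusher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                buffer.flush();  // time-based flush, independent of traffic
                try { Thread.sleep(threshold); }
                catch (InterruptedException e) { return; }
            }
        });
        flusher.setDaemon(true);
        flusher.start();

        buffer.add("event-1");   // far below batchSize...
        Thread.sleep(300);       // ...but flushed within roughly one threshold anyway
    }
}
```

With this split, the time bound is enforced by the flusher thread regardless of whether any events are arriving, and the size bound is enforced inline by add().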

The code listed at the beginning tries to avoid an additional thread and the synchronization effort that comes with it, but ends up disguising a big bug which causes ANTI-THROUGHPUT issues: the latency between getting an event and flushing it is highest during low traffic :) I have seen this pattern in a bunch of open source software as well, which is why I wanted to share it.

One side note to this issue is that we always need to challenge assumptions, especially as we start eliminating causes. As it turned out, the code deployed in production was not running the latest version of the rpm.

Here is a timeline which illustrates a scenario that causes this kind of issue, with a time threshold of 30 seconds and a batch size of 1000 events:
Time T: 600 events produced
Time T + 1 second: 300 events produced
Time T + 3600 seconds: 200 events produced
The batch never fills before traffic stops, and no event arrives to trigger the time check, so the events buffered around time T are only flushed an hour later, at T + 3600 seconds, when the next events finally arrive — far beyond the 30-second threshold.

Abhishek, nice post. I believe you are iterating over Kafka messages and iter.next blocks until the next message, which, as you explained, can be a day. Doesn't Kafka provide a way to return a message or block only X seconds? Wouldn't that be a better option than the thread?
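For what it's worth, the commenter's timed-poll idea can be sketched as follows. Here java.util.concurrent.BlockingQueue.poll(timeout) stands in for Kafka's KafkaConsumer.poll(Duration), which likewise returns after at most the given wait even when no records arrive; all other names below are illustrative, not from the original code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Single-threaded loop that honors both bounds without a flusher thread.
// poll(timeout) returns null after the wait instead of blocking forever,
// so the time-based flush check runs even when no events arrive.
public class TimedPollLoop {
    static void consume(BlockingQueue<String> source, int batchSize,
                        long thresholdMs, List<List<String>> flushed,
                        int maxPolls) throws InterruptedException {
        List<String> events = new ArrayList<>();
        long lastFlushTime = System.currentTimeMillis();
        for (int i = 0; i < maxPolls; i++) {   // bounded for the demo; real code loops forever
            String event = source.poll(thresholdMs, TimeUnit.MILLISECONDS);
            if (event != null) events.add(event);
            boolean batchFull = events.size() >= batchSize;
            boolean tooOld = System.currentTimeMillis() - lastFlushTime >= thresholdMs;
            if (!events.isEmpty() && (batchFull || tooOld)) {
                flushed.add(new ArrayList<>(events));   // stands in for writeToDB
                events.clear();
                lastFlushTime = System.currentTimeMillis();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> source = new LinkedBlockingQueue<>();
        source.add("event-1");                  // one event, then silence
        List<List<String>> flushed = new ArrayList<>();
        consume(source, 1000, 50, flushed, 3);
        // Despite the batch never filling, event-1 is flushed within ~50 ms.
        System.out.println("batches flushed: " + flushed.size());
    }
}
```

The trade-off versus the flusher thread is one of structure: this keeps everything single-threaded and needs no synchronization, at the cost of waking up once per threshold even when idle.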

