Sub-second network visibility - what are you missing?

Standard network monitoring tools can lull you into a false sense of security and leave you with blind spots in your network.

High-performance network monitoring tools can be expensive to deploy and maintain.

stream2Cloud gives you the best of both worlds: it is low-cost and easy to deploy, yet offers the in-depth analysis you would expect from a high-end solution.

The new normal

A trading application reports regular perceived delays on data arriving from the exchange. "My prices are late" - it's a common problem raised to infrastructure teams.

The utilisation chart for the 1Gbit link to the exchange, with metrics computed every 10 seconds, puts the link at only around 1.2% utilisation. Looks good, right?

The next port of call would be to look for common problems with packet loss.

The TCP gap analysis shows a regular low level of gaps happening - worrying.

However, the TCP retransmission analysis shows no observed retransmissions, which means the TCP gaps are an artefact of the data capture layer. In this case we know that the connection from the source of packets to the stream2Cloud endpoint is experiencing problems of its own. Despite that, we can keep troubleshooting, because we are confident that the monitored TCP sessions themselves are not showing retransmissions.
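
The reasoning above (gaps without retransmissions point at the capture path, not the network) can be sketched in a few lines of Python. This is a simplified illustration, not how stream2Cloud performs its analysis: the function names and the (sequence number, payload length) input format are assumptions, and sequence wraparound, SACK and keep-alives are ignored. Out-of-order segments are counted as retransmissions here, which a real analyser would separate out.

```python
def classify_segments(segments):
    """Count gaps and retransmissions in one direction of one TCP flow.

    segments: iterable of (seq, payload_len) tuples in capture order
    (hypothetical input format). Returns (gaps, retransmissions).
    """
    next_expected = None
    gaps = 0
    retransmissions = 0
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data, so they cannot create a gap
        if next_expected is None:
            next_expected = seq + length
        elif seq > next_expected:
            gaps += 1  # bytes the capture never saw
            next_expected = seq + length
        elif seq < next_expected:
            retransmissions += 1  # data that was already seen, sent again
            next_expected = max(next_expected, seq + length)
        else:
            next_expected = seq + length
    return gaps, retransmissions


def interpret(gaps, retransmissions):
    """Very rough interpretation of the two counters."""
    if retransmissions:
        return "loss on the monitored sessions is likely"
    if gaps:
        return "gaps are likely a capture artefact"
    return "clean"
```

A sender that really lost a segment would resend it, so a gap that is never followed by a retransmission suggests the packet reached the receiver and was only dropped on its way to the monitoring endpoint.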

There are no TCP zero-byte window size events either. These are another common cause of perceived latency, and occur when the receiving application becomes a slow consumer.
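
As a rough sketch of what a zero-window check looks for, the Python below flags segments that advertise a receive window of zero. The dictionary keys and the function name are assumptions for illustration. Segments with SYN, FIN or RST set are skipped, since a zero window on those does not indicate a slow consumer.

```python
def zero_window_events(segments):
    """Return the timestamps at which a receiver advertised a zero window.

    segments: iterable of dicts with keys 'ts' (timestamp), 'window'
    (advertised receive window in bytes) and 'flags' (a set of flag names
    such as {'ACK'}). This input format is hypothetical.
    """
    ignored_flags = {"SYN", "FIN", "RST"}
    events = []
    for seg in segments:
        if seg["window"] == 0 and not (seg["flags"] & ignored_flags):
            events.append(seg["ts"])
    return events
```

An empty result, as in this case, means the receiving application kept up with the data and we need to look elsewhere.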

Only when looking at the stream2Cloud microburst metrics can you start to see the signs of a problem: regular peaks in utilisation of up to around 45% at the millisecond scale. Network bursts can cause delays and are a good indicator of congestion or of components within your infrastructure that are misbehaving. Here you can see that this behaviour is fairly consistent.
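
To see why a 10-second average can hide millisecond bursts, here is a minimal sketch in Python. It is not how stream2Cloud computes its metrics: the function names, the (timestamp in microseconds, size in bytes) packet format and the synthetic trace are all assumptions made for illustration.

```python
from collections import defaultdict

LINK_BPS = 1_000_000_000  # 1Gbit link, as in the example above


def utilisation(packets, window_us, duration_us, link_bps=LINK_BPS):
    """Return (mean, peak) utilisation as fractions of link capacity.

    packets: iterable of (timestamp_us, size_bytes), timestamps as integers.
    window_us: width of each measurement window in microseconds.
    duration_us: total length of the trace in microseconds.
    """
    bits_per_window = defaultdict(int)
    for ts_us, size_bytes in packets:
        bits_per_window[ts_us // window_us] += size_bytes * 8
    window_capacity = link_bps * window_us / 1_000_000
    peak = max(bits_per_window.values(), default=0) / window_capacity
    mean = sum(bits_per_window.values()) / (link_bps * duration_us / 1_000_000)
    return mean, peak


def synthetic_trace():
    """Invented 10-second trace: every 40ms, 45 packets of 1250 bytes
    arrive back to back within a single millisecond."""
    packets = []
    for burst in range(250):
        start_us = burst * 40_000
        for i in range(45):
            packets.append((start_us + i * 20, 1250))
    return packets


if __name__ == "__main__":
    trace = synthetic_trace()
    for label, window_us in (("10s", 10_000_000), ("1ms", 1_000)):
        mean, peak = utilisation(trace, window_us, 10_000_000)
        print(f"{label} windows: mean {mean:.3%}, peak {peak:.3%}")
```

On this invented trace the 10-second window reports a peak of just over 1%, while the 1ms window reports a peak of 45%. The traffic is identical; only the measurement interval has changed.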

Looking at the IP addresses involved, also broken down at the millisecond level, you can see that although 10.101.1.32 (the exchange gateway) has the highest burstiness, no unusual patterns stand out: the microburst behaviour seems to be happening uniformly.
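
A per-address breakdown like this can be approximated by bucketing traffic by source IP and millisecond. The Python sketch below is an illustration under assumptions: the function name and the (timestamp in microseconds, source IP, size in bytes) tuple format are hypothetical, and it says nothing about how stream2Cloud derives its own figures.

```python
from collections import defaultdict


def peak_ms_utilisation_by_source(packets, link_bps=1_000_000_000):
    """Return {src_ip: peak utilisation fraction over 1ms windows}.

    packets: iterable of (timestamp_us, src_ip, size_bytes) with integer
    timestamps (hypothetical input format).
    """
    bits = defaultdict(int)
    for ts_us, src_ip, size_bytes in packets:
        bits[(src_ip, ts_us // 1000)] += size_bytes * 8
    capacity_per_ms = link_bps / 1000  # bits the link can carry in 1ms
    peaks = defaultdict(float)
    for (src_ip, _window), window_bits in bits.items():
        peaks[src_ip] = max(peaks[src_ip], window_bits / capacity_per_ms)
    return dict(peaks)
```

If one address showed a peak far above the rest, that host would be the first place to look. Here the peaks are spread evenly, which points at something shared, such as the path itself.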

Network Microscope

How can we look into this problem in more detail?

stream2Cloud captures the packets as well as generating telemetry, so we can drill down to the packets from this time-frame and see what can be observed with our microscope - CloudShark.

We can zoom in to a finer and finer resolution, looking to see where the 400Mbit bursts appear.

1 second - all looking fine.

Peaks appearing at the 10ms level, but we can zoom in further still.

This starts to tally up with what we were observing at the 1ms level in the stream2Cloud telemetry. In this case there is a single major spike during this 10-second interval.

Zooming in further we can see there are actually several bursts with large pauses.

CloudShark has a feature that allows sub-millisecond analysis.

Finally, the maximum level of detail, 200 microseconds, shows the full scale of the problem: huge bursts followed by large delays of tens of milliseconds. Behaviour like this can be caused by buffering in switch ports: packets are not lost, just held in a buffer before being forwarded. This is a more common problem than you might think with store-and-forward switches, especially with line-rate conversion, e.g. from 1G to 10G.
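
The burst-then-pause pattern can also be picked out of packet timestamps directly, rather than by eye. The Python sketch below groups packets into bursts whenever the gap between consecutive packets stays under a threshold. The function name, the 1ms default threshold and the (timestamp in microseconds, size in bytes) input format are assumptions chosen for illustration, not settings taken from stream2Cloud or CloudShark.

```python
def find_bursts(packets, max_gap_us=1000):
    """Group packets into bursts separated by pauses.

    packets: list of (timestamp_us, size_bytes) sorted by timestamp.
    max_gap_us: a gap longer than this starts a new burst (assumed value).
    Returns a list of dicts, one per burst, including the pause that
    preceded it (None for the first burst).
    """
    bursts = []
    for ts_us, size_bytes in packets:
        if bursts and ts_us - bursts[-1]["end_us"] <= max_gap_us:
            # Close enough to the previous packet: extend the current burst.
            bursts[-1]["end_us"] = ts_us
            bursts[-1]["bytes"] += size_bytes
            bursts[-1]["packets"] += 1
        else:
            pause = ts_us - bursts[-1]["end_us"] if bursts else None
            bursts.append({
                "start_us": ts_us,
                "end_us": ts_us,
                "bytes": size_bytes,
                "packets": 1,
                "pause_before_us": pause,
            })
    return bursts
```

Short, dense bursts with a `pause_before_us` in the tens of milliseconds would match the picture described above: packets held in a buffer and then released together.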

This kind of behaviour may not be important to many enterprise applications, but when your business depends on timely delivery of data - such as electronic trading - your network may be hiding some terrible secrets.

Take it for a spin

Find out more about how to get started with stream2Cloud - and sign up for a free evaluation.

By Steve Rodgers