Fast Visualization of Big Data Streams Using a Sliding Offset Approach with Apache Kafka and VIPER Visualization
Over the course of a year, I tried and failed to find off-the-shelf visualization software for large amounts of data that is fast, secure, and does not require expensive hardware and software. I could not find anything that satisfied my needs. I like simple, and in the big data world, it seems to me, there are many SQL-based approaches that work well for querying large amounts of data very quickly, but fall short when it comes to visualizing big data in standard web browsers.
Data streams by their very nature accumulate very quickly, and in a matter of minutes you could have gigabytes of data in your cloud storage. This blog will show how these data can be easily and quickly visualized in standard browsers using web sockets over HTTP/HTTPS connections.
Standard web browsers were not built to visualize large amounts of data. Companies like Datadog have some cool stuff for cloud-based monitoring, but they are still at the mercy of standard web browsers. If you are like me, you probably have lots of browser tabs open, all of which consume memory and leave very few resources for visualizing big data graphs.
Luckily, Kafka offers some very powerful features that can be used to visualize big data. Specifically, when lots of data streams into a Kafka topic, Kafka stores it in partitions. The partitions in a topic allow you to store an unlimited amount of data, which makes them ideal for big data. The challenge comes when you want to access these data. This is because when your data goes into Kafka, Kafka chooses which partition to write it to in order to optimize storage. You can ask Kafka to return the partition number to you for future access. Partitions are also how Kafka manages concurrency. For example, say you have a topic in Kafka called "CustomerPurchases" fed by millions of customers. This topic could contain millions or billions of product purchases. Then, say, you have 10 people who "concurrently" want to visualize these product purchases every time someone makes a purchase.
Note the word "concurrently". When visualizing the topic, each individual should get the information at the same time, with very low latency. The way to handle this in Kafka is to create the CustomerPurchases topic with 10 partitions. As data goes into the CustomerPurchases topic, Kafka distributes the data across the 10 partitions, and because each partition can be read independently, it can deliver the information to the 10 consumers almost simultaneously - PRETTY COOL!
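To make the concurrency idea concrete, here is a minimal sketch using the kafka-python client (my assumption; Viper handles all of this for you internally). The broker address and group id are placeholders, and the topic name comes from the example above:

```python
# A minimal sketch of partition-based parallel consumption, assuming
# the kafka-python client and a broker at localhost:9092 (both my
# assumptions). The topic name comes from the example above; the
# group id is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "CustomerPurchases",
    bootstrap_servers="localhost:9092",
    group_id="purchase-viewers",  # consumers sharing this id split the partitions
    auto_offset_reset="latest",   # start from the newest messages
)

# Run this same script in up to 10 processes: Kafka assigns each
# process a disjoint share of the topic's 10 partitions, so the
# group drains the topic in parallel.
for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
```

Because each partition is consumed independently, adding partitions is how you scale out the number of parallel readers.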
Ok - how do we visualize big data streams with Kafka and Viper visualization? Viper visualization is a binary that is integrated with Kafka and comes with a built-in webserver that serves standard HTML visualizations to standard web browsers over HTTP and HTTPS using web socket connections. It is fast, secure, and gives you full control over how much you want to visualize. Even better, it requires no special hardware or software. Below is the URL you use to access Viper visualization (after you have started it):
https://[host]:[port]/[HTML file]?topic=[Topic Name]&offset=[Offset, set to 0]&rollbackoffset=[number of offsets to rollback the data stream]&topictype=[anomaly, prediction, optimization]&secure=[1 or 0]&append=[1=append all data to the web table, 0=do not append]&consumerid=[Consumer ID for the topic]&groupid=[Group id to consume parallel messages]&vipertoken=[copy/paste the token in ADMIN.TOK file]
The key parameters that control visualization of big data streams are as follows (a short sketch that assembles the full URL from them follows this list):
1) topic - this is the Kafka topic you are storing all your information in. For example, CustomerPurchases.
2) offset - this indicates the location of your data in the data stream. offset=0 means you start at the "first" data point.
3) rollbackoffset - this parameter tells Kafka/Viper visualization to roll back the data stream to create a sliding window of data. More on this below.
4) append - this tells Viper visualization whether or not to append new data to your visualization.
5) groupid - this is the consumer group id for the topic. Consumers sharing a group id can read the topic's partitions in parallel, which is what lets us serve 10 people concurrently, for example.
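Here is the promised sketch that assembles the visualization URL from these parameters. The host, port, HTML file name, consumer id, group id, and token values are all placeholders (my assumptions); only the query-parameter names come from the URL format above:

```python
# A small sketch that assembles the Viper visualization URL from the
# parameters above. Host, port, HTML file name, consumer id, group id,
# and token are placeholders; only the parameter names come from the
# URL format shown earlier.
from urllib.parse import urlencode

params = {
    "topic": "CustomerPurchases",   # the Kafka topic to visualize
    "offset": -1,                   # -1 = start from the LAST offset
    "rollbackoffset": 100,          # sliding window of 100 offsets per partition
    "topictype": "prediction",      # anomaly, prediction, or optimization
    "secure": 1,                    # 1 = HTTPS
    "append": 0,                    # 0 = replace the data, 1 = keep appending
    "consumerid": "my-consumer-id",        # placeholder
    "groupid": "my-group-id",              # placeholder
    "vipertoken": "paste-ADMIN.TOK-here",  # placeholder
}

url = "https://127.0.0.1:8003/viperviz.html?" + urlencode(params)
print(url)
```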
Now, because we are in the big data world with data streams, your data can, theoretically, grow to an unlimited amount in Kafka. To create a sliding window of the data, what we want to do is tell Viper and Kafka to get the "last x number of data points from all partitions" continuously.
This can be done as follows:
1) Set your offset=-1 - this tells Viper and Kafka to go to the "LAST" offset in the data stream.
2) Set your rollbackoffset=x, where x is any number. But, be careful, a large number means more data will get pushed to your browser. Specifically, if your topic has 10 partitions, and x=100, then the amount of data that will be pushed to your browser is 100x10=1000 data points every time there is an update to the Kafka topic.
3) Set your append=1 or 0. If append=0, the 1000 data points replace the last 1000 points. If append=1, the 1000 data points get appended to the last 1000 data points. You can see how append=1 can become a problem: the data held by your browser grows without bound as the stream flows.
For example, using offset=-1, rollbackoffset=100, and append=0 will result in Kafka going to the end of each partition in the data stream, rolling it back by 100 offsets, grabbing the data, and sending it to you. Specifically, if the last offset in each partition is 1,000,000, then rolling it back by 100 offsets gives new_offset = 1,000,000 - 100 = 999,900 in each partition. Kafka will grab the data starting from new_offset=999,900 to the end of the data stream in each partition. By repeating this process, we create continuous sliding windows of length 1000 (100 offsets from each of the 10 partitions). This process is sub-second, and concurrent, with Viper visualization and Kafka over HTTPS!
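If you want to see the sliding-offset logic spelled out in code, here is a sketch using the kafka-python client (my assumption; this is not Viper's internal implementation). It seeks to the end of every partition, rolls each one back by 100 offsets, and reads forward to the end, which is exactly the window described above:

```python
# A sketch of the sliding-offset read, assuming the kafka-python
# client and a broker at localhost:9092 (both placeholders).
from kafka import KafkaConsumer, TopicPartition

TOPIC = "CustomerPurchases"  # placeholder topic name
ROLLBACK = 100               # the rollbackoffset value

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# end_offsets returns the offset AFTER the last message in each partition.
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    # Roll each partition back by ROLLBACK offsets (never below zero).
    consumer.seek(tp, max(end_offsets[tp] - ROLLBACK, 0))

# One pass over the window: up to ROLLBACK records per partition,
# i.e. up to 1000 data points for a 10-partition topic.
window = consumer.poll(timeout_ms=2000, max_records=ROLLBACK * len(partitions))
for tp, records in window.items():
    for record in records:
        print(f"partition={tp.partition} offset={record.offset}")
```

Repeating this seek-and-poll cycle on every topic update is what produces the continuous sliding window.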
The image below shows a real-time big data visualization of predictions stored in a Kafka topic with 30 partitions. Viper visualization is listening for connections on IP 127.0.0.1 and port 8003 (HTTPS port). My settings are rollbackoffset=4, offset=-1, append=0, and groupid=GroupId-financegroup2.
This particular topic has 50 gigabytes of continuously growing data. The total number of data points pushed to my Google Chrome browser is approximately 126 (4 x 30 = 120; because data is flowing in real time, the true number will hover around this figure). Whenever there is an update to this topic, Viper visualization almost instantly pushes the new data to my browser, no matter how big the data gets in the topic and in the 30 partitions.
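For reference, the URL for this example would look something like the following (the HTML file name, topic name, consumer id, and token are placeholders):
https://127.0.0.1:8003/[HTML file]?topic=[prediction topic]&offset=-1&rollbackoffset=4&topictype=prediction&secure=1&append=0&consumerid=[Consumer ID]&groupid=GroupId-financegroup2&vipertoken=[token in ADMIN.TOK]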
Big data streaming visualization is a critical component of Transactional Machine Learning, which opens up many exciting visualization opportunities with deeper insights from very large, fast-flowing data sets.
Till next time...