How does Uber compress its incoming trip data by 86%, making the same storage last roughly 7x longer?
Uber is the face of startup disruption, challenging every startup growth story we have seen so far. But the company is also well known for its engineering: it conducts millions of trips each day, pushing terabytes of data through its platform.
Now, that’s a massive amount of data!
Let’s assume each trip sends 20 KB of data in JSON format. Do the math, and Uber needs 20 GB of storage for just 1 million trips, which is roughly the number of trips Uber conducts in a day.
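That back-of-envelope figure is easy to check (a minimal sketch, using decimal units where 1 GB = 10^9 bytes):

```python
# Back-of-envelope storage estimate (decimal units: 1 GB = 10**9 bytes).
BYTES_PER_TRIP = 20_000      # ~20 KB of JSON per trip (assumed above)
TRIPS_PER_DAY = 1_000_000    # ~1 million trips per day

daily_bytes = BYTES_PER_TRIP * TRIPS_PER_DAY
print(daily_bytes / 10**9)   # 20.0 GB of raw JSON per day
```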
Let’s take a benchmark of 30 TB and see how quickly Uber would run out of that storage space:
- At 1 million trips a day, it would take them roughly four years to consume 30 TB
- At 4 million trips a day, they would consume 30 TB in roughly a year
- Scaling up to a highly successful location-based startup conducting 10 million trips a day, they would consume 30 TB in about five months.
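A quick sketch of that runway math, treating the volumes above as daily trip counts (consistent with the 1-million-trips-a-day figure earlier) at 20 KB per trip, in decimal units:

```python
# How long a fixed storage budget lasts at a given daily trip volume.
BYTES_PER_TRIP = 20_000       # ~20 KB of raw JSON per trip (assumed)
BUDGET_BYTES = 30 * 10**12    # the 30 TB benchmark

def days_until_full(trips_per_day: int) -> float:
    """Days before the 30 TB budget is exhausted at this volume."""
    return BUDGET_BYTES / (trips_per_day * BYTES_PER_TRIP)

print(days_until_full(1_000_000))   # 1500.0 days, roughly four years
print(days_until_full(4_000_000))   # 375.0 days, roughly a year
print(days_until_full(10_000_000))  # 150.0 days, about five months
```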
For context, look at how many rides location-based taxi startups such as Lyft, Uber, and Curb conduct over a month.
The graph below illustrates how quickly Uber would consume 30 TB of storage.
But guess what?
Uber has 45 million active riders and often conducts more than 70 million trips in a month. If we do the math, that is roughly 1.4 TB of raw JSON a month, so 30 TB buys well under two years, before accounting for replication or further growth. At this scale, minimizing storage space became a priority for Uber.
Uber decided to use algorithms to optimize the raw JSON files. The objective was to compress the data without sacrificing performance, keeping encoding and decoding fast enough to preserve system efficiency.
By evaluating 10 encoding protocols (including Thrift, Protocol Buffers, Avro, and MessagePack, among others) and 3 compression libraries (Snappy, zlib, and bzip2), they were able to shrink the 20 KB payloads to 2,822 bytes, a reduction of roughly 86%.
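A minimal sketch of the idea using only the Python standard library, with json plus zlib standing in for the codec combinations Uber actually tested (the trip record below is invented for illustration, not Uber's schema):

```python
import json
import zlib

# A made-up trip record standing in for the ~20 KB JSON payloads in the article.
trip = {
    "trip_id": "abc123",
    "rider_id": 42,
    "driver_id": 7,
    "route": [{"lat": 37.7749 + i * 1e-4, "lng": -122.4194 + i * 1e-4}
              for i in range(200)],
    "fare_cents": 1875,
}

raw = json.dumps(trip).encode("utf-8")    # plain JSON encoding
compressed = zlib.compress(raw, level=9)  # zlib at maximum compression

ratio = 1 - len(compressed) / len(raw)
print(f"{len(raw)} -> {len(compressed)} bytes ({ratio:.0%} smaller)")
```

In practice, binary encodings such as MessagePack or Avro replace the JSON step entirely, which is where the savings beyond a plain compression pass come from.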
This not only saved space but also significantly reduced data processing time. Redoing the earlier calculation, the same 30 TB now lasts roughly seven times longer, on the order of a decade at current trip volumes rather than under two years.
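The multiplier follows directly from the per-trip sizes quoted above (assuming a decimal 20 KB for the raw payload):

```python
# Storage lifetime scales linearly with the per-trip compression ratio.
RAW_BYTES = 20_000        # ~20 KB per trip before optimization
COMPRESSED_BYTES = 2_822  # per-trip size after encoding + compression

ratio = RAW_BYTES / COMPRESSED_BYTES
saving = 1 - COMPRESSED_BYTES / RAW_BYTES
print(f"{ratio:.1f}x smaller")   # roughly 7.1x
print(f"{saving:.0%} reduction") # roughly 86%
```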
Keep in mind that the compatibility of these libraries and protocols varies depending on the database and server you choose.
Bottom line:
Uber’s approach was to test various protocols and libraries in combination and build a customized solution from the winners. With high-volume, high-velocity data coming in, it is critical for location-based startups to put a data compression strategy in place and optimize data storage.
What has been your experience dealing with data-heavy applications?