Does Data Sorting Really Accelerate Range Queries?
Log Structured Merge trees (LSMs) are widely used in modern data storage engines such as RocksDB and HBase. LSMs require constant data sorting to keep records stored sequentially. This sorting consumes a large share of I/O bandwidth, sometimes leaving only a small portion for clients, while also shortening SSD lifespan. Why do people pay this price?
One of the reasons for this constant sorting is to speed up range queries. A common belief is that sequentially stored data enables faster range queries. This belief is rooted in the properties of hard disks, where sequential access can be hundreds of times faster than random access.
On SSDs, however, the performance gap between sequential and random access narrows to only about two to ten times. In addition, LSMs do not store all data in one sorted run; instead, data are split into multiple sorted runs. As a result, it is unlikely that a range query can be served by a single sequential access.
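To make this concrete, here is a minimal sketch (not any particular engine's implementation) of why multiple sorted runs break the one-sequential-scan ideal: a range query must seek into every overlapping run separately and k-way merge the results. The run contents are made up for illustration.

```python
import heapq
from bisect import bisect_left

# Hypothetical sorted runs, as an LSM might hold after several flushes.
runs = [
    [("a", 1), ("d", 4), ("m", 13)],
    [("b", 2), ("e", 5), ("z", 26)],
    [("c", 3), ("n", 14), ("x", 24)],
]

def range_query(runs, lo, hi):
    """Serve a range scan [lo, hi) by merging every overlapping run.

    Each run needs its own seek (a separate random access on disk),
    so one logical range query fans out into len(runs) physical reads.
    """
    slices = []
    for run in runs:
        keys = [k for k, _ in run]
        start = bisect_left(keys, lo)
        end = bisect_left(keys, hi)
        slices.append(run[start:end])
    # k-way merge keeps the combined output in sorted order.
    return list(heapq.merge(*slices))

print(range_query(runs, "b", "n"))
```

With three runs, even this tiny scan of five keys costs three independent seeks rather than one sequential read.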
Moreover, in the age of big data, any meaningful storage solution must be distributed, with data partitioned across machines. With hash-based partitioning, the benefit of keeping data sorted on a single machine diminishes further, as one range query is likely to be served by many machines in parallel. Order-preserving partitioning is an option, but it often suffers from load-balancing problems.
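The fan-out under hash partitioning can be sketched as follows. The partition count, hash function, and keys are all hypothetical; the point is that adjacent keys scatter across nodes, so a range query must contact every partition and merge the replies.

```python
import heapq
import zlib

NUM_PARTITIONS = 4  # hypothetical cluster size

def partition_of(key: str) -> int:
    # Deterministic hash so keys always route to the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Hash partitioning scatters lexicographically adjacent keys across nodes.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key in ["apple", "banana", "cherry", "date", "fig", "grape"]:
    partitions[partition_of(key)].append(key)

def range_query(lo: str, hi: str) -> list:
    # Any partition may hold keys in [lo, hi), so all must be queried
    # (a scatter-gather), then the partial results merged.
    results = [sorted(k for k in p if lo <= k < hi) for p in partitions]
    return list(heapq.merge(*results))

print(range_query("b", "g"))
```

Because every partition is hit anyway, per-machine sort order buys little for the query as a whole; the work is already parallel random access across the cluster.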
To complicate matters further, the sorting itself consumes I/O bandwidth that could otherwise serve those very range queries, if the storage engine did not require sorting.
In summary, while sequential access is still faster than random access on SSDs, in a distributed storage system the price paid for sorting data may not justify the diminished (if any) benefit.
To reduce the cost of storage, we built and open-sourced Exabyte Store, a key-value storage engine that is performant and memory-efficient, with low write amplification because it does not require sorting.
Another way to look at it: assuming data are stored on SSDs, while it is unclear whether constant sorting gives you faster range queries, you definitely pay its high price. Suppose you are provisioning a storage system to support 10M IOPS, with 2.5M IOPS of updates and the remaining 7.5M IOPS of reads. If the LSM incurs a write amplification of 10, and compaction needs roughly the same number of read operations in order to sort, that is a 20-times blow-up of I/O operations. You therefore need to set aside 50M IOPS for the 2.5M IOPS of updates, for a total hardware capacity of at least 57.5M IOPS to support 10M client IOPS.
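The provisioning arithmetic above can be checked with a quick back-of-envelope calculation; the write-amplification factor of 10 is the assumption from the text, and the 2x factor reflects compaction reading roughly as much as it writes.

```python
# Back-of-envelope check of the provisioning math.
total_client_iops = 10_000_000
update_iops = 2_500_000
read_iops = total_client_iops - update_iops    # 7.5M client reads

write_amp = 10                                 # assumed LSM write amplification
# Compaction must read about as much as it writes, doubling the cost.
amplified_update_iops = update_iops * write_amp * 2   # 50M

required_iops = amplified_update_iops + read_iops     # 57.5M
print(required_iops)  # 57500000
```

So under these assumptions, nearly six times the client-facing capacity must be provisioned just to absorb the sorting overhead.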
Very interesting. However, doesn't the value of sorting depend on the size of data in a single sorted chunk? If the sorted data sit in a few medium-sized runs, sorting may still be worthwhile. It is ultimately a cost-benefit trade-off.