Don’t stall – multitask! Random access to main memory with cache-like performance
Imagine we could fit big data in the processor cache and access it without spending hundreds of stall cycles on data cache misses. This fantasy may well be unattainable, yet cache-like performance is within reach for workloads that consist of parallel tasks. Parallel tasks can be interleaved so that, while one waits for memory, others use the otherwise idle processor core. This form of cooperative multitasking can be implemented in software today, for instance with coroutines, i.e., functions that can suspend their execution and later resume from where they left off. Any function can become a coroutine with a few code changes, enabling interleaved execution on arbitrary code paths. This way, even pointer chasing on big data can effortlessly perform as if the entire dataset resided in cache.
Data access is either sequential or random. In contrast to sequential access, and despite the extensive efforts of computer architects, compiler writers, and system builders, random access to data larger than the cache has been synonymous with inefficient execution. Especially in the big data era, data processing is memory bound, and each main memory access takes approximately 200 cycles, posing a great challenge to current processors. Hardware prefetchers are ineffective because they cannot predict irregular, data-dependent memory accesses, while out-of-order execution cannot hide such long latencies because nearby instructions tend to depend on each other – code supposedly resembles a coherent train of thought, after all. Thus, the processor stalls, wasting cycles instead of executing useful work – work that amounts to up to 5 instructions per cycle on modern Intel microarchitectures.
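To make the problem concrete, consider the canonical case of pointer chasing. The following minimal sketch (the Node and sum_keys names are ours, for illustration) shows why neither hardware mechanism helps:

    // Pointer chasing: the address of each load depends on the value
    // returned by the previous load, so the hardware prefetcher cannot
    // predict it, and out-of-order execution finds no independent
    // instructions to run in the meantime.
    struct Node { long key; Node* next; };

    long sum_keys(const Node* head) {
        long sum = 0;
        for (const Node* n = head; n != nullptr; n = n->next)
            sum += n->key;  // stalls ~200 cycles whenever *n misses in cache
        return sum;
    }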
Big data workloads typically contain enough other work to execute while fetching data. Besides large datasets, these workloads exhibit a high degree of data- and task-parallelism: although the instructions within each task depend on each other, they are independent across parallel tasks. To hide memory latency, we can exploit these independent instructions through interleaved task execution: upon a main memory access in one task, we can ‘context switch’ to other tasks and thereby keep the processor busy with instructions, returning to the first task only when its data is in the cache. Such task interleaving can in principle hide any latency, given (a) enough parallel work and (b) a fast-enough task-switch mechanism [1]. While we have been hiding I/O latency with asynchronous operations for decades, we have no established analogue for the latency of main memory. Hardware-based interleaving in the form of simultaneous multithreading has proven ineffective, as mainstream processors offer only coarse-grained control over context switching (via OS threading) and support only a limited number of hardware contexts (2 per core on Intel, 8 on IBM POWER), which often comprise less work than necessary to completely hide memory latency. On the software side, compiler optimizations (e.g., strip mining, loop distribution) and manual techniques (e.g., group prefetching, asynchronous memory access chaining) can effectively hide latency, yet they either work only for limited code patterns or increase code complexity to a prohibitive extent.
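As an illustration of the manual techniques, here is a minimal sketch of group prefetching for a batch of array lookups, assuming a GCC/Clang-style compiler (for __builtin_prefetch); the names G and gather are hypothetical:

    #include <cstddef>
    #include <vector>

    // Group prefetching: process lookups in groups of G, issuing the
    // prefetches for all G accesses before consuming any of them, so
    // that the cache misses overlap instead of occurring one by one.
    constexpr std::size_t G = 8;  // tuned to how many misses the core can overlap

    void gather(const std::vector<long>& table,
                const std::vector<std::size_t>& idx,
                std::vector<long>& out) {
        std::size_t n = idx.size();
        std::size_t base = 0;
        for (; base + G <= n; base += G) {
            for (std::size_t i = 0; i < G; ++i)   // stage 1: start G misses
                __builtin_prefetch(&table[idx[base + i]]);
            for (std::size_t i = 0; i < G; ++i)   // stage 2: consume fetched lines
                out[base + i] = table[idx[base + i]];
        }
        for (; base < n; ++base)                  // leftover tail, unprefetched
            out[base] = table[idx[base]];
    }

This works well for flat loops like the one above, but once the random access hides deep inside a call stack, splitting every function along the prefetch/consume boundary quickly becomes unmanageable – exactly the complexity problem that coroutines sidestep.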
Recently, we proposed interleaving with coroutines [2], a general-purpose software technique that enables interleaved execution in task-parallel code. With very few changes that preserve the original code structure, our technique enables each function that incurs main memory accesses to suspend its execution upon such an access, pass control to other code, and resume roughly after the data has been fetched into the L1 cache. Functions that suspend and resume execution in this way are called coroutines, and they have recently been approved for C++20. C++ coroutines compose easily, enabling interleaving on complex code paths where suspensions need to happen deep in a call stack, e.g., when reconstructing tuples with multiple attributes of different datatypes in a dictionary-compressed column store. Given that each suspension-resumption pair costs the equivalent of two function calls, a meaningful amount of other work can be executed between suspension and resumption, enabling execution with L3-like performance. Hiding access latency to DRAM and NVM on both local and remote NUMA nodes, interleaving with coroutines offers speedups of up to 8x for various database operations, such as index joins between relational tables [2,3,4], MULTIGET operations in key-value stores [5], and tuple reconstruction in column stores [4].
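To show what the technique looks like in code, below is a self-contained sketch using C++20 coroutines. The task, prefetch_and_suspend, and interleave names are ours for illustration and simplify the machinery described in [2]; __builtin_prefetch again assumes GCC/Clang:

    #include <coroutine>
    #include <cstddef>
    #include <vector>

    // A coroutine handle wrapper that a scheduler resumes manually.
    struct task {
        struct promise_type {
            task get_return_object() {
                return task{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
        std::coroutine_handle<promise_type> handle;
    };

    // Awaitable: issue a prefetch for addr, then suspend so other tasks
    // can run while the cache line is being fetched.
    struct prefetch_and_suspend {
        const void* addr;
        bool await_ready() const noexcept { return false; }
        void await_suspend(std::coroutine_handle<>) const noexcept {
            __builtin_prefetch(addr);  // non-binding hint; starts the access
        }
        void await_resume() const noexcept {}
    };

    // One lookup task: pointer chasing that suspends before each
    // potentially cache-missing dereference.
    struct Node { long key; Node* next; };

    task find(Node* head, long key, bool& found) {
        for (Node* n = head; n != nullptr; n = n->next) {
            co_await prefetch_and_suspend{n};  // hide the miss on *n
            if (n->key == key) { found = true; co_return; }
        }
        found = false;
    }

    // Round-robin scheduler: switches between suspended lookups instead
    // of stalling on any single one, destroying each frame when done.
    void interleave(std::vector<task>& tasks) {
        while (!tasks.empty()) {
            for (std::size_t i = 0; i < tasks.size(); ) {
                tasks[i].handle.resume();
                if (tasks[i].handle.done()) {
                    tasks[i].handle.destroy();
                    tasks[i] = tasks.back();
                    tasks.pop_back();
                } else {
                    ++i;
                }
            }
        }
    }

    // Usage sketch: one coroutine per lookup, interleaved round-robin.
    void find_all(Node* head, const std::vector<long>& keys, bool* found) {
        std::vector<task> tasks;
        for (std::size_t k = 0; k < keys.size(); ++k)
            tasks.push_back(find(head, keys[k], found[k]));
        interleave(tasks);
    }

With enough lookups in flight, by the time each coroutine is resumed its prefetched cache line has typically arrived; a production implementation would additionally pool coroutine frames to keep allocation off the critical path.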
Interleaving with coroutines has a wide application scope, is easy to implement, and offers remarkable speedups. Random accesses no longer imply memory stalls, enabling us to rethink our data structures – with more indirection and less partitioning. Further, opportunities arise for hardware-software co-design: Can we ask the cache whether a piece of data is there, so that we suspend only when necessary? Even better, can we eliminate the cost of coroutine suspension and resumption with instructions that swap the entire register file, and/or with hardware contexts managed by the application? Even if answering these questions takes time, plenty of important use cases are waiting to be interleaved as soon as coroutine support becomes mainstream.
References
[1] James Laudon, Anoop Gupta, and Mark Horowitz. 1994. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). ACM, New York, NY, USA, 308–318.
[2] Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2017. Interleaving with Coroutines: A Practical Approach for Robust Index Joins. PVLDB 11, 2 (October 2017), 230–242.
[3] Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2018. Interleaving with coroutines: a systematic and practical approach to hide memory latency in index joins. The VLDB Journal (14 Dec 2018).
[4] Georgios Psaropoulos, Ismail Oukid, Thomas Legler, Norman May, and Anastasia Ailamaki. 2019. Bridging the latency gap between NVM and DRAM for latency-bound operations. In Proceedings of the 15th International Workshop on Data Management on New Hardware (DaMoN’19). ACM, New York, NY, USA.
[5] Christopher Jonathan, Umar Farooq Minhas, James Hunter, Justin Levandoski, and Gor Nishanov. 2018. Exploiting Coroutines to Attack the "Killer Nanoseconds". PVLDB 11, 11 (July 2018), 1702–1714.
Hermano Lustosa
Good article. The Cimple paper at PACT 2018 is very much related to what you are describing in the blog post. (https://dl.acm.org/citation.cfm?id=3243185)