Futex (Fast user space Mutexes)/Understanding Linux kernel series/Synchronization primitives (part 5)

Rajeev M.

Published May 31, 2019

Introduction

This is latest in series of Linux kernel synchronization primitives articles. We are going to talk about "Futex" (Fast user space locking/mutex) which was introduced in Linux kernel way back in 2.5.7 release as system call. Futex is special in the sense that it is not a synchronization primitives by itself but allow us to write synchronization primitive with its help.

Why we need Futex when we have other synchronization primitives that allow us to wait on lock, such as semaphores. Using pure semaphores as synchronization primitive has overhead of system call when lock is not contended. semaphores are considered heavy weight and thus incur more memory/cpu cost.

So what is Futex. In simplest term, Futex provides a very convenient way of putting process/thread in wait queue when lock in contended and wake it up later on when lock is available. If you have read earlier articles of this series, you would know that when lock is contended (i.e. 2 processes/threads wants to lock same lock primitive), one process/thread is put into wait queue, and other process/thread which has lock wakes it when its done with lock.

Before we move further, I wanted to emphasize the "user space" part of Futex. Until this article, all synchronization primitives that we have seen in this series such as RCU/spinlock/semaphore etc, locks and synchronization logic completely existed in "kernel space". This means those primitives were not generally useful for implementing "user space synchronization library" such as pthread library. Futex fills that gap. In case of futex, half of synchronization logic resides in user space, such as lock representation, acquiring lock in "non-contended" case, while other half of logic resides in kernel space which is wait/wake logic in case lock is contended.

Below is Futex system call API.

int futex(int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

For our discussion purpose, Only below fields are important

uaddr : This is address of memory location which represent our synchronization primitive.
op : This is where we tell Futex to either wait or wake.
val : This is value against which value present at "uaddr" is tested during wait and number of process to wake up during wake.
timeout : This is only used during wait. It specifies how much time to wait during WAIT operation.

There are 2 interesting (and most used) values for "op" argument.

FUTEX_WAIT : When this op is specified, value present at "uaddr" is compared with "val" argument. If it is same, futex puts current processes to wait queue that is associated with "uaddr" (Different uaddr have different wait queue) and preempts the process (aka put process to sleep). Please note kernel figures out if 2 different virtual "uaddr" are pointing to same "physical address", and use same queue for them. This allows 2 different processes to use "shared memory" and put lock there and do Inter Process Communication or synchronization.
FUTEX_WAKE : This wakes up at most "val" number of process waiting in wait queue associated with "uaddr".

Lets say we need to write a simple Mutex using Futex, it could be written as below. This example is just to demonstrate futex system call usage and not meant as production ready implementation.

#include <linux/futex.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <stddef.h>


#define LOCKED   1
#define UNLOCKED 0


struct mutex 
{
  int lock;
};


int cmpxchg(int * ptr, int old, int new)
{
  int ret;
  volatile int *__ptr = (volatile int *)(ptr);
  asm volatile("lock; cmpxchgl %2,%1"
     : "=a" (ret), "+m" (*__ptr)
     : "r" (new), "0" (old)
     : "memory");


  return ret;
}


void lock(struct mutex *m)
{
  while(cmpxchg(&m->lock, UNLOCKED, LOCKED) == LOCKED)
  {
    syscall(__NR_futex, &m->lock, FUTEX_WAIT, LOCKED, NULL, NULL, 0); 
  }
}


void unlock(struct mutex *m)
{
  cmpxchg(&m->lock, LOCKED, UNLOCKED);
  syscall(__NR_futex, &m->lock, FUTEX_WAKE, 1, NULL, NULL, 0); 
}

First thing that you might notice is, instead of futex() system call, we have used syscall(). Reason being, glibc does not have wrapper function for futex system call, so we directly use syscall by providing the futex system call number and passing parameters as rest of arguments to syscall.

Why futex call was not wrapped with glibc? Futex system call is not for mere mortals, it is a specialized system call intended only for people writing user level synchronization primitive libraries (think about pthread library writers). This is explicitly mentioned in Futex man page "As these semantics involve writing nonportable assembly instructions, this in turn probably means that most users will in fact be library authors and not general application developers."

We have proved above point (need of being assembly literate) by implementing cmpxchg using assembly instruction for x86. This function compares the value of argument "old" with value present at "uaddr", and if it is same, it stores new value at that location and always return old value of uaddr if exchange was done or current value if exchange was not done.

"struct mutex" has just one field, which is "int lock".

In this example, 0 represent unlocked status and 1 represent locked status of lock.

lock() implementation is very simple, it tries to first acquire lock by putting value 1 (Locked) only if lock is current Unlocked (value 0). If this succeeds, cmpxchg would have returned old value, which is Unlocked (0), and we will not go into while loop. This means for un contended lock case, we avoid doing expensive system call.

Now 2 cases are possible if cmpxchg failed, i.e. if lock is already locked.

cmpxchg fails as lock is not available, and we go inside "while loop". In this case, when we are inside futex(), lock is still not released, which means value at "uaddr" and "val=LOCKED" would match, and futex puts current process in wait queue to be woken up by another process which has lock.
cmpxchg fails as lock is not available, we go inside "while loop", but meanwhile lock was released by other process (i.e. *uaddr = UNLOCKED), but our current process is unaware of it. When futex() call is executed, value at "uaddr" is UNLOCKED(0) which is not same as "val=LOCKED" and futex call return immediately with predefined error code -EWOULDBLOCK. This means we again execute cmpxchg and this time we might be able to get lock.

Now if case 1 happened, where process needing lock was put to wait queue and preempted, when unlock() is done by another process that was holding the lock, unlock() puts UNLOCKED (0) value back at uaddr and wakes up the process. We might also set it unconditionally to UNLOCKED but may have required a memory barrier. cmpxchg act as memory barrier as it is atomic operation on memory, so we avoided explicit memory barrier.

You might be wondering why we needed futex call in unlock() if there was no waiter. This is one more optimization we could do if instead of just having 2 status of lock which is UNLOCKED & LOCKED, we have 3 status UNLOCKED, LOCKED_NO_WAITER & LOCKED_WAITERS. With 3 status we should be able to disambiguate between waiter and no waiter case and avoid wake call if there are no waiters.

By looking at above program, you might have also realized that most of heavy lifting is already done by futex system call and that makes synchronization library authors life easier.

With these series of articles, we have realized that low level synchronization is very complex and vast field as this series already contains 5 articles with this subject. Writing correct system level program is infinitely more difficult because of these complex concepts that every system programer needs to master. System programmer (C/assembly/kernel) have been cursed with these complex concepts.

Application programmer (Java/Kotlin/Scala/Python) have been blessed with not needing to understand these complex things. This blessing is actually a double edge sword for application programmer as it allows Bozos to join their leagues and ultimately degrades the application programming space.

This series is dedicated to under appreciated system programmers.

Thanks for reading this article, and I hope that you have enjoyed this series until now.

To view or add a comment, sign in

Futex (Fast user space Mutexes)/Understanding Linux kernel series/Synchronization primitives (part 5)

Rajeev M.

Introduction

More articles by Rajeev M.

Others also viewed

Summary of the book Essential Linux Device Drivers

Increase or Decrease the Size of Static Partition in Linux

The Story of /proc: How Linux Turns the Kernel Into Files

Linux Kernel Configuration

Understanding struct usb_anchor: Simplifying USB Request Management in Linux

Adjusting SMP Affinity at Legacy Linux/Hardware

Linux File System Hierarchy

Kernel 7.0 is Coming: Why It Matters (and Why It Doesn't)

Zephyr's Kconfig

Explore content categories

Introduction

More articles by Rajeev M.

Bitcoin Mania

Where is everyone

The greatest story ever told (We are not special)

The Golden Age of Programming

How buffer overflow attacks work

Why God is a bad theory to explain Universe

Einstein's special theory of relativity (part 3)

Einstein's special theory of relativity (part 2)

Einstein's Special Theory of Relativity (part 1)

How buffer overflow attacks work

Others also viewed

Summary of the book Essential Linux Device Drivers

Increase or Decrease the Size of Static Partition in Linux

The Story of /proc: How Linux Turns the Kernel Into Files

Linux Kernel Configuration

Understanding struct usb_anchor: Simplifying USB Request Management in Linux

Adjusting SMP Affinity at Legacy Linux/Hardware

Linux File System Hierarchy

Kernel 7.0 is Coming: Why It Matters (and Why It Doesn't)

Zephyr's Kconfig

Explore content categories