Futex (Fast user space Mutexes)/Understanding Linux kernel series/Synchronization primitives (part 5)

Introduction

This is latest in series of Linux kernel synchronization primitives articles. We are going to talk about "Futex" (Fast user space locking/mutex) which was introduced in Linux kernel way back in 2.5.7 release as system call. Futex is special in the sense that it is not a synchronization primitives by itself but allow us to write synchronization primitive with its help.

Why we need Futex when we have other synchronization primitives that allow us to wait on lock, such as semaphores. Using pure semaphores as synchronization primitive has overhead of system call when lock is not contended. semaphores are considered heavy weight and thus incur more memory/cpu cost.

So what is Futex. In simplest term, Futex provides a very convenient way of putting process/thread in wait queue when lock in contended and wake it up later on when lock is available. If you have read earlier articles of this series, you would know that when lock is contended (i.e. 2 processes/threads wants to lock same lock primitive), one process/thread is put into wait queue, and other process/thread which has lock wakes it when its done with lock.

Before we move further, I wanted to emphasize the "user space" part of Futex. Until this article, all synchronization primitives that we have seen in this series such as RCU/spinlock/semaphore etc, locks and synchronization logic completely existed in "kernel space". This means those primitives were not generally useful for implementing "user space synchronization library" such as pthread library. Futex fills that gap. In case of futex, half of synchronization logic resides in user space, such as lock representation, acquiring lock in "non-contended" case, while other half of logic resides in kernel space which is wait/wake logic in case lock is contended.

Below is Futex system call API.

int futex(int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

For our discussion purpose, Only below fields are important

  1. uaddr : This is address of memory location which represent our synchronization primitive.
  2. op : This is where we tell Futex to either wait or wake.
  3. val : This is value against which value present at "uaddr" is tested during wait and number of process to wake up during wake.
  4. timeout : This is only used during wait. It specifies how much time to wait during WAIT operation.

There are 2 interesting (and most used) values for "op" argument.

  1. FUTEX_WAIT : When this op is specified, value present at "uaddr" is compared with "val" argument. If it is same, futex puts current processes to wait queue that is associated with "uaddr" (Different uaddr have different wait queue) and preempts the process (aka put process to sleep). Please note kernel figures out if 2 different virtual "uaddr" are pointing to same "physical address", and use same queue for them. This allows 2 different processes to use "shared memory" and put lock there and do Inter Process Communication or synchronization.
  2. FUTEX_WAKE : This wakes up at most "val" number of process waiting in wait queue associated with "uaddr".

Lets say we need to write a simple Mutex using Futex, it could be written as below. This example is just to demonstrate futex system call usage and not meant as production ready implementation.

#include <linux/futex.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <stddef.h>


#define LOCKED   1
#define UNLOCKED 0


struct mutex 
{
  int lock;
};


int cmpxchg(int * ptr, int old, int new)
{
  int ret;
  volatile int *__ptr = (volatile int *)(ptr);
  asm volatile("lock; cmpxchgl %2,%1"
     : "=a" (ret), "+m" (*__ptr)
     : "r" (new), "0" (old)
     : "memory");


  return ret;
}


void lock(struct mutex *m)
{
  while(cmpxchg(&m->lock, UNLOCKED, LOCKED) == LOCKED)
  {
    syscall(__NR_futex, &m->lock, FUTEX_WAIT, LOCKED, NULL, NULL, 0); 
  }
}


void unlock(struct mutex *m)
{
  cmpxchg(&m->lock, LOCKED, UNLOCKED);
  syscall(__NR_futex, &m->lock, FUTEX_WAKE, 1, NULL, NULL, 0); 
}

First thing that you might notice is, instead of futex() system call, we have used syscall(). Reason being, glibc does not have wrapper function for futex system call, so we directly use syscall by providing the futex system call number and passing parameters as rest of arguments to syscall.

Why futex call was not wrapped with glibc? Futex system call is not for mere mortals, it is a specialized system call intended only for people writing user level synchronization primitive libraries (think about pthread library writers). This is explicitly mentioned in Futex man page "As these semantics involve writing nonportable assembly instructions, this in turn probably means that most users will in fact be library authors and not general application developers."

We have proved above point (need of being assembly literate) by implementing cmpxchg using assembly instruction for x86. This function compares the value of argument "old" with value present at "uaddr", and if it is same, it stores new value at that location and always return old value of uaddr if exchange was done or current value if exchange was not done.

"struct mutex" has just one field, which is "int lock".

In this example, 0 represent unlocked status and 1 represent locked status of lock.

lock() implementation is very simple, it tries to first acquire lock by putting value 1 (Locked) only if lock is current Unlocked (value 0). If this succeeds, cmpxchg would have returned old value, which is Unlocked (0), and we will not go into while loop. This means for un contended lock case, we avoid doing expensive system call.

Now 2 cases are possible if cmpxchg failed, i.e. if lock is already locked.

  1. cmpxchg fails as lock is not available, and we go inside "while loop". In this case, when we are inside futex(), lock is still not released, which means value at "uaddr" and "val=LOCKED" would match, and futex puts current process in wait queue to be woken up by another process which has lock.
  2. cmpxchg fails as lock is not available, we go inside "while loop", but meanwhile lock was released by other process (i.e. *uaddr = UNLOCKED), but our current process is unaware of it. When futex() call is executed, value at "uaddr" is UNLOCKED(0) which is not same as "val=LOCKED" and futex call return immediately with predefined error code -EWOULDBLOCK. This means we again execute cmpxchg and this time we might be able to get lock.

Now if case 1 happened, where process needing lock was put to wait queue and preempted, when unlock() is done by another process that was holding the lock, unlock() puts UNLOCKED (0) value back at uaddr and wakes up the process. We might also set it unconditionally to UNLOCKED but may have required a memory barrier. cmpxchg act as memory barrier as it is atomic operation on memory, so we avoided explicit memory barrier.

You might be wondering why we needed futex call in unlock() if there was no waiter. This is one more optimization we could do if instead of just having 2 status of lock which is UNLOCKED & LOCKED, we have 3 status UNLOCKED, LOCKED_NO_WAITER & LOCKED_WAITERS. With 3 status we should be able to disambiguate between waiter and no waiter case and avoid wake call if there are no waiters.

By looking at above program, you might have also realized that most of heavy lifting is already done by futex system call and that makes synchronization library authors life easier.

With these series of articles, we have realized that low level synchronization is very complex and vast field as this series already contains 5 articles with this subject. Writing correct system level program is infinitely more difficult because of these complex concepts that every system programer needs to master. System programmer (C/assembly/kernel) have been cursed with these complex concepts.

Application programmer (Java/Kotlin/Scala/Python) have been blessed with not needing to understand these complex things. This blessing is actually a double edge sword for application programmer as it allows Bozos to join their leagues and ultimately degrades the application programming space.

This series is dedicated to under appreciated system programmers.

Thanks for reading this article, and I hope that you have enjoyed this series until now.

To view or add a comment, sign in

More articles by Rajeev M.

  • Bitcoin Mania

    At this point I am guessing that everyone must have heard about Bitcoin. There are two separate concepts that get mixed…

  • Where is everyone

    Fermi's Paradox "Where is everyone?", Enrico Fermi asked this question in 1950. Question was basically about paradox…

    1 Comment
  • The greatest story ever told (We are not special)

    1755 Lisbon, Portugal, "1 November, All Saints Day", Holiest of holy day. People are gathering in churches to celebrate…

    1 Comment
  • The Golden Age of Programming

    Introduction Imagine you are a programmer in 90s and working on a first person shooter game where you need to…

    1 Comment
  • How buffer overflow attacks work

    Real Programmers always confuse Christmas and Halloween because Oct31 == Dec25. --- Andrew Rutherford Purpose of this…

  • Why God is a bad theory to explain Universe

    In this article, my intent is not to hurt anyone's feelings and I am not targeting any particular form of God or…

  • Einstein's special theory of relativity (part 3)

    Coordinate Transformation In previous articles, we have learned that both space and time is kind of personal to…

  • Einstein's special theory of relativity (part 2)

    Lorentz length contraction In part 1 of this article, we discussed about the time dilation. One of the effects of that…

  • Einstein's Special Theory of Relativity (part 1)

    Introduction In this article, I am trying to give high level idea of special theory of relativity. Some of the concepts…

  • How buffer overflow attacks work

    Real Programmers always confuse Christmas and Halloween because Oct31 == Dec25. --- Andrew Rutherford Purpose of this…

    1 Comment

Others also viewed

Explore content categories