High Availability of Processing Elements in Computer Systems

The high availability of data (specifically in distributed systems at web-scale organizations) is a hot research topic, and organizations currently use multiple schemes to achieve it. Data is certainly the black gold of these organizations, so making appropriate investments in it makes perfect sense.

One of the less discussed topics is the high availability of processing elements. Most of the available literature advocates bringing up a new instance of a processing element as soon as the current working one becomes unavailable. Through this article, I am trying to summarize some of the most commonly used methodologies I have encountered during my stint as a software developer.

  1. A (hot) standby processing element exists for each active processing element.
  2. A smaller pool of (hot) standby processing elements is available for a large pool of active processing elements.
  3. Multiple active processing elements act as standby processing elements for each other.
  4. A replacement processing element is started only once the active processing element becomes unavailable.

Each method above has a tradeoff associated with it. From (1) to (4), the time to take over active processing increases, but the resource wastage (both CPU and memory) decreases.
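As a concrete illustration of option (2), here is a minimal sketch of an N+M arrangement: a small pool of hot standbys backing a larger pool of active elements. The class and element names (`StandbyPool`, `"a1"`, `"s1"`) are hypothetical, invented for this example.

```python
from collections import deque


class StandbyPool:
    """Option (2): a small pool of hot standbys backs a larger active pool."""

    def __init__(self, active_names, standby_names):
        self.active = set(active_names)       # currently processing elements
        self.standby = deque(standby_names)   # hot spares, ready to take over

    def fail(self, name):
        """Handle the failure of an active element.

        Promotes a hot standby if one is available; returns the
        replacement's name, or None if the standby pool is exhausted.
        """
        self.active.discard(name)
        if self.standby:
            replacement = self.standby.popleft()
            self.active.add(replacement)
            return replacement
        # Pool exhausted: the system would have to fall back to a
        # cold restart, i.e. option (4), with a longer takeover time.
        return None
```

With three actives backed by one standby, the first failure is absorbed by the hot spare; a second failure finds the pool empty, which is exactly the tradeoff this option makes against option (1).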

A standby processing element taking over from a previously active processing element sounds nice, but what exactly does the new processing element take over?

The answer is the "data" the previously active processing element was working on. In some cases, it is the (stable) state of objects; in other cases (transaction-based systems), it is the list of (partially and/or completely) processed transactions. Whatever it is, this data needs to be available to the standby processing element when it takes over.

There are multiple ways in which this can be achieved. Following are a few alternatives:

  1. The standby processing element is notified of this data by the active element at regular intervals or on particular events.
  2. A shared memory facility exists where the active element stores the data and the standby element takes ownership of it after the active becomes unavailable.
  3. A common database (on the same or a different physical node) exists where the active element stores the data, which the standby element queries once the active element is no longer available.

Again, each method above has its positives and negatives. Note that alternative (2) is applicable only if both active and standby processing elements are expected to be on the same physical node, while alternatives (1) and (3) can be used in multi-node setups. Of course, the nodes in a multi-node setup could even be located on different continents.
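Alternative (1) can be sketched as follows: the active element pushes a state snapshot to its standby on every event (it could equally be on a timer), so the standby always holds a recent copy to resume from. All names here (`ActiveElement`, `StandbyElement`, the transaction keys) are illustrative assumptions, not part of any particular system.

```python
import copy


class StandbyElement:
    """Holds a shadow copy of the active element's state."""

    def __init__(self):
        self.shadow_state = {}

    def receive_checkpoint(self, state):
        # Deep-copy so the standby's snapshot cannot be mutated
        # behind its back by the active element.
        self.shadow_state = copy.deepcopy(state)

    def take_over(self):
        # On failover, resume from the last replicated snapshot.
        return self.shadow_state


class ActiveElement:
    """Alternative (1): replicate state to the standby on each event."""

    def __init__(self, standby):
        self.standby = standby
        self.state = {}

    def process(self, key, value):
        self.state[key] = value
        # Event-driven replication; a real system might batch these
        # or send them at regular intervals instead.
        self.standby.receive_checkpoint(self.state)
```

The tradeoff versus alternatives (2) and (3) is visible here: replication work is paid on every event during normal operation, in exchange for a near-instant takeover.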

Notifying the standby processing element of the active processing element's unavailability is also important and needs to be considered on an equal footing.

Common schemes used are:

  1. A Heartbeat mechanism between the active and standby elements or
  2. A mediator that exclusively monitors the active processing element (through heartbeats or otherwise) and informs the standby processing elements accordingly. Sometimes the mediator itself can be made responsible for spawning standby processing elements as needed (as with a Kubernetes configuration!).
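Scheme (1) boils down to a timeout on the last heartbeat received. Below is a minimal, hypothetical sketch; the clock is injected as a callable so the logic is deterministic and testable, and the timeout value is purely illustrative.

```python
class HeartbeatMonitor:
    """Scheme (1): declare the active element dead after missed heartbeats.

    `clock` is any callable returning the current time in seconds
    (e.g. time.monotonic in production, a fake clock in tests).
    """

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        # Called whenever a heartbeat arrives from the active element.
        self.last_beat = self.clock()

    def active_alive(self):
        # If the silence exceeds the timeout, the standby should
        # initiate its takeover procedure.
        return (self.clock() - self.last_beat) <= self.timeout
```

Choosing the timeout is the delicate part: too short and a brief network hiccup triggers a spurious (and possibly split-brain) takeover; too long and the outage window grows.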

In conclusion, we have a multitude of options available to cater to the high availability of an application, each with its tradeoffs. For each use case, we could employ one or a mix of these to achieve the required outcome.

Please provide your valuable feedback and comments if something needs to be added or modified in these methodologies.
