Implementing High Availability using Zookeeper's lock mechanism with Apache Curator API

Implementing High Availability using Zookeeper's lock mechanism with Apache Curator API

I am currently working on developing and enhancing a Big Data based reporting system responsible for performing analytics and generating custom reports that lets users slice and dice data the way they want. The heart of this system is called Orchestration node which is responsible for running all processing jobs. The jobs are scheduled and triggered via Cron. However it also has become the single point of failure for entire application.

The idea is to introduce another Orchestration node such that at a given time only Node is in Active state. The other node should remain Passive as long as the Active one is up and running. Under no circumstances should both the nodes become Active otherwise they would mess up the system.

Solution:

A lot of googling and reading forums suggested using Zookeeper to solve this problem. Zookeeper is a highly distributed, reliable coordination and state management system but not easy to use. So we used Apache Curator alongside which is a Java/JVM client library for Apache ZooKeeper that includes a highlevel API framework and utilities to make using Zookeeper much easier and more reliable. Curator provides APIs to acquire lock on zookeeper which can then be used to take control and do processing. In addition RCron was used which ensures that Cron jobs are triggered from only the node that is currently in Active state.

A new orchestration node was added to existing system (in addition to existing one) and the custom solution was deployed on both the nodes. The solution tries to take lock over a fixed Zookeeper znode. If lock is acquired then it changes its state to Active. RCron detects this change and starts running jobs from this node. The solution also keeps on checking at a regular interval to verify if connection to Zookeeper is still on or not. The other node keeps on waiting indefinitely for lock to be released to that it can acquire it and become Active.

The node which acquires lock will release it if it goes down, or goes out of network, or the process which has taken lock gets killed accidentally. To deal if network glitch, we keep on checking if connection to Zookeeper is still active and release it as soon as connection is lost. To handle the node failure issue, the solution is triggered via cron. Every time cron is triggered it checks if the job is still running. If found running it does nothing otherwise it changes its state to Passive and waits for lock to be released. Since the job is triggered via cron (say every 10 secs) if it gets accidentally killed also the cron will retrigger it in some time.

This way we were able to achieve at a reliable solution to implement High Availability in our system. I hope you find it of use. Suggestion and improvement ideas are welcome.




To view or add a comment, sign in

More articles by Pushkin Gupta

Others also viewed

Explore content categories