A practical example of leader election in distributed systems

Danil Tolonbekov
5 min read · Apr 10, 2021

Imagine that you have built a great application and the number of users is growing rapidly. As a consequence, you add more and more nodes to serve your users, and your system becomes distributed. The individual computers in such a group operate concurrently and allow the whole system to keep working even if one or more of them fail. Many challenges arise, and one of them is consistency. The term consistency comes from the Latin for “standing together” or “stopping together”. In general, consistency describes relationships between items that are somehow connected. When considering the consistency of data, a consistent state requires that all relationships between data items and replicas are as they should be, i.e., that the data representation is correct. A system is in a consistent state if all replicas are identical and the ordering guarantees of the specific consistency model are not violated.

One technique to deal with this problem is a leader election process. The main idea is to choose one node as the leader and route operations through that server; all other nodes become followers. Consider the following example: we have 3 processes and you want to send a WRITE request to the system. This WRITE request must be applied to all nodes; otherwise, the nodes would very soon end up in an inconsistent state. To avoid that, we can choose a leader so that only the leader node accepts WRITE requests and then propagates them to the follower nodes. READ requests, on the other hand, can be handled by both leader and non-leader nodes. This architecture, “only the leader handles WRITE requests”, is used in Apache ZooKeeper.
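To make the idea concrete, here is a tiny in-memory sketch of that architecture (the class and method names are illustrative, not from the article’s app): WRITEs go only through the leader, which propagates them to every follower, while any node can serve READs from its local replica.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One node holding a local replica of the data. READs are served locally.
class Node {
    final String name;
    final Map<String, String> replica = new HashMap<>();
    Node(String name) { this.name = name; }

    String read(String key) {
        return replica.get(key);
    }
}

// A cluster where only the leader accepts WRITE requests.
class LeaderCluster {
    private final Node leader;
    private final List<Node> followers;

    LeaderCluster(Node leader, List<Node> followers) {
        this.leader = leader;
        this.followers = followers;
    }

    // WRITEs apply to the leader first, then propagate to the followers,
    // so every replica ends up identical.
    void write(String key, String value) {
        leader.replica.put(key, value);
        for (Node f : followers) {
            f.replica.put(key, value);
        }
    }
}
```

After a `write`, a READ against any of the three nodes returns the same value, which is exactly the consistency property the leader is there to protect.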

Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Some of the distributed-systems coordination problems that Apache ZooKeeper solves are: managing cluster membership, configuration management, distributed locks, and leader election. ZooKeeper models these problems with a hierarchical namespace, much like a distributed file system. Every node in the namespace is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes as well as timestamps. There are 3 main types of znodes:

  • Persistent znodes are the default znodes in ZooKeeper. They stay on the ZooKeeper server until they are removed explicitly, even if the client that created them disconnects.
  • Ephemeral znodes (or session znodes) are temporary. Unlike persistent znodes, they are destroyed as soon as the creating client’s session ends, for example when the client logs out of the ZooKeeper server.
  • Sequential znodes get a 10-digit, monotonically increasing number appended to the end of their name. Say client1 creates a sequential znode with the name prefix node. On the ZooKeeper server, it will be named node0000000001. If client1 creates another sequential znode, it bears the next number in the sequence, so it will be called [znode name]0000000002.
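The naming scheme above is easy to reproduce: the server appends a zero-padded 10-digit counter to the requested prefix. A minimal sketch (the helper name is mine, not ZooKeeper’s):

```java
// Sketch of ZooKeeper's sequential-znode naming: a 10-digit,
// zero-padded counter is appended to the requested name prefix.
public class SequentialName {
    static String withSequence(String prefix, long counter) {
        return prefix + String.format("%010d", counter);
    }
}
```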

Clients can set a watch on these znodes and get notified when they change. A watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.
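The one-time-trigger semantics can be modeled with a toy in-memory class (this is not the real ZooKeeper API, just an illustration): a registered watcher fires on the first change only and must be re-registered to observe later changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of one-time-trigger watch semantics: all pending watchers
// fire on the first change, then the watch list is cleared.
class WatchedZnode {
    private String data;
    private final List<Consumer<String>> watchers = new ArrayList<>();

    void watch(Consumer<String> watcher) {
        watchers.add(watcher);
    }

    void setData(String newData) {
        data = newData;
        List<Consumer<String>> toFire = new ArrayList<>(watchers);
        watchers.clear(); // watches are consumed by the first event
        for (Consumer<String> w : toFire) {
            w.accept(newData);
        }
    }

    String getData() { return data; }
}
```

A watcher registered once sees only the first `setData`; to keep observing, a client must set the watch again each time it is notified, which is exactly what the election algorithm below does.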

The Apache ZooKeeper API has the following basic operations:

  • create — creates a znode at a specified path with the given data
  • delete — deletes the znode at path
  • exists — checks whether path exists
  • setData — sets the data of the znode at path to data
  • getData — returns the data at path
  • getChildren — returns the list of children under path

I have implemented a simple Spring Boot application to demonstrate one implementation of the leader election algorithm using Apache ZooKeeper. It executes the election process and stores information about the leader and the active nodes.

There are 4 main classes:

  • LeaderElectionLauncher.java launches the leader election process.

The process of leader election is as follows: at startup, each node creates an ephemeral, sequential znode under the persistent “/election” znode. Since the znode is sequential, ZooKeeper appends a unique sequence number to its name. Once this is done, the process fetches all child znodes of “/election” and looks for the child znode with the smallest sequence number. If that znode is the one created by this process, the process declares itself leader by printing the message “I am a new leader”. However, if the process’s znode does not have the smallest sequence number, it sets a watch on the znode whose sequence number is just smaller than its own.

For example, if the current process znode is “node_0000000004” and the other process znodes are “node_0000000001”, “node_0000000002”, “node_0000000003”, and “node_0000000005”, then the current process sets a watch on the znode with path “node_0000000003”. As soon as the watched ephemeral znode is removed by ZooKeeper because its process has shut down, the current process receives a watch event notification. It then fetches the child znodes of “/election” again and repeats the check for whether it is now the leader.
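The decision step described above is pure list logic and can be sketched on its own (class and method names are illustrative, not from the article’s code): given the children of “/election” and our own znode name, we are the leader iff ours has the smallest sequence number; otherwise we watch our immediate predecessor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Sketch of the election decision: returns empty if we are the leader,
// otherwise the name of the znode to set a watch on.
public class ElectionDecision {
    static Optional<String> znodeToWatch(List<String> children, String self) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // zero-padded suffixes sort in numeric order
        int idx = sorted.indexOf(self);
        if (idx == 0) {
            return Optional.empty(); // smallest sequence number: we lead
        }
        return Optional.of(sorted.get(idx - 1)); // watch the predecessor
    }
}
```

Watching only the predecessor, rather than the leader, avoids a “herd effect” where every follower wakes up on a single znode’s removal.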

...
public void execute() {
    zooKeeperClient.connect(new NodeWatcher());
    NodeInfo.setServiceName(System.getenv("SERVICE_NAME"));
    log.info("Node with name: " + NodeInfo.getServiceName() + " has started");

    // Make sure the persistent "/election" parent znode exists.
    final String electionPath = zooKeeperClient.createNode(ELECTION_PATH, NodeType.PERSISTENT);
    if (electionPath == null) {
        throw new IllegalStateException("Unable to create leader election path");
    }

    // Create this process's ephemeral, sequential child znode.
    nodePath = zooKeeperClient.createNode(electionPath.concat(NODE), NodeType.EPHEMERAL_SEQUENTIAL);
    if (nodePath == null) {
        throw new IllegalStateException("Unable to create node in leader election path");
    }
    NodeInfo.setPath(nodePath);

    log.info("[Node: " + NodeInfo.getServiceName() + "] Process node created with path: " + nodePath);

    attemptForLeader();
}
...
  • ZookeeperClient.java contains methods to interact with the Apache ZooKeeper API.

For instance, connect and create operations:

...
@Override
public void connect(Watcher watcher) {
    try {
        String zooKeeperUrl = zooKeeperProperties.getHost().concat(":").concat(zooKeeperProperties.getPort());
        zooKeeper = new ZooKeeper(zooKeeperUrl, zooKeeperProperties.getSession(), watcher);
    } catch (IOException e) {
        log.error("Unable to connect to ZooKeeper");
        e.printStackTrace();
    }
}

@Override
public String createNode(String path, NodeType nodeType) {
    String createdPath = null;
    try {
        // Only create the znode if it does not exist yet.
        final Stat stat = zooKeeper.exists(path, false);
        if (stat == null) {
            createdPath = zooKeeper.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, toNodeType(nodeType));
        } else {
            createdPath = path;
        }
    } catch (KeeperException | InterruptedException e) {
        log.error(e.getLocalizedMessage());
        throw new IllegalStateException(e);
    }

    return createdPath;
}
...
  • NodeInfo.java stores information about the leader path and the current node path.
@Component
@Scope(scopeName = "singleton")
public class NodeInfo {
    private static String serviceName;
    private static String path;
    private static String leaderNodePath;

    /* Getters and setters */
}
  • NodeInfoController.java returns the leader path name and all active nodes.

The app API:

GET localhost:8081/info

Response: Current node: /election/node_0000000001, Leader node: node_0000000000, Active nodes: [node_0000000001, node_0000000000]

GET localhost:8082/info

Response: Current node: /election/node_0000000000, Leader node: node_0000000000, Active nodes: [node_0000000001, node_0000000000]

Additionally, for convenience, I created a docker-compose.yml file to run two or more services concurrently. You can check out the full implementation on my GitHub account.

Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, just write to me.


Danil Tolonbekov

M.Sc. in Distributed Systems Engineering, Software Engineer