The Data Cook: ZooKeeper, Its Importance and Installation

Introduction

Apache ZooKeeper is a distributed coordination service that provides configuration management, synchronization, and naming for large-scale distributed systems. It gives the nodes of a cluster a consistent, fault-tolerant place to share state.

Key Features of ZooKeeper:

✅ Leader Election
✅ Configuration Management
✅ Distributed Synchronization
✅ Naming Service
✅ Failure Detection
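
To make the leader election feature a little more concrete, here is a minimal sketch of the idea behind the commonly used ZooKeeper election recipe: every candidate creates an ephemeral sequential znode, and the candidate holding the lowest sequence number acts as leader. The znode names and the pick_leader helper are illustrative assumptions; the function only ranks the names it is handed and never talks to a real server.

```shell
# pick_leader: given candidate znode names (one per argument) that end in
# "_<zero-padded sequence number>", print the name with the lowest number.
# Illustrative only: names are passed in by hand, not read from ZooKeeper.
pick_leader() {
    # 1. Prefix every name with its sequence number (text after the last "_").
    # 2. Sort as plain text; fixed-width zero-padded digits sort correctly.
    # 3. Keep the first line and drop the sort key again.
    printf '%s\n' "$@" |
        awk -F_ '{ print $NF " " $0 }' |
        LC_ALL=C sort |
        head -n 1 |
        cut -d ' ' -f 2-
}
```

For example, pick_leader n_0000000012 n_0000000007 n_0000000009 prints n_0000000007. When that znode disappears because its owner crashed, the next lowest candidate takes over.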

ZooKeeper in HBase

HBase is a distributed NoSQL database that relies on ZooKeeper for coordination.

🔹 HBase uses ZooKeeper for:
✅ Master Election – Ensures only one HMaster is active.
✅ RegionServer Coordination – Tracks active RegionServers.
✅ Failure Detection – Detects RegionServer crashes and triggers reassignments.
✅ Metadata Location – Stores the location of the hbase:meta catalog table, which in turn tracks tables and their regions.

👉 Without ZooKeeper: HBase cannot assign or manage regions effectively, leading to inconsistency and failure.

ZooKeeper in Kafka

Kafka is a distributed event streaming platform. Releases that predate KRaft mode, including the Kafka 2.3.0 used in this setup, require ZooKeeper to manage brokers.

🔹 Kafka uses ZooKeeper for:
✅ Broker Coordination – Tracks active brokers in the cluster.
✅ Topic Management – Stores metadata about topics, partitions, and replicas.
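✅ Leader Election – Elects the controller broker, which in turn assigns the leader for each partition.
✅ Consumer Group Management – Kept consumer offsets for the legacy consumer; since Kafka 0.9, offsets are stored in the internal __consumer_offsets topic by default.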

👉 Without ZooKeeper: Kafka brokers cannot coordinate, leading to potential data loss or unavailability.

ZooKeeper in Sqoop

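Sqoop is a tool for importing and exporting data between HDFS and relational databases.

🔹 Sqoop's ties to ZooKeeper are mostly indirect:
✅ HBase Imports – When a job writes into HBase (for example with --hbase-table), the HBase client inside the job locates the cluster through ZooKeeper.
✅ Job Metadata – Saved jobs are kept in the Sqoop Metastore, which is backed by its own embedded database rather than by ZooKeeper.
✅ Parallel Transfers – Parallel data transfer is handled by the MapReduce tasks that Sqoop launches, not by ZooKeeper.

👉 Without ZooKeeper: Plain HDFS imports and exports still run, but Sqoop jobs that write to HBase cannot locate the cluster.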

ZooKeeper Installation

Step 1: Install Java

ZooKeeper runs on Java, so make sure Java (preferably Java 8) is installed.

sudo apt update
sudo apt install default-jdk

java -version
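
If you want to script the Java check instead of reading the output by eye, the sketch below may help. It assumes the usual version line printed by the java launcher (a quoted version string such as "1.8.0_292" or "11.0.11"); the helper names java_major and installed_java_version are made up for this example.

```shell
# java_major: turn a Java version string into its major version number.
# Handles the legacy "1.8.0_292" scheme (Java 8) as well as the newer
# "11.0.11" / "17" scheme (Java 9 and later).
java_major() {
    case "$1" in
        # Legacy scheme: the major version is the second field (1.8.0_292 -> 8).
        1.*) echo "$1" | cut -d . -f 2 ;;
        # Newer scheme: the major version is the first field (11.0.11 -> 11).
        *)   echo "$1" | cut -d . -f 1 | cut -d - -f 1 ;;
    esac
}

# installed_java_version: extract the quoted version string from the launcher.
# The version banner goes to stderr, hence the 2>&1.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}
```

A possible use is [ "$(java_major "$(installed_java_version)")" -ge 8 ] && echo "Java is new enough", which prints the message only when the installed major version is 8 or higher.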

Step 2: Create a Dedicated User for ZooKeeper

Create a dedicated user for security and easier management. You can reuse the user created for building the Hadoop ecosystem (hdoop) or create a new one by following the steps on the Hadoop installation page.

Step 3: Download and Install ZooKeeper

You need to download and install a ZooKeeper version that is compatible with the rest of the Hadoop ecosystem. I have Hadoop 3.3.0 in my VM. The compatible HBase release for this version of Hadoop is HBase 2.3.0, and the compatible ZooKeeper release for HBase 2.3.0 is ZooKeeper 3.5.9, which also works with the Kafka 2.3.0 I have on my system.

Navigate to the ZooKeeper archive page, copy the link for apache-zookeeper-3.5.9-bin.tar.gz, and download it using the wget command.

wget https://archive.apache.org/dist/zookeeper/zookeeper-3.5.9/apache-zookeeper-3.5.9-bin.tar.gz
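
Before extracting, it is worth confirming that the download is intact. Apache release directories normally carry a checksum file next to each artifact; the sketch below assumes you have fetched such a file and that its first whitespace-separated field is the hex SHA-512 digest. Both that layout and the verify_sha512 helper are assumptions, so check the archive page for the exact file name and format.

```shell
# verify_sha512: compare a file's SHA-512 digest with the digest stored in a
# companion checksum file. Returns 0 on a match, non-zero otherwise.
# Assumes the checksum file starts with the hex digest ("<digest>  <name>").
verify_sha512() {
    file=$1
    sumfile=$2
    # Digest published alongside the download (first field, lower-cased).
    expected=$(awk 'NR == 1 { print tolower($1) }' "$sumfile")
    # Digest of the file that actually landed on disk.
    actual=$(sha512sum "$file" | awk '{ print $1 }')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

A call such as verify_sha512 apache-zookeeper-3.5.9-bin.tar.gz apache-zookeeper-3.5.9-bin.tar.gz.sha512 && echo "checksum OK" prints the message only when the two digests agree.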

Extract the downloaded tarball:

sudo tar -xzf apache-zookeeper-*.tar.gz

Create a Symlink:

ln -s /home/hdoop/apache-zookeeper-3.5.9-bin /home/hdoop/zookeeper

Ensure hdoop (or the new user created in step 2) fully owns the zookeeper folder:

sudo chown -R hdoop:hdoop /home/hdoop/zookeeper

sudo chown -R hdoop:hdoop /home/hdoop/apache-zookeeper-3.5.9-bin

Step 4: Set Up the ZooKeeper Data Directory

Create a directory where ZooKeeper will store its data, and ensure the user (hdoop or the one created in step 2) fully owns it.

sudo mkdir -p /home/hdoop/zookeeper/data

sudo chown hdoop:hdoop /home/hdoop/zookeeper/data

Step 5: Configure ZooKeeper

Create a configuration file from the sample provided in the conf folder.

cp /home/hdoop/zookeeper/conf/zoo_sample.cfg /home/hdoop/zookeeper/conf/zoo.cfg

nano /home/hdoop/zookeeper/conf/zoo.cfg

Add or modify the following basic configuration:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hdoop/zookeeper/data
clientPort=2181
maxClientCnxns=60

Save and exit the file (ctrl+x, Y, enter)
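
One detail that is easy to miss: initLimit and syncLimit are counted in ticks, not in milliseconds, so their real duration depends on tickTime. They only matter once more than one server is involved, but it helps to know what the numbers mean. The sketch below reads a zoo.cfg-style file and converts the limits to milliseconds; cfg_value and limit_ms are hypothetical helper names, not part of ZooKeeper.

```shell
# cfg_value KEY FILE: print the value of KEY from a zoo.cfg-style file.
# Comment lines are skipped; if a key appears twice, the last one wins.
cfg_value() {
    awk -F= -v key="$1" '
        /^[[:space:]]*#/ { next }
        { k = $1; gsub(/[[:space:]]/, "", k) }
        k == key { v = $2; gsub(/[[:space:]]/, "", v); val = v }
        END { print val }
    ' "$2"
}

# limit_ms KEY FILE: convert a tick-based limit (initLimit or syncLimit)
# to milliseconds by multiplying it with tickTime.
limit_ms() {
    ticks=$(cfg_value "$1" "$2")
    tick=$(cfg_value tickTime "$2")
    echo $(( ticks * tick ))
}
```

With the values above, limit_ms initLimit /home/hdoop/zookeeper/conf/zoo.cfg works out to 20000 (10 ticks of 2000 ms) and limit_ms syncLimit to 10000.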

Step 6: Set Up the systemd Service

Create a systemd service file for ZooKeeper:

sudo nano /etc/systemd/system/zookeeper.service

Copy and paste the content below into this file, then save and exit (Ctrl+X, Y, Enter):

[Unit]
Description=Apache ZooKeeper Service
After=network.target
[Service]
Type=forking
User=hdoop
Group=hdoop
ExecStart=/home/hdoop/zookeeper/bin/zkServer.sh start /home/hdoop/zookeeper/conf/zoo.cfg
ExecStop=/home/hdoop/zookeeper/bin/zkServer.sh stop
Restart=always
WorkingDirectory=/home/hdoop/zookeeper
#PIDFile=/home/hdoop/zookeeper/zookeeper_server.pid
[Install]
WantedBy=multi-user.target

This allows ZooKeeper to be managed by systemd, so it can start automatically at boot.
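
A typo in the ExecStart path is a common reason for the service failing to start, so a quick sanity check of the unit file can save some head-scratching. The sketch below pulls the ExecStart line out of a unit file and checks that the program it names exists and is executable. The helper names unit_value and check_unit_exec are made up for this example, and the parsing is deliberately simple (it assumes one KEY=value per line).

```shell
# unit_value KEY FILE: print the value of the first KEY=... line in a unit file.
unit_value() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

# check_unit_exec FILE: succeed only if the program named in ExecStart
# exists and is executable.
check_unit_exec() {
    cmd=$(unit_value ExecStart "$1")
    # The program is the first word of the command line.
    prog=${cmd%% *}
    [ -n "$prog" ] && [ -x "$prog" ]
}
```

For instance, check_unit_exec /etc/systemd/system/zookeeper.service || echo "ExecStart program not found" flags a wrong path early. For a fuller check, systemd also ships systemd-analyze verify, which inspects a unit file for errors.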

Reload systemd so that it recognizes the newly created ZooKeeper service:

sudo systemctl daemon-reload

Start and enable ZooKeeper Service:

sudo systemctl start zookeeper

sudo systemctl enable zookeeper

Check the status of the ZooKeeper service:

sudo systemctl status zookeeper
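
systemd only tells you that the process is running. To ask ZooKeeper itself, you can send it the four-letter command srvr on the client port and look at the Mode line in the reply. The sketch below assumes a netcat binary called nc is installed and that srvr is allowed by the server's four-letter-word whitelist; zk_srvr and zk_mode are hypothetical helper names.

```shell
# zk_srvr [HOST] [PORT]: send the "srvr" command to a ZooKeeper server and
# print its reply. Assumes netcat is available as "nc".
zk_srvr() {
    printf 'srvr' | nc "${1:-127.0.0.1}" "${2:-2181}"
}

# zk_mode: read a "srvr" reply on stdin and print the value of its "Mode:"
# line, for example standalone, leader, or follower.
zk_mode() {
    sed -n 's/^Mode:[[:space:]]*//p' | head -n 1
}
```

On the single-node setup from this post, zk_srvr | zk_mode should report standalone once the server is up.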

Step 7: Connect to the ZooKeeper Service

Check whether you can connect to the ZooKeeper server using the command below:

/home/hdoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

On a successful connection, the client shows a prompt similar to [zk: 127.0.0.1:2181(CONNECTED) 0].

Perform an ls command at the ZooKeeper server command prompt:

ls /

On a fresh installation, the output is typically a single entry, [zookeeper]; services such as HBase and Kafka add their own znodes once they connect.
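
If you would rather run this check from a script than from the interactive prompt, zkCli.sh can usually be given the command as trailing arguments. Its output is mixed with log lines, so the sketch below filters out the bracketed list and prints one znode per line. The znode_names helper is made up for this example, and it assumes the listing appears on a line of its own in the form [a, b, c].

```shell
# znode_names: read zkCli.sh output on stdin, find the last line that is a
# bracketed list such as "[hbase, zookeeper]", and print one znode per line.
znode_names() {
    grep '^\[.*]$' |
        tail -n 1 |
        sed 's/^\[//; s/]$//' |
        tr ',' '\n' |
        sed 's/^ *//; /^$/d'
}
```

A possible invocation is /home/hdoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181 ls / 2>/dev/null | znode_names, which makes it easy to test for a particular entry with grep -x.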


Sunday, 9 February 2025
