Bring up Zookeeper+Kafka cluster

These notes are mainly for my own reference, but you may find them useful too. The objective is to deploy a Kafka + Zookeeper cluster the quick and dirty way.

This is going to be on AWS. It assumes you can launch an EC2 instance with the latest Amazon Linux AMI.

A quick intro to Kafka and Zookeeper:

  • Kafka is an event streaming platform capable of handling trillions of events a day. It started out as a messaging queue and has since grown a lot of other features.
  • Zookeeper is a distributed coordination service built around a hierarchical key-value store. It helps manage configuration in distributed systems and comes with a lot of other features.

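Zookeeper is a must for Kafka, at least for the Kafka version used here. Zookeeper usually runs as a cluster of 3 nodes, or any odd number of nodes. The odd number matters because leader election and fail-over need a majority of the nodes to agree, and an even-sized cluster tolerates no more failures than the odd-sized cluster one node smaller.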
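To make the odd-number point concrete, here is a small sketch of the majority arithmetic. The function names are made up for illustration; the only assumption is the usual majority rule, where a quorum is more than half of the nodes.

```shell
#!/bin/sh
# Majority quorum size for a cluster of n nodes: floor(n/2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while a majority is still available
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare 3, 4 and 5 nodes: the 4th node buys no extra fault tolerance
for n in 3 4 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With 3 nodes you can lose 1, with 4 nodes you can still only lose 1, and with 5 nodes you can lose 2.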

A Kafka cluster is made up of brokers; each node is a Kafka broker. Brokers store topics, which are split into partitions that hold the messages. Multiple consumers can read the messages and process them.

Kafka brokers work well with st1 (throughput-optimized HDD) volumes on AWS, since most of the ingestion is written sequentially.

ZOOKEEPER:

sudo yum install java-1.8.0-openjdk
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.8/zookeeper-3.4.8.tar.gz
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.8/zookeeper-3.4.8.tar.gz.md5
# Compare the computed checksum with the published one
md5sum zookeeper-3.4.8.tar.gz
cat zookeeper-3.4.8.tar.gz.md5
sudo tar xvzf ~/zookeeper-3.4.8.tar.gz -C /opt/
sudo useradd zookeeper
sudo chown -R zookeeper: /opt/zookeeper-3.4.8/
sudo ln -s /opt/zookeeper-3.4.8 /opt/zookeeper
sudo chown -h zookeeper: /opt/zookeeper
sudo mkdir /var/lib/zookeeper
sudo chown zookeeper: /var/lib/zookeeper
sudo cp /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg
sudo mkdir -p /vol/zookeeper/data /vol/zookeeper/logs
sudo chown -R zookeeper: /vol/zookeeper
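# Run these once zoo.cfg and the zookeeper.service unit file (both shown below) exist
sudo systemctl daemon-reload
sudo systemctl start zookeeper.service
sudo systemctl status zookeeper.service

File contents of /opt/zookeeper/conf/zoo.cfg: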
dataLogDir=/vol/zookeeper/logs
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/vol/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60

#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=1

File contents of /etc/systemd/system/zookeeper.service:
[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=forking
User=zookeeper
Group=zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start
ExecStop=/opt/zookeeper/bin/zkServer.sh stop
ExecReload=/opt/zookeeper/bin/zkServer.sh restart
WorkingDirectory=/var/lib/zookeeper

[Install]
WantedBy=multi-user.target

This is a standalone Zookeeper instance. Create an AMI from it and use that AMI to launch more instances.
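Before baking the AMI it is worth checking that the instance actually answers. A minimal sketch, assuming nc is installed on the box (it may not be by default) and using Zookeeper's four-letter commands ruok and stat; zk_mode is a made-up helper name that just pulls the Mode line out of the stat output.

```shell
#!/bin/sh
# Extract the server mode (standalone, leader or follower) from the
# output of Zookeeper's "stat" four-letter command, read on stdin.
zk_mode() {
  sed -n 's/^Mode: //p'
}

# Example against a live server (needs nc, which is an assumption):
#   echo ruok | nc localhost 2181            # a healthy server answers "imok"
#   echo stat | nc localhost 2181 | zk_mode  # reports the server's mode
```

The same check is handy later on the 3-node cluster, where one server should report leader and the others follower.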

Once you have 3 instances, give them DNS names and add them to zoo.cfg on every instance as follows:

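# 2888 is the quorum port and 3888 the leader election port (the conventional choices)
server.1=zookeeper-1.dns.name:2888:3888
server.2=zookeeper-2.dns.name:2888:3888
server.3=zookeeper-3.dns.name:2888:3888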

Run echo 1 | sudo tee /vol/zookeeper/data/myid on the first server, and do the same with 2 and 3 on the other two. This lets the Zookeeper quorum uniquely identify each server; the id must match the N in that server's server.N line. The ids have to be in the range 1-255.
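If the instances follow a naming scheme like zookeeper-N.dns.name, the id can be derived from the hostname instead of typed by hand. This is only a sketch: myid_from_hostname is a hypothetical helper, and it assumes the number after the last dash of the short hostname is the server id.

```shell
#!/bin/sh
# Hypothetical helper: derive the Zookeeper id from a hostname such as
# zookeeper-2.dns.name (the naming scheme itself is an assumption).
myid_from_hostname() {
  short=${1%%.*}     # strip the domain part: zookeeper-2
  id=${short##*-}    # keep what follows the last dash: 2
  case $id in
    ''|*[!0-9]*)
      echo "cannot derive an id from $1" >&2
      return 1
      ;;
  esac
  # Zookeeper ids have to be in the range 1-255
  if [ "$id" -lt 1 ] || [ "$id" -gt 255 ]; then
    echo "id $id is outside 1-255" >&2
    return 1
  fi
  echo "$id"
}

# Example use on each server:
#   myid_from_hostname "$(hostname -f)" | sudo tee /vol/zookeeper/data/myid
```

The helper refuses hostnames without a numeric suffix, so a misnamed instance fails loudly instead of writing a bogus myid.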

KAFKA:

Commands:

sudo yum install java-1.8.0-openjdk
wget https://archive.apache.org/dist/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz
sudo tar xvzf ~/kafka_2.10-0.9.0.1.tgz -C /opt/
sudo useradd kafka
sudo chown -R kafka: /opt/kafka_2.10-0.9.0.1/
sudo ln -s /opt/kafka_2.10-0.9.0.1/ /opt/kafka
sudo chown -h kafka: /opt/kafka

Edit /opt/kafka/config/server.properties: set zookeeper.connect to the Zookeeper URLs (comma-separated host:port pairs) and give each broker a unique broker.id.
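As a rough sketch of that edit, the snippet below builds the zookeeper.connect value from a list of hosts. zk_connect_string is a made-up helper; the hostnames are the placeholder names used earlier, and 2181 is the clientPort from zoo.cfg.

```shell
#!/bin/sh
# Build a zookeeper.connect value (host:2181,host:2181,...) from the
# hostnames given as arguments. 2181 is the clientPort set in zoo.cfg.
zk_connect_string() {
  out=""
  for host in "$@"; do
    out="${out:+$out,}$host:2181"
  done
  echo "$out"
}

zk_connect_string zookeeper-1.dns.name zookeeper-2.dns.name zookeeper-3.dns.name

# One possible way to apply it, assuming server.properties already has a
# zookeeper.connect line (the stock file should, but check yours):
#   ZK=$(zk_connect_string zookeeper-1.dns.name zookeeper-2.dns.name zookeeper-3.dns.name)
#   sudo sed -i "s|^zookeeper.connect=.*|zookeeper.connect=$ZK|" /opt/kafka/config/server.properties
```

broker.id still has to be set by hand (or by a similar sed) to a different integer on every broker.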

Write a systemd service for Kafka like the one above, in /etc/systemd/system/kafka.service. This one exposes JMX on port 9000:

[Unit]
Description=Apache Kafka server (broker)
Documentation=http://kafka.apache.org/documentation.html
Requires=network.target remote-fs.target
After=network.target remote-fs.target zookeeper.service

[Service]
Type=simple
User=kafka
Group=kafka
Environment=JAVA_HOME=/etc/alternatives/jre
Environment=JMX_PORT=9000
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh

[Install]
WantedBy=multi-user.target