Issue
We are running Kafka 2.7.1 on a cluster of 5 Kafka machines (Linux RHEL 7.6).
We want to restart Kafka on all 5 brokers. Since this is a production cluster, a rolling restart should be the right approach.
So we wrote a bash script that does the following:
- stop the kafka01 broker (with the systemctl command)
- verify that no Kafka PID (process) remains
- start the kafka01 broker
- verify that kafka01 is listening on Kafka port 6667
The script then repeats steps 1-4 on kafka02, kafka03, kafka04, and kafka05.
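For illustration, steps 1-4 for a single broker could be sketched roughly as below. This is not the script from the question. It assumes ssh access to each broker with sudo rights, a systemd unit named "kafka", and nc (netcat) on the machine running the script. The helper names (retry, kafka_pid_gone, port_open, restart_broker) and the retry numbers are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-4 for one broker.
# Assumptions (not from the post): ssh + sudo on each broker,
# a systemd unit called "kafka", and nc installed locally.
KAFKA_UNIT="${KAFKA_UNIT:-kafka}"
KAFKA_PORT="${KAFKA_PORT:-6667}"

# Run "$@" up to $1 times, sleeping $2 seconds between failed attempts.
retry() {
  local tries="$1" delay="$2"
  shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then sleep "$delay"; fi
  done
  return 1
}

# True only when pgrep ran on the broker and matched nothing (exit code 1).
# An ssh failure (exit code 255) is not treated as "process gone".
# The escaped dot keeps pgrep from matching the remote shell's own command line.
kafka_pid_gone() {
  local rc=0
  ssh "$1" "pgrep -f 'kafka\\.Kafka'" >/dev/null || rc=$?
  [ "$rc" -eq 1 ]
}

# True when the broker accepts TCP connections on the Kafka port.
port_open() {
  nc -z "$1" "$KAFKA_PORT"
}

restart_broker() {
  local host="$1"
  ssh "$host" "sudo systemctl stop $KAFKA_UNIT"  || return 1  # step 1
  retry 30 10 kafka_pid_gone "$host"             || return 1  # step 2
  ssh "$host" "sudo systemctl start $KAFKA_UNIT" || return 1  # step 3
  retry 60 10 port_open "$host"                  || return 1  # step 4
}
```

A caller would run something like `restart_broker kafka01 || exit 1` and stop the whole rollout on the first failure, so a broker that does not come back is never followed by a second broker going down.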
We also want to add one more check before continuing to the next Kafka broker: verify that no corrupted indexes are reported in the Kafka log (server.log) after the broker starts.
But we are not sure whether this additional step is needed.
NOTE - after a Kafka broker restart, server.log usually shows Kafka trying to fix corrupted indexes, which can take from about 10 minutes to 1 hour.
Solution
kafka-topics.sh --bootstrap-server kafka01:9092 --describe --under-replicated-partitions | wc -l
After restarting a broker, make sure there are no under-replicated partitions before moving on to the next broker (the command above should print 0).
To avoid an outage while one broker is down, make sure all your topics use a replication factor of 3 and min.insync.replicas=2.
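One way to use this check in a restart script is to poll until the count reaches zero. The following is a minimal sketch, not a tested production script. The BOOTSTRAP default, the KAFKA_TOPICS variable, the function names, and the retry numbers are assumptions for illustration. Point BOOTSTRAP at the listener your brokers actually use.

```shell
#!/usr/bin/env bash
# Sketch only. BOOTSTRAP, KAFKA_TOPICS and the retry numbers are assumptions.
BOOTSTRAP="${BOOTSTRAP:-kafka01:9092}"
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"

# Print the number of under-replicated partitions (non-empty output lines).
# Returns non-zero if the CLI itself fails, so an unreachable broker
# is not mistaken for a fully replicated cluster.
count_urp() {
  local out
  out="$("$KAFKA_TOPICS" --bootstrap-server "$BOOTSTRAP" \
          --describe --under-replicated-partitions)" || return 1
  printf '%s\n' "$out" | grep -c . || true
}

# Poll until the count is 0, at most $1 times, sleeping $2 seconds in between.
wait_for_no_urp() {
  local max_tries="${1:-60}" interval="${2:-30}" try=1 urp
  while [ "$try" -le "$max_tries" ]; do
    if urp="$(count_urp)"; then
      if [ "$urp" -eq 0 ]; then
        echo "no under-replicated partitions (check $try)"
        return 0
      fi
      echo "check $try: $urp under-replicated partition(s)"
    else
      echo "check $try: kafka-topics.sh failed, will retry" >&2
    fi
    sleep "$interval"
    try=$((try + 1))
  done
  echo "still not in sync after $max_tries checks" >&2
  return 1
}
```

In a rolling restart this would run after each broker comes back, for example `wait_for_no_urp 120 30 || exit 1`, so the script stops instead of taking down a second broker while replicas are still catching up.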
To avoid moving the controller role around twice, check which broker is the Kafka controller and restart it last.
zookeeper-shell.sh [ZK_IP] get /controller
When the controller broker is restarted, the controller role is assigned to another running broker, so restarting it last ensures the role is handed over only once.
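The ordering can be worked out in the script itself. The sketch below assumes the /controller znode holds JSON containing a "brokerid" field, and that you already know the numeric broker ids of your brokers. The function names and the ZooKeeper address in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the ids used below are illustrative.

# Read zookeeper-shell.sh output on stdin and print the controller's broker id.
# Connection and watcher lines are ignored; only the JSON line matches.
controller_id() {
  sed -n 's/.*"brokerid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the given broker ids, one per line, with the controller moved last.
# Usage: controller_last <controller_id> <id>...
controller_last() {
  local controller="$1" id found=""
  shift
  for id in "$@"; do
    if [ "$id" = "$controller" ]; then
      found=1
    else
      echo "$id"
    fi
  done
  [ -z "$found" ] || echo "$controller"
}

# Example (not executed here; zk01:2181 is a hypothetical address):
#   cid="$(zookeeper-shell.sh zk01:2181 get /controller 2>/dev/null | controller_id)"
#   controller_last "$cid" 1 2 3 4 5
```

Mapping broker ids back to host names (kafka01 and so on) is left out because it depends on how broker.id is assigned in your cluster.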
Answered By - Ran Lupovich