A few days ago, Kafka reported a rebalancing error, and newly produced messages started failing on a few partitions. As usual, I logged into all the Kafka brokers to see what was happening. The underlying cause was that the data volume on one of the brokers was at 100% disk usage. But fixing that alone wasn't enough: a few root volumes were also under heavy disk pressure, so I decided to remove some old logs.
Kafka lives in /usr/share/kafka/ in our installations, so I meant to run rm -rf /usr/share/kafka/logs/server.log.2019-01-*. Instead, I hit enter where I meant to hit tab, and ran rm -rf /usr/share/.
Thanks to how Linux works, none of the running processes were impacted, but all the files they held open were no longer present in the directory tree. And since yum depends on files under /usr/share/, the package manager was broken too.
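That behavior, a deleted file staying alive for as long as some process holds it open, is easy to reproduce in a few lines of shell. A minimal sketch (the tail process and mktemp path are throwaway stand-ins for the Kafka process and its files):

```shell
# Simulate an open file surviving deletion.
tmpfile=$(mktemp)
echo "held by a process" > "$tmpfile"
tail -f "$tmpfile" > /dev/null &   # stands in for the kafka process
pid=$!
sleep 1                            # give tail a moment to open the file
rm "$tmpfile"                      # the path is gone from the directory tree
ls -l /proc/$pid/fd                # ...but one fd still shows "(deleted)"
kill $pid
```

The inode survives until the last file descriptor referencing it is closed, which is also why disk space doesn't come back until the holding process exits.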
We create Kafka brokers from a base AMI that has all the configuration and installation baked in. However, live migration of Kafka brokers isn't possible. So I packed the /usr/share folder from the base AMI into a gzipped tarball and pushed it out across the brokers.
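The packing step itself is nothing exotic. Here is a runnable sketch using scratch directories as stand-ins (on the real machines the source tree was / on the base AMI and the destination / on each broker, with a copy step in between):

```shell
# Pack usr/share from a source tree and unpack it into a destination tree.
# src/dst are scratch stand-ins for the base AMI and a broker.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/usr/share/doc"
echo "package data" > "$src/usr/share/doc/README"
tar czf "$src/usr-share.tar.gz" -C "$src" usr/share   # on the base AMI
tar xzf "$src/usr-share.tar.gz" -C "$dst"             # on each broker, after copying
cat "$dst/usr/share/doc/README"                       # prints: package data
```

The -C flag keeps the archive paths relative, so the same tarball unpacks cleanly at the root of any machine.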
Up to this point, none of the Kafka brokers should have been impacted. That started to change after I restored /usr/share/. The Kafka process tried to create a new log file in /usr/share/kafka/logs and discovered that the filesystem no longer had what it had earlier been pointing to. So now there were dangling (deleted-but-open) files, while the OS had fresh files with the same names. Two files with the same name and path aren't necessarily the same file: an open file descriptor is tied to an inode, not to a path.
At this point there were two options: stop the consumers and do a rolling restart of the Kafka brokers, or restore the files via their file descriptors. The latter I had never done on a live production server.
I started with the first option, which was safer and known to work, even though it's a time-consuming operation. Midway through the restarts and rebalancing of Kafka topics, I tried restoring the files via file descriptors on one of the brokers.
The moment it seemed to be working smoothly, I did the same on the rest of the brokers.
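For the curious, the file-descriptor restore boils down to copying the still-live inode out of /proc back to its old path. A runnable sketch on a throwaway file (on the real brokers the pid would come from pgrep and the fd number from lsof; everything below is an illustrative stand-in):

```shell
# Recover a deleted-but-open file from /proc/<pid>/fd.
victim=$(mktemp)
echo "important log line" > "$victim"
tail -f "$victim" > /dev/null &    # stands in for the kafka process
pid=$!
sleep 1
rm "$victim"                       # the accident
# Find the fd still pointing at the deleted file:
fd=$(ls /proc/$pid/fd | while read f; do
       readlink "/proc/$pid/fd/$f" | grep -q 'deleted' && { echo "$f"; break; }
     done)
cp "/proc/$pid/fd/$fd" "$victim"   # copy the live inode back to its old path
kill $pid
cat "$victim"                      # the content is back
```

Note that this copies the contents back to the path; the running process keeps writing to the old inode, so for actively written files a restart is still eventually needed to reattach them.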
The whole operation took about 10 hours on a Sunday night to complete. Every decision was made under a certain amount of stress.
We need better practices. This applies everywhere, no matter how good you are, because mistakes happen in unpredictable ways. The best we can do to avoid them is to put barriers in our own way.
I've revisited some Linux and operating-system fundamentals, and my grasp of them is now stronger than ever.
Kafka rebalancing needs a revisit; borrowing a page from Elasticsearch/Mongo on momentarily stopping shard allocation might help. That said, I haven't touched the Kafka codebase yet, and I'm sure their best engineers are aware of this problem.
The incident made me look at things very differently.