So, here is the case: every few days the Kafka cluster dies, and the whole application dies with it, resulting in revenue loss for the company.
Starting from there, I would like to describe how we approached the unknown issue, through assumptions, failures, and trial and error,
until we found the root cause: a known bug in the version of the famous distributed software we were running.
Of course, all software has bugs, and hitting a major one is not that uncommon. More than that,
what really matters is the lessons learned during the process.
Major takeaway of this talk: tackle your incidents as a way to understand more about your systems (both technical systems such as infra, code, and tools, and non-technical systems such as teams, workflows, procedures, and practices) and to design them better.