Configuration Drift: Tech or Culture?

An opinion on how to identify and handle configuration drift.

Working in DevOps, I often fall into the trap of classifying a problem as either tech or culture. In the forest of principles and industry standards, I firmly believe in these two:

  1. Tools don't solve problems.
  2. Root cause of any problem isn't a human. It simply can't be.

I was analyzing how we ended up with configuration drift in various parts of our infrastructure. With the advent of the cloud, we certainly live in a time when completely immutable infrastructure is possible.

What is configuration drift and why does it happen in immutable infrastructure?

Configuration drift is the gap between the intended and the actual configuration. It largely occurs in mutable infrastructure, where changes are made in place. In theory, config drift shouldn't happen in immutable infrastructure, where we always replace servers rather than modify them. However, despite management tools like Chef, Ansible, or Terraform, solving this problem requires a cultural shift.
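To make the definition concrete, here is a minimal sketch of drift as a diff between intended and actual settings. The function name `detect_drift` and the sample settings are hypothetical, chosen only for illustration; real tools compare far richer state than a flat dictionary.

```python
from typing import Any


def detect_drift(intended: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (intended_value, actual_value)} for every setting that differs.

    A key absent on one side is reported with None on that side, so a setting
    explicitly set to None is treated the same as a missing one.
    """
    drift = {}
    for key in sorted(intended.keys() | actual.keys()):
        want = intended.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: what the code says vs. what the server actually runs.
intended = {"max_connections": 100, "log_level": "info"}
actual = {"max_connections": 500, "log_level": "info", "debug": True}

for key, (want, have) in detect_drift(intended, actual).items():
    print(f"{key}: intended={want!r} actual={have!r}")
```

An empty result means the server matches what was declared; anything else is a change that was made somewhere other than the code.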

Config management is no good if you let users ssh into servers.

In practice, users want to find bugs and fix them as early as possible. They're tempted to speed things up by logging in to the server and running a command, leaving the change in Chef for later. In this scenario, the command certainly fixed the issue. The Chef change left for later is where the cultural aspect comes in. How do we incentivize a developer to fix the problem at the root when production is already stable after a single command? What motivates them?

Tech can solve this problem to an extent by enforcing guidelines, or by giving out a warning that the config files are reset, say, every 4 hours. It boils down to how config management is perceived culturally. It's really annoying for anyone to write code for something that would otherwise have been a README doc. As for infrastructure, changing a firewall rule to fix an issue takes seconds, while incorporating the change in something like Terraform may take longer. Is it justified? When is it and when is it not?
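The "warn and reset" idea can be sketched in a few lines. This is an assumption-laden illustration, not how Chef or any other tool implements it: `enforce_file` is a hypothetical helper, and the 4-hour schedule is assumed to come from an external scheduler such as cron or the config management agent itself.

```python
from pathlib import Path


def enforce_file(path: Path, desired: str) -> bool:
    """Reset `path` to `desired` if it has drifted.

    Returns True when the file was rewritten, False when it already matched.
    """
    current = path.read_text() if path.exists() else None
    if current == desired:
        return False
    # The warning is the "tech" nudge: manual edits do not survive the next run.
    print(f"WARNING: {path} differs from managed content; resetting")
    path.write_text(desired)
    return True
```

A run like this overwrites a hand-edited file on the next cycle, which makes the quick fix temporary by design. Whether people then move the fix into the code or simply reapply it by hand is the cultural half of the problem.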

As we're largely living in a declarative programming world, where we write code for "what to do" rather than "how to do it", the cultural perception of this kind of programming plays a huge role in making systems stable. Technical tools enable us to think in a direction that takes advantage of what the cloud provides.
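As a rough illustration of "what" versus "how", the sketch below declares the desired firewall rules and lets a function work out the steps. The `plan` function and the rule strings are hypothetical and only loosely inspired by how declarative tools present a change plan; they are not any tool's actual API.

```python
def plan(desired: set[str], current: set[str]) -> list[str]:
    """Compute the actions that move `current` to `desired`."""
    actions = [f"add {rule}" for rule in sorted(desired - current)]
    actions += [f"remove {rule}" for rule in sorted(current - desired)]
    return actions


# Hypothetical rules: the "what" lives in `desired`; nobody writes the steps by hand.
desired = {"allow tcp/443", "allow tcp/22"}
current = {"allow tcp/443", "allow tcp/8080"}

for action in plan(desired, current):
    print(action)
```

A rule added by hand on the server shows up here as something to remove, which is exactly why a manual fix that never reaches the code ends up fighting the tool.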

We usually reach a unanimous decision when we're discussing stateless microservices. But databases divide us. Databases are complex and hard to classify as immutable. For a small update to a database, such as a patch that might require only a simple package update, replacing the server with a whole new image isn't justified, even though the same replacement is perceived as a harmless operation for applications. However, sometimes it does make sense to replace an instance.

Either way, configuration drift is a problem. We solve it with a hybrid of tech and culture. Culture isn't the same in any two companies or time periods. It's influenced by several factors, and hence the solution you applied at firm A won't necessarily work at firm B.