This is a technical post on how we can manage our infrastructure better using Chef. It is part of a series where I talk about some of the new technologies I use in a large-scale cloud application.
Chef is an open source infrastructure management tool. Managing your infrastructure responsibly is a huge part of making your build environment reliable, fast and honest. That last point, honest, might have thrown you. Honesty is an important part of any endeavour, but it is crucial when we talk about a build environment. Having your infrastructure report reliable information is key to success, and mandatory when debugging spurious errors. You never want to be in the situation where you’re unsure if a code change or an environment configuration is at fault for breaking a build. After all, if we can’t trust our infrastructure, we’re already goosed!
As part of an earlier investigation into the suitability of the many available infrastructure management tools, we set up a Chef server: a centrally operated service that holds the desired configuration of every machine we manage. It even lets us version control which applications are installed on a machine!
“Chef turns infrastructure into code. With Chef, you can automate how you build, deploy, and manage your infrastructure. Your infrastructure becomes as version-able, testable, and repeatable as application code.” – https://www.chef.io/
Chef sits between the application layer and the hardware itself. We start with “blank” Windows or Linux machines, with only some fundamental software installed by default, and use Chef to set up users, credentials, and a standard set of software prerequisites, and to configure every machine the same way. Within days, we had come to understand the benefits of having such a useful tool in our inventory. When debugging a problem while developing an application, it is hugely valuable to be able to say “we know it’s not a problem with the environment”. That can save days of playing Snakes and Ladders with a builder, trying to track down the nuances of how that particular environment was configured. This is the “work of art” server, and it is nonsense: works of art are hard to reproduce, and even to the trained eye it can be difficult to spot the differences between one and another. With Chef, we can say with certainty that anything running on our infrastructure will behave the same way, regardless of whether it runs on virtualmachine381 or virtualmachine384. That being said, the best way to describe the benefits of Chef is to recall a narrowly avoided infrastructure nightmare.
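To give a flavour of what “capturing the environment in code” looks like, here is a minimal sketch of a Chef recipe. The user name, home directory, and package are illustrative placeholders, not our real cookbook:

```ruby
# Hypothetical recipe sketch: declares a shared build account and a
# baseline tool so every node converges to the same state.
user 'builder' do
  comment 'Shared build account'
  home    '/home/builder'
  shell   '/bin/bash'
  action  :create
end

# Ensure a baseline prerequisite is present on every node.
package 'git' do
  action :install
end
```

Because the recipe is just code, it lives in version control alongside everything else, and every chef-client run converges each node back to this declared state.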
A few weeks ago, one of the team spotted that one of the passwords was hours away from expiry on all of our machines! This would have led to business-stopping downtime on all of our servers. At the time, we were managing about 10 production machines and roughly the same number of test or personal machines, putting the total at about 20. Changing the passwords on 20 machines pre-Chef would have meant several people downing tools for a day and, on each machine, logging on, entering the old password, entering the new password twice, logging out, and then logging back in to make sure it all worked. This is a highly error-prone series of steps. Despite there being only a handful of them, it’s easy for something to go wrong (like forgetting exactly which builders are currently in use in the farm). Because the chance of error is so high, it also takes an unknowable amount of time, which is unfortunate when the task is time sensitive! In practice, with all going well, it would take about 10 minutes per machine. Performing the exact same steps on twenty machines turns a ten-minute job into a full day of work (with the potential to be locked out of our machines, or to lock an account, or even both!). Sadness.
In the Chef world, it’s easy to cook up a new password. Simply add the new password to what’s known as a data bag (an encrypted file for sensitive information like passwords) and send an update request to each node in the infrastructure. That’s exactly what we did: we made a new, secure password and cheffed it onto each of our nodes. Crisis averted! Time taken: 20 minutes total. In the future, we anticipate managing a much more diverse set of minions on our build farm, and manually performing a password change is not something we intend to do more than once; duplicating that effort is wasteful. This is a scalable solution.
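A recipe consuming that data bag might look something like the sketch below. The bag and item names (`credentials`, `build_user`) are illustrative, and this assumes the node holds the shared secret needed to decrypt the encrypted data bag:

```ruby
# Hedged sketch: load the (encrypted) data bag item holding the
# credentials. Chef decrypts it transparently when the node has the
# encrypted_data_bag_secret configured.
creds = data_bag_item('credentials', 'build_user')

# Apply the new password to the existing account.
user 'builder' do
  password creds['password_hash'] # a pre-hashed shadow entry, never plaintext
  action   :manage                # modify the account only if it already exists
end
```

Once the data bag item is updated on the Chef server, the next chef-client run on each node picks up the new credential, which is why the whole fleet took about 20 minutes rather than a full day.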
We don’t just manage passwords with Chef; we make sure all our infrastructure is made of the same stuff. This is part of a gradual shift away from infrastructure as a work of art, towards a repeatable, scalable, familiar compute grid. The more of our stack we can capture in code, the less time we will need to spend tracking down esoteric bugs in our infrastructure. The picture below details how we (currently) ensure all of our infrastructure is using the correct version of IBM Java. The same mechanism can push a new version out to every node in our network, if we like.
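In recipe form, pinning a runtime version across the fleet can be sketched roughly as follows; the package name and attribute path are placeholders, not our real artifact names:

```ruby
# Sketch: install one specific runtime version everywhere. The desired
# version is set once, in a role or environment attribute, rather than
# per machine.
package 'ibm-java-sdk' do
  version node['build_farm']['java_version'] # e.g. defined in the environment
  action  :install
end
```

Bumping that single attribute and letting chef-client converge is what turns “upgrade Java on every builder” from a manual campaign into one small change.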
“But configuration issues don’t happen that often”. They really, really do! A configuration issue as simple as one path component having the wrong case can cause a builder to topple over. We recently experienced a strange issue where time was being incorrectly reported, say 14:03 instead of 14:07, because the NTP time servers were configured differently on two of our machines than on the others (we fixed this with Chef!). While that’s a relatively trivial example, the complexity grows quickly as soon as the build farm grows from 2-3 builders to 15-20! Managing these configurations by hand is a nightmare, as it’s far too easy to forget the exact process and miss a step. This becomes especially important when builds are distributed across different machines, or when a cluster is in a don’t-care configuration (when the code doesn’t care which builder it’s on, only that it’s on a builder, and that it’s building!). That is the goal of any mature release engineering team, and many obstacles must be overcome to reach it!
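The NTP fix is a good example of how small the recipe for a whole class of drift can be. A minimal sketch (template name, service name, and attribute path are illustrative assumptions):

```ruby
# Sketch: render one ntp.conf from a shared attribute so every node
# talks to the same time servers, then keep the service running.
template '/etc/ntp.conf' do
  source   'ntp.conf.erb'
  variables(ntp_servers: node['build_farm']['ntp_servers'])
  notifies :restart, 'service[ntp]'
end

service 'ntp' do
  action [:enable, :start]
end
```

With this in place, a node whose time servers have drifted from the fleet-wide list is simply corrected on its next chef-client run.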
Our build infrastructure team is committed to producing reliable, repeatable, high-quality services so that we can all work more effectively. Part of that is achieved by employing the right tools, like Chef, to help manage the end-to-end delivery process. We don’t believe in a server being a work of art. In the event of a catastrophe, each piece of our infrastructure must be reproducible in a well-known and well-understood amount of time.
Fewer works of art, higher quality infrastructure.