The Phoenix Project and the start of the DevOps movement

The first book I listened to in 2019 was "The Phoenix Project", by Gene Kim, Kevin Behr, and George Spafford. It is a book about good software development practices, but written as a novel rather than as a non-fiction advice book. I think that format helped keep it entertaining.

The book tells the story of Bill Palmer, a middle manager who is promoted to head of his company's IT department and has to pull the company out of a slump. The company sells auto parts, and relies on software for its in-store inventory and purchasing systems and for its online sales. The issues its software teams face are familiar to many software engineers: delayed releases, releases causing unexpected breakages, unplanned work to fix outages that eats into development time, and so on.

The protagonist meets Eric, a board member of the company who becomes his mentor and keeps pointing out analogies between work on a factory floor and IT / software development work. On a factory floor, people know how much time each task takes and organize the flow of work to keep throughput high. The slowest component is the bottleneck, and optimizing work anywhere in the pipeline besides the bottleneck is wasted effort: if you optimize before the bottleneck, inventory piles up in front of it, because it can't keep up; if you optimize after the bottleneck, the downstream stations just sit idle waiting for work, since the bottleneck's throughput has stayed the same. Similarly, in a software organization, to complete projects on time you need to improve the slowest part of the process. For example, if you have one go-to person X who knows the system inside and out, people keep going to X to fix their problems, and X becomes the bottleneck. Worse, X's knowledge keeps growing as they solve all the issues, so the problem compounds. The solution is to document knowledge about the system and have X train more people to become experts, so that not everything needs to go through X.
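
To make the bottleneck argument concrete, here is a minimal sketch in Python. The stage names and rates are made up for illustration; the point is just that the steady-state throughput of a serial pipeline equals that of its slowest stage, so speeding up any other stage changes nothing.

```python
# Toy model of a serial pipeline. The numbers and stage names are
# hypothetical, chosen only to illustrate the bottleneck argument.

def pipeline_throughput(rates_per_hour):
    """Steady-state throughput of a serial pipeline is its slowest stage's rate."""
    return min(rates_per_hour)

stages = {"develop": 10, "test": 4, "deploy": 8}  # units/hour; "test" is the bottleneck
print(pipeline_throughput(stages.values()))       # -> 4

# Doubling a non-bottleneck stage is wasted effort:
print(pipeline_throughput({**stages, "deploy": 16}.values()))  # still 4

# Only improving the bottleneck raises overall throughput:
print(pipeline_throughput({**stages, "test": 8}.values()))     # -> 8
```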

In a factory, they also try to minimize unplanned work. By definition, unplanned work takes time away from planned work, so it pushes goals back. To minimize unplanned work, you need robustness in the system, so that outages are rare, and when they do happen they get resolved quickly. To increase robustness, you need predictability, not surprises.

In many companies, software development typically works like this: a development department produces the code, the code then goes to a separate department for testing and quality control, and finally it gets deployed by the IT department. The release cycle is slow (on the order of months) and has many problems:

  • Code is deployed a long time after it is written, and problems are discovered late.
  • When problems are discovered late, it is hard even to bisect and find the root cause of an issue. Also, the code is no longer fresh in the programmer's mind, so fixing the problem takes more time.
  • Developers work in an environment that typically differs from the production environment in undocumented ways, which increases the probability of issues at deployment.
  • Developers don't deal with testing and deployment, so they have a poor understanding of what makes solid code; they think only about the features, not the quality.
  • Long release cycles make it hard for the company to respond to the market, and by the time a feature is launched it may no longer be needed.

The solution proposed for this is to abandon these very large and slow transfers of work from one department to another. Instead, the people who write, test, and deploy the code collaborate closely; they can even be the same people doing all three. Code moves from the source tree to production in small and quick increments, releasing every day, even multiple times a day. To do this, an organization needs automated testing and deployment, so that releasing is a push-button process, as in the sketch below.
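
As an illustration, here is a minimal sketch of such a push-button release in Python. The test command, deploy script, and environment flag are all hypothetical; a real setup would run these steps in a CI system rather than a local script.

```python
#!/usr/bin/env python3
"""Minimal push-button release sketch: run the automated tests,
and deploy only if they pass. All commands here are hypothetical."""

import subprocess
import sys

def run(cmd):
    """Run a command, echoing it first; return its exit code."""
    print("$", " ".join(cmd))
    return subprocess.run(cmd).returncode

def main():
    # Gate the release on the automated test suite.
    if run(["pytest", "tests/"]) != 0:
        sys.exit("Tests failed; aborting the release.")
    # Hypothetical deploy step: could push containers, update configs, etc.
    if run(["./deploy.sh", "--env", "production"]) != 0:
        sys.exit("Deploy failed.")
    print("Release complete.")

if __name__ == "__main__":
    main()
```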

The theory is that this process increases not only deployment speed but also robustness, because small deploys make it easy to find and fix problems when they arise: the fewer changes a deploy contains, the fewer places a new bug can hide (see the back-of-the-envelope calculation below). In addition, developers get quick feedback about the impact of their code, and a quick feedback loop improves learning. Waiting months for code to be deployed is bad for learning.
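
A back-of-the-envelope way to see the debugging benefit, using made-up commit counts: isolating a bad commit by binary search (e.g., with git bisect) takes about log2(n) build-and-test steps over n candidate commits, so a small deploy narrows the search dramatically.

```python
import math

# Hypothetical commit counts: a small daily deploy vs. a quarterly release.
for n_commits in (5, 500):
    steps = math.ceil(math.log2(n_commits))
    print(f"{n_commits} candidate commits -> about {steps} bisect steps")
# 5 candidate commits -> about 3 bisect steps
# 500 candidate commits -> about 9 bisect steps
```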

I'm leaving out a lot of details here, but this is the gist. (The book also talks about Lean manufacturing, Kanban boards, and other methodologies and buzzwords.) The book was written in 2013, and was an early proponent of what is now called DevOps. DevOps is the movement that advocates for a single team that both writes the code (Dev) and deploys it (Ops), along with continuous testing and integration and quick release cycles. It is often traced back to a talk at the 2009 Velocity conference, "10+ Deploys per Day", by John Allspaw and Paul Hammond. The presenters talked about software development at Flickr, but apparently other internet companies, including Google and Facebook, were using similar methods at the time. Still, the vast majority of software was written using the old-school, slow-release approach, and the talk made a huge impression. A DevOps conference, DevOpsDays, soon followed, and DevOps practices have been gaining momentum since.

Personally, I agree with the basic recommendations of DevOps. At Google, we use automated testing and quick releases, and both help enormously. In addition, measuring how long tasks take, finding the bottleneck, and keeping decent estimates of completion time all seem like good advice. It is a common problem in software development that time estimates are way off, so any quantitative methods that help us become more methodical are welcome.

