Recent Readings

On MicroSD Problems – The investigation of a failing batch of MicroSD cards leads to an amazing story of detective work that delves into the world of semiconductor manufacturing, gray markets, and failure rates.

CloudClimate CDN Speed Test – A clever use of XMLHttpRequest to time HTTP downloads of small files (64KB) to your machine from the leading CDNs and cloud providers. I’m a sucker for the pretty graphs the tool creates with the data, but beyond that I can see how this tool is useful for people evaluating CDN/cloud choices by geographic location.

Drizzle – “An Open Source Microkernel DBMS for High Performance Scale-Out Applications” are all words I know and put together in that order sound interesting. Has anyone played with this yet?

NCSA Mosaic – Now you can run Mosaic on your hexacore i7 box; the fastest AJAX is the kind that doesn’t even happen!

The Panic Status Board – I recently learned the term “information radiator” and this is a perfect example of the concept. A simple, striking visualization for what is most important to Panic for the operation of their business. It’s a network operations center for your entire business. It’s hard to see how a single board would work for a large organization, but I’d love to build one for the group I’m in at work.

DevOps, SecOps, DBAOps, NetOps – A discussion of the problem of silos inside operations organizations, and how it is important to focus on the relationships between those groups as well as relationships with people outside of Ops. As I see it, all of the *Ops initiatives are attempts to fix the brokenness in communication that traditional software shop organizational charts create; managers and up need to realize the cost in agility that comes with creating silos. On the other hand, there is a clear benefit to specialization and building service groups around specific disciplines once a company gets to a certain size. I don’t have a good solution to this problem but spend a lot of time thinking about it… however, I do know it pays to meet the people you are working with face to face, have a beer and understand what drives those groups to make the decisions they do. I sometimes wonder if doing “embedded engineering” is the right approach, with engineers from all of the silos sitting together for the duration of a cross-functional project. If anyone has any thoughts on this I’d love to hear them.

Performance Testing An Airline Reservation System

Until a few weeks ago I ran the performance and capacity testing team for the airline reservation system ITA develops. The group is under the umbrella of operations, which may seem out of place to many software shops, where typically the performance testing team exists in QA (or doesn’t exist at all until needed). We work very closely with development and QA as needed (and often, development has a dedicated set of engineers on performance work), and after doing performance work for the past few years, I’m convinced the best people for the job are the people who are skilled in both development and systems administration (these are the DevOps people everyone is talking about). We’ve developed a lot of processes and tools to do our job and I think other people might find these ideas as useful as we have.

Testing Tools

At ITA we had to build many of the performance tools we use in-house because performance tools that could speak the airline industry protocols used by many interfaces to a reservations system (MATIP, for example) don’t exist. We also have a set of custom XML interfaces as well as a large collection of other interfaces that we need to send traffic to, or read instrumentation from. Our initial load generation script not only generated this traffic but also took care of all the other functions required to run an experiment, but this monolithic script didn’t scale. We ended up breaking up that script into agents that can be distributed across many machines, with each agent performing a single function needed for a load test. The agents are run by a master scheduling script which co-ordinates agent start and stop. In this way we can be sure that instrumentation requests aren’t blocking the load generation tools from working, and we can also schedule periodic events, report status, and do the hundred other things required for a full-system load test.
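A minimal sketch of the master/agent split described above, assuming Python threads stand in for agents that would really run on separate machines. The names Agent and run_experiment are hypothetical and are not ITA’s actual tools; the point is only that each agent does one job and the master controls start and stop.

```python
import threading
import time


class Agent(threading.Thread):
    """One agent performs a single function (load, instrumentation, ...)."""

    def __init__(self, name, work, interval):
        super().__init__(name=name, daemon=True)
        self.work = work            # callable run once per interval
        self.interval = interval    # seconds to wait between calls
        self.stop_event = threading.Event()
        self.calls = 0

    def run(self):
        # Loop until the master asks this agent to stop.
        while not self.stop_event.is_set():
            self.work()
            self.calls += 1
            self.stop_event.wait(self.interval)

    def stop(self):
        self.stop_event.set()


def run_experiment(agents, duration):
    """Master scheduler: start every agent, wait, then stop them all."""
    for agent in agents:
        agent.start()
    time.sleep(duration)
    for agent in agents:
        agent.stop()
    for agent in agents:
        agent.join()
    return {agent.name: agent.calls for agent in agents}
```

Because each agent has its own loop, a slow instrumentation poll delays only that agent and the load generator keeps its own pace.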

We gather a lot of metrics during a test, and for every major performance test we automatically generate a dashboard to help us drill into the results, a subset of which looks like this:



We gather this data from the system via SNMP, munin, per-component instrumentation, and other monitoring tools. We’ve been very happy with munin in particular as you can quickly add support for gathering new data types from remote hosts by writing simple Perl scripts.
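To give a feel for how small a munin plugin can be, here is a sketch written in Python rather than Perl (munin runs any executable as a plugin). It follows the plugin convention of printing graph settings when called with the argument config and printing "field.value number" lines otherwise. The metric, the field name queue and the read_queue_depth function are made up for illustration.

```python
import sys


def config_lines():
    """Lines munin expects when the plugin is called with 'config'."""
    return [
        "graph_title Reservation queue depth",   # hypothetical metric
        "graph_vlabel requests",
        "graph_category reservations",
        "queue.label queued requests",
    ]


def read_queue_depth():
    # Placeholder: a real plugin would query the component here.
    return 42


def value_lines(depth):
    """Lines munin expects on a normal fetch: '<field>.value <number>'."""
    return ["queue.value %d" % depth]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = value_lines(read_queue_depth())
    return "\n".join(lines)


if __name__ == "__main__":
    print(main(sys.argv))
```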

Continuous Automated Testing

In any large system I’ve worked on, the hardest problems are the integration problems, and a complex multi-component system such as a reservation system has these in spades. When we started doing performance testing, most of the system components weren’t finished and the interfaces between components kept changing. Furthermore, airline schedules, inventory and availability change rapidly over time.

There are countless factors that play into the performance and scalability of a complex system, and there are many philosophies around testing such systems, but in this post I want to discuss the technique that saves us the most time and money: continuous automated performance testing.

As discussed in the groundbreaking article Continuous Integration & Deployment In The Airline Industry [note: article not groundbreaking], ITA uses Hudson to build and test a complete reservation system on each check-in to the source tree (provided a build is not in progress). Hudson deploys the built software to a cluster of machines that are dedicated to continuous performance testing. After deployment, the load test master control software I discussed earlier runs a fixed scenario of load against the newly-deployed software. After a run completes, we store all of the results and instrumentation data in a database and update the graphs which trend test results over time. If our scripts find too much deviation in run time or throughput between this run and the previous runs, we set a status code so that Hudson can tell the people who’ve checked in since the last run that they may have broken the build.
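The deviation check at the end of a run can be sketched roughly as follows. This is an illustration, not our actual script: the median baseline, the 10% tolerance and the function names are all assumptions. The only part that matters to the build server is the exit status, since a non-zero status from a shell build step marks the build as failed.

```python
import statistics


def check_run(previous, current, tolerance=0.10):
    """Return True if `current` is within `tolerance` of the baseline.

    The baseline is the median of the previous runs; 10% is an
    illustrative threshold, not a recommendation.
    """
    if not previous:
        return True  # nothing to compare against yet
    baseline = statistics.median(previous)
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / baseline <= tolerance


def exit_code(previous_throughput, current_throughput):
    # Zero means the run looks normal; one tells the build server to
    # flag the build for everyone who checked in since the last run.
    return 0 if check_run(previous_throughput, current_throughput) else 1
```

Using a median of several earlier runs rather than only the last one keeps a single noisy run from moving the baseline very far.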

Having a visual representation of performance issues in the continuous test environment has helped us tremendously because it both shortens the debug time and lets us see patterns of performance over time. Here’s an example of our throughput graph for a single component when someone breaks the build:

Along the X axis are revision numbers, and on our system the graph will show you the commit messages and the usernames of everyone who committed for each revision when you mouse over the data points.  We also make the graph very user-friendly with a “green lines are good, red lines are bad” design. Clicking on a data point will bring you to our internal source code repository browser.

Throughput, which is shown in the above graph, is only one side of the story. What about the run time of the system during the issue with revision 346626?

The multiple trend lines in this graph represent the timings reported by each instrumentation layer in this component. In the case above the graph is saying that the issue is not with CPU time consumed by the component (that trend is flat), but is instead with time spent in the database. This helps us quickly narrow down where to start looking for the cause of the performance problem. In this example, the developer fixed the issue quickly because the developer had notification of the failed test within an hour of check-in and had all the tools and data needed to isolate and resolve the problem.
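The same narrowing-down can be done in code by comparing per-layer timings between two runs. This is a sketch under assumptions: the layer names, the numbers in the usage note and the 10% tolerance are invented for illustration.

```python
def regressed_layers(before, after, tolerance=0.10):
    """Return layers whose time grew by more than `tolerance`, worst first.

    `before` and `after` map an instrumentation layer name to the time
    it reported (any unit, as long as both runs use the same one).
    """
    growth = {}
    for layer, old in before.items():
        new = after.get(layer, old)
        if old > 0 and (new - old) / old > tolerance:
            growth[layer] = (new - old) / old
    return sorted(growth, key=growth.get, reverse=True)
```

With before = {"cpu": 120.0, "database": 80.0} and after = {"cpu": 121.0, "database": 240.0}, only the database layer is reported, which matches the flat CPU trend and rising database trend described above.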

At ITA we have environments we use to run large-scale performance tests, but the setup, execution and analysis for such tests are very expensive in terms of computers (many hundreds) and people (tens for what may be a few weeks for a single test). Those resources aren’t cheap, and catching a single bug with automated performance testing saves us more than the cost of the computers and people we invested in building this system. We routinely see 2-3 performance regressions in a month.

It doesn’t take many computing resources to build a system like the one I’ve described. Here are some tips for doing this yourself:

  • Use real machines, as virtual machines suffer from the other guests on the same machine
  • Define a fixed workload you can replay via your load generation tool as this lets you establish a baseline to trend and alert from
  • Make sure your workload represents the majority of the types of load you’d see in production
  • Start simple and add metrics and instrumentation as you need them, not before
  • Don’t worry about fancy presentation of the results – it is more important that you start getting results
  • Publicize your testing system widely once it is up and running to help spread a philosophy of continuous testing in your organization

If you’ve got any questions I’d be happy to answer them in the comments and would love to hear about any systems like this that other people have built.

Recent Readings

  • How MySpace Tested Their Live Site With 1 Million Concurrent Users – Until recently at ITA I ran the reservation performance testing group in operations and can appreciate how hard it is to do good performance testing, and the scale of this experiment is awesome. The article is light on details but the comment by Todd Hoff makes this worth a read.
  • 20 DevOps Guys You Should Follow – Smart people who blog about operations & development hanging out together.
  • What Is DevOps? – Another “What is DevOps?” post, but you should read it because it is by Damon Edwards and includes this image:


This pretty much sums it up. (via Damon Edwards)

  • Who Owns The Application – Collaborate and communicate.
  • A Few Billion Lines Of Code Later – Excellent article about the evolution of Coverity’s static code analysis tool from a research project to a real product. I think this article does a good job of illustrating that what your customer wants and needs is almost never what you expect. Everyone who has been in a startup will identify with the problems Coverity faced (and is probably still facing).

Continuous Integration & Deployment In The Airline Industry

Jim Bird had interesting things to say about continuous deployment in a recent blog post on his site, Building Real Software. Jim concluded a blog entry that is otherwise full of useful insights with these dismissive paragraphs:

It’s bad enough to build insecure software out of ignorance. But by following continuous deployment, you are consciously choosing to push out software before it is ready, before you have done even the minimum to make sure it is safe. You are putting business agility and cost savings ahead of protecting the integrity or privacy of customer data.

Continuous deployment sounds cool. In a world where safety and reliability and privacy and security aren’t important, it would be fun to try. But like a lot of other developers, I live in the real world. And I need to build real software.

I commented on Jim’s blog that I work on building airline reservation systems at ITA Software and we try to do as much continuous deployment and continuous integration as possible. We are absolutely far from perfect in what we do, but accepting that is the first step to accepting the evolutionary model of software operations.

I think the use of continuous integration/deployment (CI/CD) is orthogonal to issues around privacy, security and safety; if you don’t care about privacy, security and safety then you’re writing bad software, whether you choose to do CI/CD or not.

The reservation system ITA has built is a large, mission critical, multi-component, distributed, high-throughput transactional system. We run our software on Linux on commodity hardware, and the components are written in a variety of languages (Python, Java, C/C++, PL/SQL and LISP). Each component has to be highly available. The software needs to be secure; we process credit cards, flight information and sensitive passenger information. We don’t implement the systems that measure fuel or balance the plane, but as with any part of the airline industry, safety is very important.

So how could we possibly continuously deploy or integrate this software? We deploy an entire reservation system to our development environment at least three times a week. We run an automated set of integration tests against this complex system to verify a deployment. We build and package each component of the software automatically on every check-in to our source tree and automatically run a set of tests against this software. We build controls around privacy, security and safety throughout this system.

We trigger our build/package/deploy cycle using Hudson and custom scripts. The build process is unique per component but generally follows industry standard practices per language or technology, and the packaging is done with RPM. The interesting part, and the part that makes CI and CD work for us, is that we’ve built software and processes to represent the reservation system as a whole. We package manifests that represent, in Python’s Coil, the dependency matrix of the components and services that make up a working reservation system. The coil in the manifest file details all of the software RPMs, component configurations, service validation scripts to be run, monitoring configurations and more. Manifests themselves are revision controlled, and each manifest has an ID that is all that is needed to start a deployment. If we chose to, we could have a manifest built and deployed on every check in to our source tree (this isn’t feasible due to human and computer resource limitations, but is technically possible). Manifests can be promoted throughout the other environments as needed, so we can move from the automatically deployed and tested environments to customer facing or testing environments that may need to be static for long periods of time.
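To make the idea of a manifest concrete, here is a sketch in plain Python. It is not Coil syntax and not our real schema: the class names, fields, component names and RPM versions are hypothetical. It shows the two properties that matter: one ID identifies the whole system, and the dependency information is enough to derive a deployment order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    name: str
    rpm: str                      # package name-version to install
    depends_on: tuple = ()        # names of components that must be up first
    validation_script: str = ""   # run after install to verify the service


@dataclass
class Manifest:
    manifest_id: str
    components: dict = field(default_factory=dict)

    def add(self, component):
        self.components[component.name] = component

    def deploy_order(self):
        """Order components so dependencies are deployed first."""
        ordered, seen = [], set()

        def visit(name, path=()):
            if name in path:
                raise ValueError("dependency cycle at " + name)
            if name in seen:
                return
            for dep in self.components[name].depends_on:
                visit(dep, path + (name,))
            seen.add(name)
            ordered.append(name)

        for name in sorted(self.components):
            visit(name)
        return ordered
```

A manifest holding booking (which depends on database and queue), queue (which depends on database) and database would deploy database first, then queue, then booking.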

Our deployment framework can automatically control the state of our monitoring. The framework will suppress monitoring during deploys, check monitor states any time during a deployment, and enable monitoring at the end of the deployment. The framework also ties in to our ticketing system by automatically opening a ticket for every deploy and documenting deploy state in the ticket. If a deployment fails, we can track the resolution directly in the ticket that the tools opened for the deploy. The deployment framework automatically resolves the ticket it opened after a successful deploy.
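The monitoring and ticketing behavior can be sketched as a wrapper around the deploy itself. Everything here is a stand-in: FakeMonitoring and FakeTickets are in-memory placeholders for whatever monitoring and ticketing APIs you actually have, and tracked_deploy is a hypothetical name.

```python
from contextlib import contextmanager


class FakeMonitoring:
    """Stand-in for a real monitoring API (hypothetical)."""

    def __init__(self):
        self.enabled = True

    def suppress(self):
        self.enabled = False

    def enable(self):
        self.enabled = True


class FakeTickets:
    """Stand-in for a real ticketing API (hypothetical)."""

    def __init__(self):
        self.tickets = []

    def open(self, summary):
        self.tickets.append({"summary": summary, "state": "open", "notes": []})
        return len(self.tickets) - 1

    def note(self, ticket_id, text):
        self.tickets[ticket_id]["notes"].append(text)

    def resolve(self, ticket_id):
        self.tickets[ticket_id]["state"] = "resolved"


@contextmanager
def tracked_deploy(manifest_id, monitoring, tickets):
    ticket_id = tickets.open("Deploy manifest " + manifest_id)
    monitoring.suppress()
    try:
        yield ticket_id
    except Exception as exc:
        # Leave the ticket open so the failure is resolved there.
        tickets.note(ticket_id, "deploy failed: %s" % exc)
        raise
    else:
        tickets.note(ticket_id, "deploy succeeded")
        tickets.resolve(ticket_id)
    finally:
        monitoring.enable()
```

In this sketch monitoring is turned back on even when the deploy fails. That is a design choice for the example, on the theory that a half-deployed system is exactly when you want alerts.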

We also use service command and control software that we’ve built in house (similar to ControlTier) to make sure the services are in the correct state. We wrote our own service management framework because at the time we started this project there wasn’t existing software that met our particular needs; now there are many excellent solutions.  Our deployment framework, which is driven by the manifest described above, has the ability to work with our service management framework so we can verify the state of our components as part of our deployment.

One of the differences between our CI/CD process and the process at Flickr or Facebook is that our customers, both internal and external, want predictable change and often dictate our release cycles. Perhaps this is what Jim means by CI/CD putting customers at risk, because some customers don’t want continuous updates to their software. Despite this, we still do CI/CD internally at ITA because failing a customer deploy can mean an airplane doesn’t fly. I’m not interested in learning how to deploy a reservation system the day of a production deployment with those kinds of stakes.

The big advantage of automating our deployments as much as possible and doing as many deploys as possible is the same in the airline industry as it is at any company: we deploy a lot so we know our deploys work. Continuous deployment is nothing more than another step in assuring that you are minimizing errors throughout your service. Not doing CI/CD is like not doing QA.

I’ve got more stories about the successes (and many, many struggles) of CI/CD at ITA and they’ve been kind enough to give me permission to post some of the stories here (we do some really cool things in performance testing that I’m excited to write about), so please check back often for more posts about CI/CD at ITA.

DevOps Documentation

In a previous entry I wrote that the key to removing the wall between developers and operations is communication. I wrote about becoming involved in your co-workers’ meetings and understanding your co-workers’ needs, and this is of course very important, but sometimes it is best to have things written down, because your co-workers aren’t always around to answer your questions (ops people who have tried calling developers at 4am know this all too well). Good documentation is often touted as the answer to all of life’s problems, but we never make time for it. I’ve found a few ways to reduce the barriers to creating documentation, which I discuss below.

First, documentation needs to be revision controlled and live with the code, as doing so will reduce the cost for the developers to create documentation and help keep track of changes to documentation. This also means you don’t use binary formats for your documentation--no OpenOffice Doc, no MS Word, no format that you can’t usefully store in git, svn, or whichever revision control system your company uses. You must be able to diff revisions. Documentation stored primarily in plain text is ideal because plain text is a universal format; plain text can be emailed to anyone, edited by anyone, read by anyone and easily manipulated with tools readily available to both developers and operations. My current choice for documentation is reStructuredText, which you already know how to write even if you haven’t seen it before. reST is a plain text markup format that is part of the Docutils project and focuses on simplicity and clarity of meaning.

Furthermore, documentation that will be used by the developers and ops people should live with or near the code that the documentation is for so that the docs are easily accessible to both groups. Documentation stored in revision control also means that you can set up commit hooks that trigger alerts when the documentation changes. If you use a web-based system for browsing source code that system probably offers an RSS feed that can notify you of changes to files, so you can see when your documentation has changed without leaving your RSS client. The automation possibilities are endless–if a specific source file got updated but the docs didn’t? Send an email to the person who made the commit asking why.
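The "source changed but docs didn’t" check is only a few lines once a hook hands you the list of files in a commit. A sketch, assuming you maintain a mapping from source files to the documents that describe them; the function names and paths are hypothetical, and sending the email is left to whatever your hook environment provides.

```python
def docs_out_of_date(changed_files, doc_for):
    """Return source files changed in a commit without their doc file.

    `doc_for` maps a source path to the documentation path that
    describes it; both the mapping and the paths are hypothetical.
    """
    changed = set(changed_files)
    return sorted(
        src for src, doc in doc_for.items()
        if src in changed and doc not in changed
    )


def reminder(author, stale):
    """Build the text of the email a hook could send."""
    lines = ["Hi %s," % author, "", "These files changed without a docs update:"]
    lines += ["  " + path for path in stale]
    return "\n".join(lines)
```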

Keeping the documentation within the source tree has the advantage of the documentation always being checked out with the code, and thus the most recent revision is available to all the developers, and the ops people keep up with the source tree as they work with the documentation.

The DevOps Dialogue Document

One of the first documents that an ops person looking to bridge the development divide should create is a dialogue document. This is a document that the developer and the ops person work on together that tracks all of the information needed to handle the care and feeding of the software. What should this document look like?

Administrative Details

  • A description of the software role and purpose.
  • A contact in development, a contact in ops, and a contact in QA. These are the people that can be bugged about issues with the software, and these are the technical, not managerial, contacts.

Operating Requirements

  • Disk space required for installation, disk space used in normal operations, disk space used by logs, disk space used by data, and estimates of growth rates.
  • Memory requirements.
  • Network requirements, including ports the software listens on, connections the software makes to other services, protocols used (at all layers), if SSL is required, and so on.
  • OS requirements, such as platform, distribution, release of distribution, kernel versions, patch versions, word size (32 or 64 bit), other unique OS requirements.
  • Environment requirements, such as libraries required, variables that need to be set, shells that are required, users and groups that need to be on the system, and more.
  • Details on how to start and stop the software.

Configuration Details

  • Details on how the software is configured; is it via a config file? Is the configuration in a database? Is there a way to change the running configuration on the fly?

Monitoring & Debugging

  • How does the software get monitored? Is there a way to have the software report status? What about performance monitoring?
  • What process do the developers use to debug issues they encounter in the software? For example, is there a mechanism to get a stack trace, such as Java’s handling of SIGQUIT?
  • How does the software respond to common error situations like out of memory or out of disk space, or being unable to bind to a port? What about handling of bad input?

Backup & Restore

  • How does the software store state, and can you easily backup and restore that state?
  • Can you do a backup while the software is running and get a restorable backup? How can this be tested?

Security

  • What privileges are needed to run the software?
  • How is input validated?
  • How are any authentication tokens (passwords, certificates, etc) stored?

There is a lot more that should go in each section and many more sections that could be added; the above is an example that you can use to start your document.

At the last Boston DevOps Meetup, @hercynium mentioned that he uses the FCAPS model to figure out what he needs to know about software he is going to deploy and manage. I hadn’t heard about this model but after some quick research it looks like an excellent reference for helping you build your dialogue document.  From the Wikipedia entry:

FCAPS is an acronym for Fault, Configuration, Accounting, Performance, Security, the management categories into which the ISO model defines network management tasks. In non-billing organizations Accounting is sometimes replaced with Administration.

These topics map well to the items that developers and ops people both need to understand about the software they are working on together.

No software will have the correct “answers” for the dialogue questions, and that’s not the point. You’re trying to start the dialogue that gets everyone thinking about software outside of their silo. I’ve written the questions above from the perspective of an ops person, but the developers should add their own set of questions, maybe a section on what the production environment is like and how software will be pushed to production. Since you are storing this document in plain text alongside your source code under revision control, you can easily check in new answers or questions and have those changes be seen by the developers on their next update.

Developers and operations people lament the perceived lack of documentation and everyone agrees we need more. The purpose of the dialogue document is to create the documentation that is needed in the lowest cost way, so don’t try to make the perfect document from the start. Embrace the iterative and evolutionary nature of Agile and grow the document as your understanding of the software grows.

Recent Readings

Some interesting articles & tools:

Getting Started With DevOps

DevOps has been defined in this article by Stephen Nelson-Smith, and the executive summary is that operations and development should no longer be separate functions (and never should have been) and need to start working closely together.

Why? Without working together, failures inevitably occur. For example, at the last Boston DevOps Meetup, one of the attendees, a developer, was commenting on the disconnect between him and his sysadmin and how their relationship was unlike the devops model.  

“That all sounds nice for you guys, but my sysadmin at work doesn’t seem to care about any of this. He’s not engaged.” 

The developer went on to give examples of times when the production servers broke the code because the production servers were configured incorrectly, or the ops person didn’t assist in debugging a problem because the ops person felt the problem was the developer’s to deal with.

We all talked about this for a while, when I realized that in another bar in another town, that developer’s sysadmin was saying to his friends, probably over a beer, “That all sounds nice for you guys, but my developers don’t care about any of this. They’re not engaged.” The sysadmin probably went on to talk about how the developers don’t keep their configurations sane and how they never debug the problems they create.

This is why we need DevOps. (And probably, really, DevOpsQASales, but that’s another post).

How do you get started with DevOps at your work?

  • If you are a developer, invite your ops guys to your scrum or weekly meeting. Make sure they come and always ask them if what you are talking about has an impact on their work.
  • If you are an ops guy, invite your developers to your scrum or weekly meeting. Tell the developers about upcoming changes in each environment. Ask the developers what is going on in their world.

If you’re a small team, bring everyone; if you are large, bring one developer from each project. Either way, don’t invite the development managers or the ops managers. Invite the people who do the work. You need to have the developers who will say, “We’re implementing a feature that uses these extra libraries, can they be installed in production?” and the ops people who will say, “Oh, if you use v2.8 of that library it won’t work on the older machines because of x, can you guys use v2.9?”. You want this to happen before you go to production.

Adding meetings to your calendar always sucks, but you’ll save yourself headaches later by talking to each other, and more importantly, you’ll buy into the projects that everyone is working on. You’ll believe in the work others at your company are doing and want to help if there are issues.

The developer above should be talking to his ops guys on a daily basis. They should go for a beer and talk about technical problems at work. I guarantee one of them will say, “Oh for that I just do x y and z, and it works great.” and this will be the solution to a nagging problem.

Technology, of course, can help a lot. In the example above, the ops guy should:

  • Create virtual machine images of production on a regular schedule, and have the developers run their code on those images as part of the check-in cycle.
  • Use a configuration management tool such as puppet or chef to keep staging and QA environments matching production environments.
  • Talk to the developers about the network, hardware and software that make up the production environment, including details of the resource limitations of those components.

While the developer should:

  • Write up details of the software’s operating requirements in terms of resource usage, environment configuration, and other dependencies.
  • Package the software properly, so the ops people can review package manifests on upgrades automatically to track changes in QA and staging.
  • Codify operational issues in unit tests (lossy network unit tests, disk full unit tests, out of memory unit tests, blocked port unit tests).
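As an example of codifying an operational issue in a unit test, here is a sketch of a blocked-port test using Python’s unittest and socket modules. The start_listener function is a hypothetical stand-in for your service’s start-up code; the test occupies a port and then checks that start-up fails with a clear error instead of hanging or dying silently.

```python
import socket
import unittest


def start_listener(port, host="127.0.0.1"):
    """Hypothetical service start-up: bind and listen, or fail clearly."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        raise RuntimeError("port %d is unavailable: %s" % (port, exc)) from exc
    sock.listen(1)
    return sock


class BlockedPortTest(unittest.TestCase):
    def test_reports_port_in_use(self):
        # Occupy an OS-assigned free port, then start the service on it.
        blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        blocker.bind(("127.0.0.1", 0))
        blocker.listen(1)
        port = blocker.getsockname()[1]
        try:
            with self.assertRaises(RuntimeError):
                start_listener(port)
        finally:
            blocker.close()
```

The disk full, out of memory and lossy network cases follow the same pattern: create the bad condition in the test, then assert on how the software reports it.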

Working together:

  • Use build tools such as Hudson to trigger jobs on checkin – jobs which run the full set of unit tests on a production-like environment/virtual machine.
  • Define hand off procedures for new features, which require checkoffs from ops, QA and development.
  • Push the ops deployment scripts and methods to the developers’ workstations so the developers can use them.

Also, whenever possible you should automate that which can be automated. That’s what computers are really good at.

These are just a few examples. There’s a lot to talk about in later posts, such as more detail on the heavy use of automation, Agile practices in operations, proper storage of documentation (version control! plain text formats!) and more.

I hope this is a useful introduction to some DevOps concepts as I’ve understood them. Please comment below about your own experiences.