Shipping fast with smaller steps

Former Director of Engineering, Intercom

June 30, 2017

Main illustration: Ally Reeves

How do we know what features are useful? We don’t. It’s trial and error really.

Some features work really well. Some features you think nobody would care about are really popular. Some features just fall flat. It’s engineering’s job to support this. You need to be able to move fast, to be able to make mistakes and to come up with a bigger and better idea quickly.

Here’s a good example. In August 1977 a human-powered aircraft made aviation history when it flew around a one-mile course while never dropping below an altitude of 3 metres. Consider it the minimum viable aircraft.

The Gossamer Condor won a £50,000 prize for its inventor, Dr Paul MacCready. Less than two years later he designed its successor, the Gossamer Albatross, which successfully flew 22 miles across the English Channel between England and France, earning a £100,000 prize in the process.

During the 1960s and 70s several individuals and groups designed human-powered aircraft in an effort to win the Kremer prizes, cash awards established by the British industrialist Henry Kremer in 1959.

Although others had managed to create human-powered planes that flew short distances, those planes were heavy, hard to control, and even harder to repair when they crashed. MacCready’s creations were inspired by gliders and early aircraft. They were extremely lightweight and used very simple technologies that could be repaired quickly.

MacCready realised the problem was not building an aircraft that could be human-powered, but building an aircraft that could be repaired at an airfield. Because the Gossamer could be repaired on the runway, iterations took minutes, not months.

We try to embrace a similar philosophy about engineering at Intercom. Our aim is always to move faster with smaller steps.

Our process for shipping probably looks pretty similar to your company’s. In terms of tools, there’s nothing too surprising. We use GitHub for code, Circle for tests, and then staging and production on AWS. Nothing controversial. But unlike a lot of other companies, we don’t spend a lot of time in the first two stages. Many companies spend maybe a quarter or half a year coding and testing, and then they’ll do a big deployment and push to production.

We think about things a little bit differently. Rather than shipping new product quarterly, monthly or weekly, we deploy new features to our customers up to 50 times a day. We do this thanks to a custom-developed tool called Muster that manages our infrastructure and how we deploy. It’s based on a mixture of designs we saw in the open source world but its premise is simple: when a new change appears, prepare it in a way that’s repeatedly deployable and safely manage that deployment. It gives us the ability to safely roll back and to scale up or down different parts of the infrastructure, and do so quickly.
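To make the “safely roll back” idea concrete, here is a minimal sketch in Python. It is not how Muster works internally (the article doesn’t describe that); the `ReleaseHistory` class and the build identifiers are hypothetical. It only illustrates the premise that if every change becomes an immutable, repeatably deployable build, rolling back is just making the previous build live again.

```python
from typing import List, Optional


class ReleaseHistory:
    """Tracks which immutable build is live so a rollback is a single step.

    Hypothetical illustration only; not Intercom's deployment tooling.
    """

    def __init__(self) -> None:
        self._releases: List[str] = []

    @property
    def live(self) -> Optional[str]:
        # The most recently deployed build is the one serving traffic.
        return self._releases[-1] if self._releases else None

    def deploy(self, build_id: str) -> str:
        # Builds are never modified after creation, so deploying the
        # same build_id twice always means deploying the same code.
        self._releases.append(build_id)
        return build_id

    def rollback(self) -> str:
        # Drop the live build and make the previous one live again.
        if len(self._releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._releases.pop()
        return self._releases[-1]


history = ReleaseHistory()
history.deploy("build-101")
history.deploy("build-102")
history.rollback()  # "build-101" is live again
```

The smaller each build is, the less there is to reason about when one of them has to be rolled back.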

It’s not the only thing that helps us move quickly. We test relentlessly on ourselves, so feature flagging is a regular part of how we build product. We’ll release to ourselves first, then to a select group of customers, and then roll it out to all our customers. It means we get early feedback on whether something’s going to work or not.
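A staged rollout like this can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Intercom’s actual flagging system: the `Stage` and `FeatureFlag` names, the account IDs and the flag name are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Stage(Enum):
    OFF = 0
    INTERNAL = 1  # our own team only
    BETA = 2      # our team plus a select group of customers
    EVERYONE = 3  # all customers


@dataclass
class FeatureFlag:
    """Hypothetical flag that widens its audience one stage at a time."""

    name: str
    stage: Stage = Stage.OFF
    internal_accounts: Set[str] = field(default_factory=set)
    beta_accounts: Set[str] = field(default_factory=set)

    def is_enabled(self, account_id: str) -> bool:
        if self.stage is Stage.EVERYONE:
            return True
        if self.stage is Stage.BETA:
            return (account_id in self.internal_accounts
                    or account_id in self.beta_accounts)
        if self.stage is Stage.INTERNAL:
            return account_id in self.internal_accounts
        return False


flag = FeatureFlag(
    name="new-composer",
    stage=Stage.INTERNAL,
    internal_accounts={"our-own-workspace"},
    beta_accounts={"friendly-customer"},
)
flag.is_enabled("our-own-workspace")  # True: we see it first
flag.is_enabled("friendly-customer")  # False until the stage is BETA
```

Moving the flag from one stage to the next is a data change rather than a deployment, which is what makes the early feedback cheap to collect.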

But over the past few years, one thing has helped us ship faster more than anything else: moving to a service-oriented architecture.

If you’re not familiar, the basic idea is to take a big application, split it into lots of smaller services and then compose your bigger service from those smaller services. Full disclosure: there are definitely benefits and trade-offs on the technical side. There are risks that you open up, and there is a whole bunch of infrastructure that you have to have in place.

But in my humble opinion, the organizational benefits of a service-oriented architecture far outweigh the trade-offs. When you’re dealing with teams across a company, the communication overhead can be boiled down to an API, and some SLAs on that API. You don’t need to have a lot of meetings to figure out exactly what you need. “Here is your API. Code against it, and it should work within these parameters.” As a product company, we’re not going to rewrite a whole bunch of code just because it’s fashionable, or because it would be fun to write that code. We keep service-oriented architecture sitting on the table until we find a problem that we really think it would fit well.

A really great example is an email problem we had. A certain group of customers were complaining that a lot of their messages were being flagged as spam. As you can imagine, for a company like Intercom whose mission is built around personal, non-spammy messaging, this was a disaster! So we did some investigation and found that we had an SPF problem. Because many of our customers hadn’t set up a fairly arcane piece of email authentication on their domains, many of their messages were ending up in the spam folder.

So I did some digging, picked some customers, sent some messages, put some documentation up and reached out to customers to let them know: “If you just add this record to your domain, you should see your problems go away.” The response was fantastic. The problem disappeared almost immediately.

At that point, we were faced with a technology problem. Reaching out to customers 1:1 is not going to scale, so we needed to automate it somehow. And that’s where the SPF Validation Service came in. It was a natural place to do it, as checking SPF records doesn’t really belong in a Rails application. And it felt like a safe way to start using service-oriented architecture since it’s not core to the product. Because it was decoupled, a failure in the service isn’t going to cause errors for our users. People aren’t going to get a 500 response in the browser if the service is having problems. It was a simple, safe way to solve our problem.
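That kind of decoupling can be sketched roughly as follows. This is a generic Python illustration rather than the actual client described in this article: `check_spf_safely` and the `check` callable it wraps are hypothetical names. The point is simply that a failure in a non-core service is logged and swallowed instead of surfacing as an error to the user.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("spf_client")


def check_spf_safely(check: Callable[[str], bool], domain: str) -> Optional[bool]:
    """Call a non-core service and degrade gracefully if it is down.

    Returns the service's answer, or None when the call fails, so the
    caller can carry on instead of returning a 500 to the user.
    """
    try:
        return check(domain)
    except Exception:
        # The service is not core to the product: record the failure
        # for later investigation and move on.
        logger.warning("SPF check unavailable for %s", domain, exc_info=True)
        return None


def broken_service(domain: str) -> bool:
    # Stand-in for a service that is currently unreachable.
    raise ConnectionError("SPF validation service is unreachable")


result = check_spf_safely(broken_service, "example.com")  # None, no exception
```

A caller treats `None` as “unknown, try again later”, which is acceptable precisely because nothing in the core product depends on the answer.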

So I wrote some Java with Dropwizard, which has a pretty handy SPF library. (The Ruby one is not licensed in a friendly way.) I wrote a client to hook it up to the app, deployed it, and we now have a service that uses the standard Intercom APIs to query which users might be affected and then sends them a message to say, “Hey, you should go check that out.”
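For a feel of what such a check involves, here is a deliberately simplified Python sketch (the real service was written in Java and used an SPF library). It assumes the domain’s TXT records have already been fetched and only looks for a passing `include:` of the sender’s domain. Full SPF evaluation has more to it, such as redirects and nested includes. The function names and `mail.example.com` are placeholders, not Intercom’s real values.

```python
from typing import Iterable, Optional


def find_spf_record(txt_records: Iterable[str]) -> Optional[str]:
    """Return the first TXT record that is an SPF policy, or None."""
    for record in txt_records:
        text = record.strip()
        lowered = text.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return text
    return None


def authorizes_sender(txt_records: Iterable[str], sender_domain: str) -> bool:
    """True if the SPF policy has a passing include: for sender_domain."""
    record = find_spf_record(txt_records)
    if record is None:
        return False
    wanted = "include:" + sender_domain.lower()
    # Skip the leading "v=spf1" version tag and inspect each term.
    for term in record.lower().split()[1:]:
        # A leading "+" is the explicit form of the default "pass" qualifier.
        if term.lstrip("+") == wanted:
            return True
    return False


records = [
    "google-site-verification=abc123",
    "v=spf1 include:mail.example.com ~all",
]
authorizes_sender(records, "mail.example.com")      # True
authorizes_sender(["v=spf1 -all"], "mail.example.com")  # False
```

Customers whose domains fail a check like this are the ones worth messaging.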

This is just one example among hundreds of how we’ve taken small steps instead of shipping large, complex solutions to our problems. This dedication to shipping small pieces of valuable code on a regular cadence is something to be especially mindful of as your company grows. If your company is successful, over time a lot of things are going to grow. You’re going to hire more people, you’re going to write more code, and your codebase is going to grow. All this conspires to add complexity and slow you down.

By having a fast deployment time and shipping bit by bit, you’ll shorten the feedback loop and minimise complexity and dependencies, which means you can adapt and learn as you continuously push code to production.