Moving faster with smaller steps

In August 1977 a human-powered aircraft made aviation history when it flew around a one-mile course while never dropping below an altitude of 3 metres. Consider it the minimum viable aircraft.

The Gossamer Condor won a £50,000 prize for its inventor, Dr Paul MacCready, who less than two years later designed its successor, the Gossamer Albatross, which flew 22 miles across the English Channel between England and France, earning a £100,000 prize in the process.

During the 1960s and ’70s several individuals and groups designed human-powered aircraft in an effort to win the Kremer prizes, cash awards established by the British industrialist Henry Kremer in 1959.

Although others had managed to create human-powered planes that flew short distances, they were heavy, hard to control, and even harder to repair when they crashed. MacCready’s creations were inspired by gliders and early aircraft. They were extremely lightweight and used very simple technologies – even by the standards of the 1970s – that could be repaired quickly.

MacCready realised the problem was not building an aircraft that could be human-powered, but building an aircraft that could be repaired at an airfield. Because the Gossamer Condor could be repaired on the runway, iterations took minutes, not months.

We try to embrace a similar philosophy about engineering at Intercom. We want to be able to move fast, try stuff out, and make mistakes. If something doesn’t work out, we figure out a new thing to try. Our aim is to move faster with smaller steps.

Continuous deployment is one of the main strategies for how we’ve achieved that to date. Rather than shipping new product quarterly, monthly or weekly, we deploy new features to our customers up to 50 times a day. We have developed an in-house tool, Muster, which enables us to do that efficiently.

In the future we believe a service-oriented architecture will enable us to continue to work in this manner even as our engineering teams grow.

In this talk, which I delivered at our recent Inside Intercom Engineering event, I expand on how both these concepts – continuous deployment and SOA – have helped Intercom. There’s a full transcript below the video.

“If you don’t know, CD – continuous deployment, SOA – service-oriented architecture. I guess, as Ben mentioned, we’re a product company, so I guess what does that mean for engineering? How does that define what engineering does is what I hope to address. So it’s probably good to start with the why of Intercom, like what’s the purpose of Intercom. I guess the simplest version of our mission is to make business on the web more personal. How do we do that? Well, we try to put together a really good product that addresses, like Ben mentioned, some of the features that customers will find useful.

How do we know what features are useful? We don’t. It’s trial and error really. Some features work really well. Some features we thought nobody would care about are really popular. Some features just kind of fall flat. So engineering’s job is to support that. We’ve got to be able to move fast. We’ve got to be able to make mistakes, try stuff out. If it doesn’t work, figure out a new thing to try.

I think a really great example of this is back in the 1950s in the U.K., there was a British industrialist called Henry Kremer and he created the Kremer prize. He became completely fanatical about human-powered flight. He really thought it was possible and it was going to revolutionize the world. So he offered, I think, £50,000, which was quite a chunk of money in 1950 and still quite a chunk of money now, to the first person who could pilot an aircraft powered by themselves around two markers a half mile apart.

So people tried it. People made lots of different attempts, but it wasn’t won until 1977 by an American engineer called Paul MacCready, and the way he approached the problem was completely different to how everyone else had approached it. So he noticed that what people had been doing was building a plane with the most advanced materials they could find and taking guesses about what would work well and what wouldn’t and spending about a year building a prototype, taking it to an airfield, flying it, crashing it, taking it home, spending about a year repairing it, taking it back to an airfield. And he realized the problem wasn’t building an aircraft that could be powered by a human. The problem was building an aircraft that can be repaired at an airfield.

So he tackled it using not the state-of-the-art stuff, much simpler tools, much simpler technologies even for the ’70s, and he built something that could be repaired and fully rebuilt in about six to eight hours. So I think in under a year after starting it, he’d won the prize. And he actually went on to, there was a second level of Kremer prize, I think it was £100,000 to fly across the Channel, and Paul MacCready won that too with one of his designs.

So that’s kind of the philosophy of engineering at Intercom. We want to support the product moving fast. Well, actually one of the real nice things I like about Intercom right now is that we’re defining our culture, and one of the things that we think is pretty critical is a culture of shipping. Yeah, thanks Jamie. What does shipping at Intercom look like? Probably very similar to the tools that you guys use if you’re writing code. I guess some of the emphasis is on moving quickly. So on your first day, you’ll commit code and move it to production. In your first week, you should deliver something pretty meaningful. In terms of tools, I don’t think there’s anything too surprising. GitHub for code, Circle for tests, and their parallelism makes our tests run pretty fast, and then staging and production on AWS, not particularly controversial.

I guess what’s slightly different about Intercom is that for a lot of companies near here and around the world and probably not too far from us right now, they spend a lot of time in the first two sections. They spend maybe a quarter or half a year and then they’ll do a big deployment and push to production. At Intercom, we kind of move a little bit quicker. We do the full run about 50 times a day. We have our own in-house, custom-developed, Darragh-developed really, deployment system. And it’s based on a mixture of designs that we saw in the open source world and things that inspired us in other places, and it works pretty well. It lets us move pretty quickly.

We’ve also got some other things that help us move quick. I think, as Darragh mentioned before, we test relentlessly on ourselves, so feature flagging is pretty common and we’ll release to ourselves first and then maybe to a select group of customers and then roll it out and we get pretty early feedback if something’s going to work or not. So overall, I think we’re probably doing pretty well in terms of like the standard development process. There’s a list of things I could think of that make me a productive software developer. And I think we’re doing pretty well on all those. There’s definitely things missing from that list. And if you’ve got suggestions, I’m very open to hearing them. But we’re really looking for what’s going to make it an order of magnitude better. What’s the next biggest impact we can have? And we thought about it and we really think it’s a service-oriented architecture.
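The staged release described above (ourselves first, then a select group of customers, then everyone) can be sketched roughly as follows. This is a minimal illustration, not Intercom’s actual flagging code: the `FeatureFlag` class, its field names, and the hash-based percentage rollout are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical feature flag with three release stages."""
    name: str
    internal_apps: set = field(default_factory=set)  # stage 1: ourselves
    beta_apps: set = field(default_factory=set)      # stage 2: select customers
    rollout_percent: int = 0                         # stage 3: wider rollout

    def enabled_for(self, app_id: str) -> bool:
        # Internal and beta apps always see the feature.
        if app_id in self.internal_apps or app_id in self.beta_apps:
            return True
        # Hash the flag name and app id so each app lands in a stable
        # bucket (0-99) and does not flip between requests.
        digest = hashlib.sha256(f"{self.name}:{app_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new_inbox", internal_apps={"intercom"}, beta_apps={"acme"})
print(flag.enabled_for("intercom"))       # internal app: enabled
print(flag.enabled_for("someone_else"))   # 0% rollout: not enabled yet
```

Raising `rollout_percent` step by step widens the audience without another deploy, which is what makes early feedback cheap.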

So if you’re not familiar, the idea is basically you take a big application and you split it up into lots of smaller services and you compose your bigger service from lots of these smaller services. And I think from Bill’s talk, there are definitely benefits and trade-offs on the technical side. There are risks that you open up and there is a whole bunch of infrastructure that you have to have in place. But from my own perspective I think organizational benefits are the biggest win from service-oriented architecture. The communication overhead, when you’re dealing with teams across a company, kind of gets boiled down to just an API and some SLAs on that API. So we don’t need to have a lot of meetings to figure out what exactly you need from us. “Here is your API. Code against it, and it should work within these parameters.”

So we’re a product company and we’re not just going to rewrite a whole bunch of code just because it’s kind of technically cool or because it would be fun to write that code. So really, we kind of kept SOA sitting there on the table until we found a problem that it would fit well to. So a really great example is an email problem we had. This is actually in the responsibility of my team. A lot of customers were complaining, well, a certain group of customers were complaining that a lot of their messages were being flagged as spam. So we did some digging, did some investigation, and we found that we had an SPF problem. A whole lot of people, we were sending mail on their behalf, but they hadn’t configured a pretty arcane thing, I won’t bother you with, about mail sending and authentication. And only 20% know how to properly configure it, so 80% of our customers had a pretty bad experience around this.

So I did some digging, picked some customers, sent some messages, put some documentation up there, and kind of manually messaged them and said, “Hey, your domain, if you just add this record, you should see your problems go away and you’ll find things work a little bit better,” and their response was great. Our customers really responded well and saw their problems disappear immediately. So at that point, we reckoned we had a technology problem. The manual messaging of users is not going to scale, so we needed to automate it somehow. And that’s where Spiffy, sorry, SPF, my name, SPF Validation Service came in. It seemed like a natural place to do it. Checking SPF records doesn’t really belong in a Rails application. And it felt like a safe way to start using service-oriented architecture since it’s not core to the product. If the service fails, we’ll get a whole bunch of pages and we’d be alarmed if that happens, but if it does fail, it’s not going to cause any errors for the users. People aren’t going to get a 500 response in the browser if the service is having problems. So it seemed like a pretty safe way to do it.
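The point that a failing side service should alert the team without surfacing errors to users can be sketched as a thin wrapper around the client call. The talk does not show how the real client handles failures, so the function name `check_spf_safely` and the injected `check` callable here are hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("spf_client")


def check_spf_safely(domain: str, check: Callable[[str], bool]) -> Optional[bool]:
    """Call a non-critical service; return None instead of raising on failure."""
    try:
        return check(domain)
    except Exception:
        # Log with traceback so monitoring can alert whoever is on call,
        # but never let the failure propagate into the user's request.
        logger.exception("SPF check failed for %s", domain)
        return None


def unavailable_service(domain: str) -> bool:
    # Stand-in for a remote call that times out.
    raise TimeoutError("validation service unreachable")


print(check_spf_safely("example.com", unavailable_service))  # None
```

The caller treats `None` as "unknown" and simply skips the nudge for that customer, so the main application keeps serving pages while the service is down.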

So I wrote the code. In this case it happens to be Java with Dropwizard just because it had a pretty handy SPF library. The Ruby one is not licensed in a friendly way. Wrote a client for it to hook it up to the app, deployed it. It took about two days and that’s the output. It uses the standard Intercom APIs to query what users might be affected and then send them some messages to say, “Hey, you should go check that out.”
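The actual service was written in Java with Dropwizard and an existing SPF library, and its code is not shown here. As a rough idea of the kind of check involved, the sketch below looks through a domain’s TXT records for an SPF record that includes a given sending domain. It is a deliberate simplification: full SPF evaluation covers many more mechanisms, the DNS lookup itself is left out, and `mail.example.net` is a placeholder sender domain.

```python
from typing import Iterable, Optional


def find_spf_record(txt_records: Iterable[str]) -> Optional[str]:
    """Return the first TXT record that is an SPF version 1 record, if any."""
    for record in txt_records:
        text = record.strip()
        lowered = text.lower()
        # An SPF record is "v=spf1" alone or followed by a space and terms.
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return text
    return None


def authorises_sender(txt_records: Iterable[str], sender_domain: str) -> bool:
    """True if the SPF record passes mail via an include: of sender_domain."""
    record = find_spf_record(txt_records)
    if record is None:
        return False
    wanted = f"include:{sender_domain}".lower()
    for term in record.split()[1:]:
        # Only a bare or "+" qualified include counts as a pass.
        mechanism = term[1:] if term.startswith("+") else term
        if mechanism.lower() == wanted:
            return True
    return False


records = ["google-site-verification=abc", "v=spf1 include:mail.example.net ~all"]
print(authorises_sender(records, "mail.example.net"))        # True
print(authorises_sender(["v=spf1 -all"], "mail.example.net"))  # False
```

Customers for whom a check like this returns `False` are the ones who would receive the "add this record" message.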

So that’s kind of the end of the story, but I suppose the last thing that’s missing from that is a graph of how well we actually did. It turns out not too bad. We saw a pretty good return and it’s still working right now. And that increase has tailed off, obviously, as we’ve kind of hit everyone. And actually, that graph, it comes from a blog post that we did about this more from the product side, but it’s still an interesting read just about how SOA and continuous deployment, moving fast and breaking up the complexity of our system a bit helped us get the right result for our customers.”