Building autonomously: Improving our production systems in small steps

Main illustration: Emilio Santoyo

At Intercom, we have a few values that underpin our engineering culture – moving fast but optimizing for the long term; doing less but doing it better; and taking ownership of our areas of responsibility and the things we build.

These values allow us to be more autonomous as engineers because they provide a structure for us to make better decisions. When you’re used to working with them all the time, they can fade into the background a bit, but they become especially useful when working in isolation from your team.

A few months ago, I was visiting Chicago for personal reasons and would be working away from my team for two weeks. There wouldn’t be other engineers in my timezone for most of the day, so I needed to pick a self-contained project that I could own from start to finish.

My engineering team, Production Systems, is responsible for the availability, scalability, security and costs of Intercom’s infrastructure. One item from our backlog stood out: reduce our Amazon Web Services (AWS) costs by up to 50% by maximizing our use of Spot Instances. The cost of this spare AWS computing capacity varies on an hourly basis according to availability and demand.

We manage EC2 hosts with Auto Scaling Groups (ASGs). AWS makes it quite easy to launch Spot Instances in your ASGs: you just set the maximum spot price you are willing to pay. But the AWS Spot Market can sometimes push the spot price very high, leaving clusters without capacity, which makes it dangerous to use Spot Instances for “near-realtime” clusters. If, during these market blips, we could replace Spot Instances with On-Demand Instances – where you only pay for the EC2 instances you actually use – we could extend the use of Spot Instances and significantly reduce our AWS operating costs.
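As context, here is a minimal sketch of how a maximum spot price can be set on an ASG’s launch configuration with the AWS SDK for Ruby. The configuration name, AMI and price are illustrative, not values we actually use.

    require 'aws-sdk-autoscaling' # AWS SDK for Ruby v3

    autoscaling = Aws::AutoScaling::Client.new(region: 'us-east-1')

    # A launch configuration with a spot_price means every instance the ASG
    # launches is a Spot Instance, capped at that maximum hourly bid.
    autoscaling.create_launch_configuration(
      launch_configuration_name: 'example-spot-workers', # illustrative name
      image_id: 'ami-12345678',                          # illustrative AMI
      instance_type: 'c5.xlarge',
      spot_price: '0.10' # maximum hourly price (USD) we are willing to pay
    )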

At the beginning of the year, the team tried to deploy an open-source project called Autospotting. From the description it seemed to address all our problems, but after deployment we decided it wouldn’t work for us, as it seemed designed for less complex architectures than ours.

We needed to make the use of Spot Instances in Intercom safer and easier and felt we had two options: fork Autospotting to make it work in our environment and get the changes upstreamed later; or build our own implementation of Autospotting with just the bits we needed.

This was an interesting technical challenge, relatively small but with a big impact – exactly the type of project that would fit my trip.

Reducing friction

To optimize my two weeks away, I needed agreement from my team on a system design in advance in order to reduce communication friction while in another timezone.

Autonomy doesn’t just mean “working on your own”. It means you get to choose a project, own the design and set your own success criteria. At Intercom, we don’t require sign off – we are expected to do “sufficient due diligence” ourselves and bring in other stakeholders and senior engineers as needed.

This means early alignment on a design is critical to enabling engineers’ autonomy. So I evaluated each option and shared my thoughts with the team for feedback. Here’s a summary:

  • Option 1: Fork Autospotting. Autospotting is a Go binary that runs in Lambda and has some limitations. Go is becoming common at Intercom but is not a standard technology yet – half of my team, myself included, had never worked with it. Patching Autospotting would have meant learning Go and its codebase, then adapting it to our needs, all in two weeks while away from the team and six hours ahead of them.
  • Option 2: Build our own solution. With carte blanche, I could start coding in Ruby from day one, build only the features we needed, and get the V1 of the service into production by the end of week two.

I preferred option 2: this service was going to be a core component of our infrastructure, so it had to be reliable and understood by everyone on the team. I wasn’t comfortable forking and patching Autospotting without more support from my team; the tradeoffs of continuing with it were too high, and I could have more impact, more quickly, by building our own implementation.

The team agreed with my plan. I committed to delivering a “cupcake” version of Autospotting for Intercom within a week and left for the Windy City.

Baking a cupcake

I wanted to start small by building something that confirmed my approach and proved to myself and my team that building our own solution would work. So on Monday morning, I began prototyping a simple synchronous Ruby service that would replace the instances in my test ASG and optimize the cost. I called it ASG-optimizer.

I could have set out to design the perfect service, one that covered every edge case, scaled perfectly and ran extremely efficiently. But with just two weeks to work on the project, I didn’t want to spend time on problems I didn’t have yet; I wanted to get something working and prove that option 2 was truly a reasonable alternative to Autospotting.

By the end of the week I had a single process that synchronously watched my ASG, took an On-Demand Instance, made a Spot copy of it, and swapped the two once the new instance passed its EC2 checks. Here’s what my initial commit looked like:

[Embedded GitHub gist: the initial commit of the ASG-optimizer]
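That commit isn’t reproduced here, but a rough sketch of a synchronous loop along these lines might look like the following. This is an illustration only: the ASG name, the candidate selection and the exact swap logic are assumptions, not the real code.

    require 'aws-sdk-autoscaling'
    require 'aws-sdk-ec2'

    ASG_NAME = 'asg-optimizer-test' # illustrative name, not our real ASG
    asg = Aws::AutoScaling::Client.new
    ec2 = Aws::EC2::Client.new

    loop do
      group = asg.describe_auto_scaling_groups(
        auto_scaling_group_names: [ASG_NAME]
      ).auto_scaling_groups.first

      # Pick the first in-service instance as a replacement candidate
      # (a real service would also skip instances that are already Spot).
      candidate = group.instances.find { |i| i.lifecycle_state == 'InService' }
      unless candidate
        sleep 60
        next
      end

      on_demand = ec2.describe_instances(instance_ids: [candidate.instance_id])
                     .reservations.first.instances.first

      # Launch a Spot copy of the candidate instance.
      spot = ec2.run_instances(
        image_id: on_demand.image_id,
        instance_type: on_demand.instance_type,
        subnet_id: on_demand.subnet_id,
        min_count: 1, max_count: 1,
        instance_market_options: { market_type: 'spot' }
      ).instances.first

      # Wait until the new instance passes its EC2 status checks.
      ec2.wait_until(:instance_status_ok, instance_ids: [spot.instance_id])

      # Swap: attach the Spot Instance, then detach and terminate the
      # On-Demand one, leaving the group's desired capacity unchanged.
      asg.attach_instances(auto_scaling_group_name: ASG_NAME,
                           instance_ids: [spot.instance_id])
      asg.detach_instances(auto_scaling_group_name: ASG_NAME,
                           instance_ids: [candidate.instance_id],
                           should_decrement_desired_capacity: true)
      ec2.terminate_instances(instance_ids: [candidate.instance_id])
    end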

Making a cake

To make the ASG-optimizer production ready, the next step was to fix the biggest pain point with the proof of concept: scalability. With the initial commit, I could only replace one of our thousands of hosts per minute, which just doesn’t work for an infrastructure of Intercom’s size. So I moved the ASG-optimizer to an asynchronous model and used SQS to split the single process into three (a sketch of the first one follows the list):

  • Process 1 monitored all Auto Scaling Groups with the tag “AsgOptimiser” and added any replacement candidates to a ‘Spot Instance Creator’ SQS queue.
  • Process 2 polled the ‘Spot Instance Creator’ queue, created a Spot Instance, then added [new Spot Instance ID, old On-Demand Instance ID] to the ‘Instance Swapper’ SQS queue.
  • Process 3 polled the ‘Instance Swapper’ SQS queue, waited until the Spot Instance passed EC2 checks, swapped the instances, and terminated the old On-Demand Instance.
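Here is what that first process might look like as a minimal sketch with the AWS SDK for Ruby. The queue name, the tag handling and the polling interval are illustrative assumptions, not the production code.

    require 'aws-sdk-autoscaling'
    require 'aws-sdk-sqs'
    require 'json'

    asg = Aws::AutoScaling::Client.new
    sqs = Aws::SQS::Client.new
    # Assumed queue name for the 'Spot Instance Creator' queue.
    queue_url = sqs.get_queue_url(queue_name: 'spot-instance-creator').queue_url

    loop do
      asg.describe_auto_scaling_groups.auto_scaling_groups.each do |group|
        # Only touch groups that opted in via the tag.
        next unless group.tags.any? { |t| t.key == 'AsgOptimiser' }

        group.instances.each do |instance|
          next unless instance.lifecycle_state == 'InService'
          # Enqueue the instance as a replacement candidate; process 2 decides
          # whether it is On-Demand and, if so, launches the Spot copy.
          sqs.send_message(
            queue_url: queue_url,
            message_body: {
              auto_scaling_group_name: group.auto_scaling_group_name,
              instance_id: instance.instance_id
            }.to_json
          )
        end
      end
      sleep 60
    end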

[Diagram: the three ASG-optimizer processes connected by the ‘Spot Instance Creator’ and ‘Instance Swapper’ SQS queues]

At the end of the second week I deployed the new asynchronous service to production, optimizing a single Auto Scaling Group, prepared a rollout plan for my team and came back to Dublin. On the following Monday morning, I shipped it for all our near-realtime clusters and learned that it needed a bit more work, such as reducing the number of API calls to AWS and introducing orphan-instance cleanup.

ASG-optimizer was now enabled for about 140 Auto Scaling Groups, replacing about 650 instances every day. Over time, this would halve our EC2 bill.

[Chart: EC2 cost improvement after rolling out the ASG-optimizer]

The icing on the cake

Even two weeks of engineering work can have a significant impact on operating costs. But even though ASG-optimizer is now in production, and the project is technically considered done, I’m still discovering ways to improve it.

We recently noticed that the ASG-optimizer doesn’t play well with our previous cost strategy, which used Reserved Instances to save money on instances we knew we would need long-term: we are now using too many Spot Instances and not fully using our Reserved Instances. To solve this, we’ve been thinking about making the service Reserved Instance-aware. I’d also love to start using Spot Instances not just for batch-processing workers but for user-facing web fleets too.

The ASG-optimizer workflow could also work for rolling out our custom Amazon Machine Images (AMIs), which define the operating system for our EC2 instances. Today, our continuous integration and continuous deployment system handles AMI rollouts; they can block deployments and put pressure on our datastores when everything restarts at once. Offloading that work to the ASG-optimizer would help us replace instances running old AMIs safely, and would remove the duplicated logic and complexity.

ASG-optimizer was built with the intention of scratching an itch that Intercom had, and may not work for everyone, but if there is interest from the community, we are willing to open-source it.

When you work autonomously, getting alignment with your team is critical. Above all, remember that adhering to the same approaches and principles as you do while working as part of the team will bring the best results.

If this sounds like the sort of place you would like to advance your engineering career, we’re hiring for systems engineers.