Honeycomb’s Charity Majors on operations engineering

Former Software Engineer, Intercom

June 29, 2017

If your startup is gaining traction, nothing will kill its momentum or the trust of your users quite like downtime. Availability is everything.

https://art19.com/shows/inside-intercom/episodes/089a29ec-6a9a-4148-a27b-750425520ca3

Charity Majors knows this all too well. She’s been on call since the age of 17 and has done stints as every sort of systems engineer and manager. Most notably at the mobile app development company Parse, where she was the very first ops and infrastructure hire, and later at Facebook, where she managed a team of production engineers responsible for the care and feeding of over 500,000 mobile apps. Today Charity is a co-founder at Honeycomb, a real-time observability tool to help you understand, run, debug, and optimize your own production software.

I recently hosted Charity on our podcast to learn about when a startup actually needs a dedicated ops team, why engineers need to get comfortable with failure, the role cross-pollination plays in a healthy engineering culture, and much more. If you enjoy the conversation, check out more episodes of our podcast. You can subscribe on iTunes or grab the RSS feed in your player of choice. Short on time? Below are five key takeaways:

One important lesson that all technical co-founders must learn: startups typically don’t fail because of the tech. They fail because of poor execution on the business side, e.g. sales, marketing and support.
A startup doesn’t need a dedicated Ops team right away. In the early days, everyone should be doing their part to create a high quality operational culture.
When it comes to observability, dashboards often become a detriment, because they’re simply snapshots of the past. Charity sees data-driven debugging as the smarter approach.
Engineers should feel empowered to swing back and forth between management and contributor roles. The best in either role have experience in the other.
Encourage your different engineering teams – ops, security, frontend, etc – to sit and work together. Often the most creative ideas come from cross-pollination.

Ingrid: Welcome Charity, it’s great to have you with us today. To get things started, tell me about your journey from being an engineer to now being a founder.

Charity: It was very accidental. I think of myself as much more of a person who executes, not a person who has ideas. A lot of us on the backend approach software this way. It’s like, “Alright, you’ve got an idea and people like it, I’ll make it work for you.” I’ve always taken a lot of pride in that, and the problem that we are trying to solve became one that I literally felt like I couldn’t do my job without.

We had built this thing at Parse using a lot of Facebook technology. We were serving more than a million mobile apps on Parse, and we had all these problems of high-cardinality, platform problems, things where we built this elaborate system, on top of Scuba, to help us diagnose what was happening in production. The idea of going back to not having that was basically intolerable. So, that’s how I find myself here.

Ingrid: Is that how Honeycomb was born?

Charity: My co-founder Christine Yen tells the story somewhat differently. At Parse, I ran the backend infrastructure team and Christine built our analytics product. So she’s building this analytics product, it’s like Mixpanel for mobile apps, and she kept having this experience where our users would write in and ask her how they could answer these questions using our analytics product. And she’d say, “This is so embarrassing, but you can’t.” Because it’s using time series aggregates, on top of Cassandra. She would have to go and look it up inside what we were using internally to answer their questions. When I told her that I wanted to build this, she was like, “Oh my god, I have felt this pain so many times.” So she comes at it from much more of the frontend, and I come at it from like the operational side.

Ingrid: So Honeycomb was born out of what you wish you had at your previous jobs. But looking at your career as a whole, what was it that drew you to operations and infrastructure?

Charity: I really hate monitoring. I hate everything about it. In a very real way, this as a startup was born of hate, and not wanting to ever have to do this again. Our plan, in the beginning, was, “Well, people are shoveling venture capitalist dollars at us. Let’s just take their money, build something, we’ll fail, we’ll open source it, and then we never have to live without it again.” That was the master plan, but we’re accidentally succeeding.

What brings me to operations? I grew up on a farm, and I have a very low tolerance for things that don’t need to be done, for the frills. I don’t sit here and play around with technology. I’m highly motivated by doing what needs and has to be done. And the thing I always loved about Ops is how close it is to the metal. That the business will succeed or fail based on what you do here today. Maybe I have a God complex, but I really enjoy that pressing need, and that urgency.

Ingrid: I feel quite similarly, I love operations and infrastructure specifically because it allows me to take monsters and make them simple and approachable.

Lessons learned as a technical co-founder

Ingrid: We had Marc Hedlund on the show, and he joked that one of the most difficult things he had to learn as a founder was how to become a salesperson. What’s the most unexpected skill you now have to care about because you are a founder?

Charity: He’s absolutely right, sales is hard. One of the things about Honeycomb that I value is that we’ve all been through so many failed startups, and it never fails because of its technology. It’s never because of the tech. This has been very humbling to realize as I grow up. Startups do not succeed or fail on the back of which programming language you choose. It’s all about execution, and that’s execution on the business side – the sales, the marketing and the support.

I may be over-correcting a bit, but everything matters more than the tech! Maybe this is just because we have the tech handled. Everything else is hard. When we talk to candidates coming in the door this is a refrain that we repeat over and over. You can tell how they’re going to work almost by seeing whether their eyes light up and they’re like, “Yes, this is true,” or whether they just have this distaste in their eyes. We value all of these skills, the squishy skills, around customer success.

Sales is really hard. I was trying to sell Honeycomb for almost a year, by myself, and I’ve never bought software before, so this was a really hard conversation for me to learn. How do you ask for a dollar value for the thing you’ve been pouring your lifeblood into? I come from the open source world. I don’t know how to do that.

I had almost decided that we had failed, that this thing couldn’t be sold, and I hired a salesperson as a last-ditch effort. He took us from zero to hundreds of thousands of dollars, within five weeks! I suddenly realized, “Oh … this is a skillset.” I had known that but had not viscerally internalized it.

When does a startup need ops and observability?

Ingrid: You mentioned candidates. When should a startup start caring about hiring its first operation specialists? Is there anything that signals, “Hey, it’s about time”?

Charity: We are actually at that stage now, where we’re asking the question, “When do we need to have an Ops team?” In some ways we’re cheating, because we’ve gotten a long ways with me and Ben Hartshorne. In a month or two of work we basically did all the Ops work that we’ve been piggybacking on for the last year and a half. But I think most people hire Ops too soon, and they do that because they aren’t willing, or don’t know how, to learn the basics themselves.

As a founder, or as a hiring manager, the first step in making any good hire is to do the job yourself. At least fill out the contours of it, to know how to recognize a good one versus a bad one. People often aren’t willing to do that, because it’s painful and time-consuming, but I highly recommend it.

Most people should start hiring a dedicated Ops team when they have a rotation that all of their software engineers are participating in, from day one, and they’re doing okay, but more and more of their time is getting eaten up in Ops work that they cannot automate away. At some point you need to call in a specialist.

Charity gives a talk at Heavybit about how to hire a tech ops team.

Sometimes people just call in a consultant, who’s really good at helping people make the jump. The thing you don’t want to do is be like, “Okay, I have ten software engineers. I’m going to hire a team of three Ops people and have them do all the Ops work.” That’s how this traditionally goes, and it’s not a good pattern, because you really want everyone to continue to participate in the reality of creating a really high quality operational culture.

Ingrid: How soon should a startup care about observability? And at the opposite end, when it’s an established company that already has multiple vendors that solve different problems like performance, and dashboards, and logging, which can become a bit of an operational hell, where does event-level monitoring fit into all that? How can it be added without it making things worse?

Charity: The thing I love about this question is there’s no template. There are patterns that you can learn from, but everybody has to learn this over again for themselves. How soon should a startup care about observability? Literally from negative day one. From the first day. This should be so much about habit that it’s like commenting your code. This is literally how your code explains itself back to you, and you don’t know what’s going on. You can never count on it being a throwaway piece, and you’re always going to want to understand how it works. Part of the reason people even ask this question is because there’s been so much impedance, so much friction involved in getting started.

Say I’m spending my day one writing a dummy app for Honeycomb – do I really want to spin up a bunch of infrastructure to monitor that? No, I don’t. But there are services now.

At the opposite end, when you already have a lot of vendors, and you already have a lot of stuff that’s been implemented, there are a couple of places you can start. Begin from the perspective of the people who are consuming (the information), which should be everyone. How do you make it simplified? How do you pick something that spans a lot of ground and gives people power?

You want everyone to participate in creating a really high quality operational culture.

This is why we (at Honeycomb) are building from the standpoint of (screw) dashboards. A dashboard is a place you go to stop. It’s a place you consume passively. You’re not asking a question, you have no way of getting to raw results. You’re looking at dashboards that are being constructed on aggregates. And every dashboard is like a tombstone of a past failure. A past outage, a past event where you said, “I’m gonna construct a dashboard, and it’s going to tell me exactly how to find this problem the next time.” And it does, but you have new problems all the time. So your past is just littered with dashboards.

People need to start thinking about starting points, about questions that they can ask and iterate on. In a lot of ways this is not new computer science. This is a very BI-type approach to problem solving. Instead of trying to predict what problems you’re going to have and then staring at a dashboard that contains your prediction, which may have been made days, weeks, months or years ago, you could ask questions iteratively. You start with something simple, “And then what about this? And what about that?” And you start following this trail of breadcrumbs that you or your team has laid down.

I’ve seen teams switch from dashboards to these iterative, interactive tools. The secret then is that you become a better engineer. Data-driven debugging is a way of taking small guesses about the universe and testing them, in real time, and it’s so much better than trying to predict what you’re going to need to know months in advance, and then rely on these past predictions.
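To make the contrast with pre-built dashboards concrete, here is a minimal Python sketch of that kind of iterative questioning over raw events. It is an illustration only, not how Honeycomb works internally: the event fields (`app_id`, `endpoint`, `status`, `duration_ms`), the sample data, and the `breakdown` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw request events. Field names and values are made up
# for illustration; a real system would have far more of both.
EVENTS = [
    {"app_id": "a1", "endpoint": "/save", "status": 200, "duration_ms": 40},
    {"app_id": "a1", "endpoint": "/query", "status": 200, "duration_ms": 50},
    {"app_id": "a2", "endpoint": "/query", "status": 200, "duration_ms": 60},
    {"app_id": "a3", "endpoint": "/query", "status": 200, "duration_ms": 900},
    {"app_id": "a3", "endpoint": "/query", "status": 500, "duration_ms": 1100},
    {"app_id": "a2", "endpoint": "/save", "status": 200, "duration_ms": 30},
]


def breakdown(events, key, where=None):
    """Group raw events by `key`; return count and mean duration per group.

    `where` is an optional predicate used to narrow the events first.
    """
    groups = defaultdict(list)
    for event in events:
        if where is None or where(event):
            groups[event[key]].append(event["duration_ms"])
    return {
        value: {"count": len(durations), "mean_ms": sum(durations) / len(durations)}
        for value, durations in groups.items()
    }


# Question 1: which endpoint is slow?
by_endpoint = breakdown(EVENTS, "endpoint")
slowest = max(by_endpoint, key=lambda k: by_endpoint[k]["mean_ms"])

# Question 2: within that endpoint, is it everyone or one app?
by_app = breakdown(EVENTS, "app_id", where=lambda e: e["endpoint"] == slowest)
worst_app = max(by_app, key=lambda k: by_app[k]["mean_ms"])

# Question 3: for that app, do the slow requests line up with errors?
by_status = breakdown(EVENTS, "status", where=lambda e: e["app_id"] == worst_app)

print(slowest, worst_app, by_status)
```

Each question is a small guess tested against the raw events, and each answer decides what to ask next. Because nothing is aggregated ahead of time, a high-cardinality field such as `app_id` is still there to group by when you need it.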

Ingrid: I’ve been thinking about that as well. When something breaks, you have the tendency, after you fix it, to go in and add more checks, more tests and more logging, and more dashboards, to make sure you are never going to run into that problem ever again. Which is never true, it will always happen again. For me it’s about being deliberate enough about what you really need, rather than just jumping into this universe of possibilities, just because they’re available to you.

Charity: It’s a good instinct. We want to make it better for the next person to stumble across this problem. But, we have to get comfortable with failure. We have this very rigid, brittle approach to failure. “It happened! Oh my God! Let’s make it so it can never happen again.” Well, it’s going to happen again, and it’s not going to look exactly the same, so you’re not going to recognize it in the same way.

As a rule of thumb, and something I find really insightful, a friend of mine said, “In the future treat all of your systems like distributed systems, and you’ll be mostly okay.” Whether you think you’re running a distributed system or not, when you think about it, it’s totally true. The computer science that deals with distributed systems is largely around complexity, and our systems are getting way, way more complex all the time. If we’re brittle, if we’re afraid of failing, if we’re trying to make it so that we can never fail, that means that when we do fail, we fail hard and we aren’t used to it. We don’t have the tools to deal with it.

The shift I see really mature teams undergoing is instead of monitoring on everything, instead of paging people on symptoms, instead of being really paranoid about things breaking – don’t get me wrong, your taste for quality can still be high – they just start thinking about when it breaks, what happens? Let’s break it this way, we’re going to surface all this detail, and we’re going to ask questions all the time. And honestly, you need to page yourself a lot less. The vision I see for the future is that the only things you page people on, the only things that you wake people up for during the night, are when your customers are hurting. That means all that you need a paging alert on is the traditional correctness, error rates, request rates and latency. That’s it.
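As a rough illustration of paging only on customer-facing symptoms, the Python sketch below checks a window of traffic against three of the signals Charity lists: request rate, error rate and latency. Everything in it is an assumption made for the example: the `WindowStats` type, the `page_reasons` function, and the threshold values are hypothetical and not taken from any particular tool. Correctness checks are left out because they depend entirely on the product.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Customer-facing measurements for one time window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float


def page_reasons(
    current: WindowStats,
    baseline_requests: int,
    max_error_rate: float = 0.01,    # assumed threshold: 1% of requests failing
    max_p99_ms: float = 1000.0,      # assumed threshold: 1 second at p99
    min_traffic_ratio: float = 0.5,  # assumed threshold: half of normal traffic
) -> list[str]:
    """Return the reasons to page someone; an empty list means nobody is woken."""
    reasons = []
    # A sharp drop in traffic usually means customers cannot reach you.
    if current.requests < baseline_requests * min_traffic_ratio:
        reasons.append("request rate dropped")
    # Guard against dividing by zero when there was no traffic at all.
    if current.requests > 0 and current.errors / current.requests > max_error_rate:
        reasons.append("error rate high")
    if current.p99_latency_ms > max_p99_ms:
        reasons.append("latency high")
    return reasons


healthy = WindowStats(requests=1000, errors=2, p99_latency_ms=300.0)
hurting = WindowStats(requests=1000, errors=50, p99_latency_ms=300.0)

print(page_reasons(healthy, baseline_requests=1000))
print(page_reasons(hurting, baseline_requests=1000))
```

Note what is absent: there are no checks on disk usage, CPU, or individual hosts. Those details can still be surfaced for people to explore while debugging, but in this sketch they never wake anyone up.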

The engineer/manager pendulum

Ingrid: You recently wrote a really passionate blog post about something that you described as the “engineer/manager pendulum”. Can you talk a little bit more about this pendulum? How has it influenced your career and the way you’ve approached switching from a contributor to management?

Charity: I’ve always been a fairly reluctant manager. I love engineering. I find joy in my job when I’m riding my bike home, and I’m replaying the day, “I fixed this, I solved this, I learned this.” My wife will love to hear this, but I’ve never known as much joy. I love that feeling! I don’t feel that way about management. And yet I find myself in management over and over, and this always feels like the right decision. I love joining a company as their first infrastructure hire, taking them through building all the early stuff, and then hiring a team who takes over from me. I love that.

But recently, a couple of my friends seem really unhappy about it. I wrote this blog post for them, because I wanted them to feel okay about their lives. I was surprised it struck as much of a chord as it did.

When I think about the managers that I’ve had, who I really respect, who I really learned from, there aren’t many. The ones that did help me were people who were recently engineers. They still had a very visceral, causal link to their engineering careers. Tech leads, if you aren’t hands on, your technical skills atrophy. They just do. When I was at Facebook I got myself taken off the on-call rotation, finally, and I couldn’t believe how quickly I felt like I needed to step out of the technical discussions, because my information wasn’t up to date and it wasn’t the best information.

The best managers were recently engineers, and the best engineers have always spent some time in management. I also really want to lower the perceived bar for what being a manager is. It’s not this mystical career. It’s not like, “I’m powerful now, I’m a manager!” It’s just a parallel career track. I really believe in paying engineers and managers the same. At Slack, the engineers make more than the managers do. That makes a lot of sense. There’s a leveling mechanism. We should be able to go back and forth, regularly, and it shouldn’t be seen as gaining and losing status.

The managers that did help me were people who were recently engineers.

So many engineers only go into management because they get sick and tired of being left out of the decision making process. That’s usually been my impetus. I’m like, “Screw you! If you’re not going to invite me into the room where these decisions get made, I will make myself a manager so you can’t leave me out.” And that sucks. That is not actually what you want the motivation to be for the people who are your support apparatus, who are supporting the engineers on your team. You want the motivation to be, “I want to deal with people problems.” Otherwise, you get reluctant managers.

You want people to go into management with the right expectation and the right motivation, and that is not the drive for power. There are two ways of becoming a manager. There is becoming a manager, leaving hands on technology behind and climbing the ladder. And then there is, “No, I want to stay close to the tech. I love the tech, I want to go back and forth.” Both of those are legit. But if you want to go up the manager ladder and climb that track, you need to be in a big company, or starting something, because you need there to be a progression. They’re the only ones who can be middle managers. Startups don’t need middle managers, and that’s good. We need executives who come from tech, and who are good at that, and who are out for our best interests. They cannot keep going back to being hands on. What you don’t want to be is someone trying to straddle those two worlds, and being someone who has been a frontline manager for more than five years, not climbing the ranks, while simultaneously becoming very stale in your technical needs.

Bringing choice to technical career progression

Ingrid: So how do we demystify career progressions in tech and create opportunities for engineers that don’t make them feel forced to join the management track because it’s the only option?

Charity: That’s such a great question, and I’m thinking about this right now. I’ve always been good at recruiting. I brought a third of Parse’s engineers in. I think that I’ve always been good at recruiting because I don’t ask people to do the things that they’ve already done.

If I’m going to be completely honest, one of the reasons that I started (Honeycomb) is because I was so insulted by the jobs I was being offered when I left Facebook. Nothing against those companies, they’re great companies, but they were offering me jobs to do the same things that I had already done. I wanted to go on and do bigger and better and new things, and I want to work with people who feel that way too. I want to work with people whose eyes light up at the idea of a challenge. When they’re like, “Oh my, I don’t know if I can do that. That’s terrifying. I’m in!”

I feel like most hiring managers, most people who are trying to build teams, are too conservative about this, because it’s risky. It’s always a sure thing to ask someone to come to your company and do the things that they did at their last three companies. The human potential is vast, and most of us are rarely asked to step up and do bigger things. I have a lot of respect for people who do these big things, and who kind of shoulder muscle their way into bigger things too.

When I’m looking at hiring someone, I always think to myself, “Where would they be in a year?” Managers don’t do this enough. Managers don’t think about, “What can I be giving my people to push them, and drive them.” If you’re a people manager, that’s your one job. If you’re a tech lead, your one job is the product, the technical problem in front of you. And if you’re a manager, your one problem is the people underneath you. The people in my team, how are they growing? How am I pushing them, how am I asking them to step up? We hunger for this. We complain about it, but we hunger for it.

So for an engineer, the caliber of the problems that you give them, how much pre-digestion are you doing? There is a real art to being a tech lead, to giving every person on your team something to do that pushes them, but doesn’t overwhelm them. And then giving them the nudges that they need to do something that is new and hard. A lot of this is not visible. There’s this visibility gap between management and engineers.

I’m sure you’ve had the experience of having a tech lead who gives you exciting problems, who can put themselves in your shoes and just go, “Ingrid’s going to love this, and it’s going to be really hard.”

Ingrid: Yes, it’s the best experience. We’re here to grow. I love learning new things, so I don’t ever want for that to stop.

Charity: We all show up every morning for a combination of something greater than ourselves – making the organization succeed – and something that feeds ourselves, which is pushing ourselves and learning something new. This is a very personal process and individual process, which is why those frontline managers are so key to every organization’s success. A frontline manager who cares about crafting problems for each person that push them, is going to be someone that people are loyal to and don’t want to leave. This, over the long run, is really amazing for the organization. But it’s not a process that you can automate. It’s not a process that can be generalized. There’s always going to be a vast amount of creative work in it.

Ingrid: This makes me think of how much I love being wrong and how much I love learning new things. When I’m wrong, and I can work out why and how to make it better, and I get feedback from my tech leads or my managers, it makes me a better engineer at the end of the day. It’s the reason why I love tech. It’s this balance of, I don’t know everything, and I’m learning constantly. Then I get to really kick-ass.

Charity: Yeah, and having the safety blanket that if you totally fail, it’s fine. People are going to catch you – but not coddle you. It’s a really awesome needle to thread.

Creating a culture of cross-pollination

Ingrid: How do we build a team that is open to the value of diversity, not just culturally, but also to the point where security engineers work together with frontend engineers and operations specialists, and customer support? What does a future built on those values look like?

Charity: I hate the monoculture that arises from a team that sees nobody but each other every day, and works with nobody but each other every day. I’m a big believer in mixing it up. Have everybody sit with everyone. Having my Ops guy, Ben, sit next to my salesperson has resulted in some of the most entertaining and creative ideas, which have been amazing for us.

We like to hire adults at Honeycomb, and that means people who know exactly how hard other disciplines are. You can know that from an academic perspective, and you can know that by trying to walk in their shoes. Trying to do sales for the past year at Honeycomb, oh my God, I know how hard sales is now.

We’ve started looping our sales guy into engineering now. He’s writing code! I love this, I love cross-pollination. This is where we find our most creative ideas. It’s where we come up with the most unexpected questions, and asking unexpected questions is a great way to make everyone tickle that creative part of their brain.

To answer your actual question, there’s the baseline: don’t be a jerk, don’t be Uber, that’s pretty obvious. But often we get so, so hellbent, and we have this paranoia about not achieving our goals, and so we get tunnel vision. As leaders, we act like what we want people to do is not bring their whole selves to work, but come in and just go, heads down, write code all day. Nobody can do that for a long time, and it starves the pieces of you that are human. Ask people to bring their whole selves to work, understand that people cannot do more than four to six hours of hard-thinking work a day, and encourage that.

You’re going to get more of whatever behaviors you praise. We have to be really careful what we praise people for. If you’re praising people like, “John was up all night on this problem, round of applause for John,” don’t do that. That’s what you’re going to get more of. That’s what you’re encouraging. That’s what you’re saying you value. It’s a lot harder to look for the successes you’re therefore not seeing, because the system didn’t go down during a release.

Praise people for taking their vacation. Say things like, “I can see how refreshed you are. And it’s so nice to see you sparkling and energetic. You were looking really tired. Thank you for taking care of yourself,” and be genuine about it. Praise things that you really do want to see more of. The stuff when people are killing themselves, I’m not saying you can’t praise them, but don’t praise them publicly. You can thank them privately, and pair it with something like, “And please take Friday off.”

We have to see each other as human. We have to acknowledge our need for balance, because all of these things are necessary precursors towards a culture of play. A culture of play and collaboration, and willingness, and ability to take a flying leap that may or may not succeed, but you want to try it, and you think it could be amazing. You need the safety to fail sometimes. Encourage healthy work-life balance and try to cross-pollinate your people as much as possible. And then wait for their natural human creativity to jump in and take it that last five miles.

Ingrid: That’s such a great answer, because at the end of the day, we’re all humans, we all want to do a good job and be respected, and be accepted, and feel like we belong.

Thank you so much for chatting with me today, it was great to have you on the show.

Charity: Thank you for having me.
