#22 - How we evolved our cloud infrastructure at HashiCorp | Michael Galloway, Director, Platform Engineering at Mercari

Insights to sit on cloud nine

and

Aug 01, 2024

Welcome to edition #22 of our series of in-depth guides for engineering leaders, by engineering leaders. Today, we invite Michael Galloway, Director of Platform Engineering at Mercari, and ex-HashiCorp / ex-Netflix, to share a case study on how he and his team evolved HashiCorp’s cloud infrastructure. For once, we’re testing a shorter and more straightforward format. Please let us know in the poll after the article if you liked it!

Before we get started, make sure you’re subscribed to Plato’s Newsletter by clicking the button below:

Michael Galloway, 25 years of experience leading engineers

Michael has a wide breadth of leadership experiences, from his startup to companies like Yahoo!, Netflix, Doma, and HashiCorp. In many of his feats, he’s led platform engineering initiatives, defining the architecture and solidifying the cloud infrastructures or deployment workflows. He has a long span of experience building and managing teams of engineers and happily shares his wisdom with Plato mentees since 2017. In 2024, Michael joined the Japanese e-commerce company Mercari to lead the whole platform engineering organization and improve the service for tens of millions of monthly users.

You can follow Michael Galloway on LinkedIn.

How we evolved our cloud infrastructure at HashiCorp

Let’s say you’ve got a big, audacious, cross-functional project to accomplish. How do you make sure it actually gets done?

Based on my work as a platform engineering leader in several companies, I’ve identified three essential ingredients: a clear objective that helps you focus and prioritize, a deadline that pushes you to make the right trade-offs, and executive buy-in that ensures the priority of your work is maintained.

In this article, I’d like to give you some context on the situation at HashiCorp when we faced a major project, share the strategy we put in place, and offer a brief glimpse into the specific tactics that helped us achieve it.

And because I’ve seen this formula work in several companies and situations, I’m confident you can apply these lessons to your organization, too.

Background

I'm Michael Galloway, currently the Director of Platform Engineering at Mercari. The following story is based on my previous role as Director of Platform Infrastructure at HashiCorp, a company known for tools like Terraform, Consul, and Vault. Over the years, HashiCorp invested heavily in its cloud offering, HashiCloud Platform. My organization was responsible for the infrastructure that underpinned that offering.

The challenge at HashiCorp

Initially, HashiCloud's architecture supported enterprise customers with a simple control plane/data plane design for administrative purposes. As teams began adding direct customer-facing and high-security workloads, this design became inadequate.

Our Infrastructure organization tried to make incremental improvements while managing current demands, a classic "fix the engine while flying" scenario.

In fact, in one instance the Infrastructure organization had to turn away a new product because our current infrastructure couldn’t accommodate the level of security that they required. This resulted in the product team deciding to set up novel infrastructure that later became a maintenance burden, as they were not infrastructure experts. Clearly, this was a bad situation for them and the business.

Our Infrastructure organization tried to make incremental improvements while managing current demands, a classic "fix the engine while flying" scenario. However, limited bandwidth made progress slow and mostly tactical.

We needed a full redesign, but how could we secure the necessary time and resources?

The strategy: The 3 key ingredients for big projects

To address this challenge, there were three things we needed:

A clear objective that everyone understood, so we could focus effectively.
Deadlines that would push us to make the right trade-offs.
Executive buy-in, so we could execute with the support of the business.

A clear objective that everyone understands

We needed a specific goal and success criteria that were unambiguous to enable us to focus effectively and give us a way to prioritize our work. On the surface, this sounds simple. But in practice, it can be very challenging to define objectives that everyone in the engineering team can understand in the same way and that also align with outcomes that matter for the business.

“You don’t need people to work more, you need people to work on the right things.”

For an objective to be clear, it should be defined as:

What we are trying to achieve
What will be measured
A clear definition of “done”

Outlining these points ensures they are focusing on the work that matters. As Christina Wodtke wrote in Radical Focus, “You don’t need people to work more, you need people to work on the right things.”

How to design clear objectives

A classic strategy that has worked for me is to start with the end in mind, as Stephen Covey describes in 7 Habits of Highly Effective People. What is the experience or result that you need to see? What are the capabilities that you need to have?

From there, the next step is to turn that definition into a set of criteria that can be demonstrated to consider the goal achieved. For example, how do you know you have that new capability? It is because you can do X within Y time. Finally, set specific targets that define your “done” status. That is, you need to define your “X” and “Y” from the earlier example.

Product management often plays a lead role in this work, but I believe it is still vital to include engineering in this definition work to ensure it’s grounded in technical reality and achievable.

For further reading, I’d recommend Measure What Matters (summary notes), and Radical Focus (summary notes).

A deadline that pushes us to make the right trade-offs

The essence of what a deadline is useful for, described by Adam Savage from Mythbusters.

In addition to setting a clear objective, we also needed a deadline associated with it.

Having a deadline forces you to operate with a critical constraint: time. Using time as a constraint will encourage better prioritization, reduce the risk of trying to boil the ocean, and signal to everyone when to expect things. For peers and within teams, it can also lead to the right trade-off conversations.

Using time as a constraint will encourage better prioritization, reduce the risk of trying to boil the ocean, and signal to everyone when to expect things.

I think Adam Savage of MythBusters said it best: “Deadlines refine the mind. They remove variables like exotic materials and processes that take too long. The closer the deadline, the more likely you’ll start thinking out of the box.”

To me, the single most important thing a deadline does for us is that it creates a sense of urgency. John Kotter talks about this in Leading Change. He highlights that urgency is the first, most important criterion for combating complacency and distractions in an organization. The lack of urgency is a common reason why big projects fail to complete, or even get off the ground in some cases.

How to avoid unrealistic deadlines

After sharing this point in my presentation at Elevate, I was asked: “What if you pick a deadline that is unrealistic?” It’s a great question. Nothing is worse than an unrealistic deadline.

In my experience, the key to picking the right deadline is to involve your team(s) in setting it. Generally, the process teams break big projects into smaller problems and leverage historical examples as proxies. There is almost always a buffer factor that is needed to consider unknowns. Tasks can then be ordered by risk to reduce the unknowns earlier in the cycle, and therefore increase your confidence as you execute.

Deadlines need to be realistic, but ambitious, and should be dates you are not afraid to advertise. By deeply involving your teams, it ensures both buy-in and accountability, which are crucial for a deadline to work as a motivating factor, rather than a demotivating one.

Executive buy-in that allowed us to execute with the support of the business

Executive buy-in is perhaps the most critical ingredient for success. To quote John Kotter, “Major change is often said to be impossible unless the head of the organization is an active supporter.” Leadership buy-in enables you to say “no” to competing priorities and supports your requests to other teams when there are deadlines you need them to meet.

Leadership buy-in enables you to say “no” to competing priorities and supports your requests to other teams when there are deadlines you need them to meet.

A case in point: In a previous role, we needed to migrate our fleet from Heroku over to Azure. We got commitments from teams nine months before the actual move needed to happen. When it came down to the quarter that the move needed to happen, lots of other priorities popped up on their roadmaps. It was only because we had executive support that we were able to push the project forward and successfully achieve the migration.

How do you get executive buy-in?

You need to start with understanding the objectives for the business, and more strategically, the specific objectives that executives care about. Then, identify ways that your project can further those objectives. Work with your leadership to tell that story, and if possible, meet directly with key executives to create the link. If the change you want to make is big enough, you’ll need this support. Especially when it comes time to say “no” or push other teams for asks.

Margit Mansfield wrote a good article on strategies for getting executive buy-in on large projects.

In the Platform Engineering space, it can be especially hard to get upper leadership buy-in for re-architecture projects. It is often the case that platform work is seen as a cost center, rather than a revenue enabler. This is one of the many reasons why getting very good at understanding the business objectives, and how platform work fits into it, is so vital for a Platform Engineering leader to learn.

Case study: HashiCorp's Western Pioneer project

Our Western Pioneer project at HashiCorp, to stand up a new region in the US West for our servers.

A quick reminder of the situation at HashiCorp that led to this big project:

We were stuck.
We had a design that wasn’t scaling and the business was getting bogged down.
We tried replacing pieces of the plane, but there was not enough time, because it wasn’t a clear priority to do so. This meant we had a lot of partial efforts and half starts, something we’d come to think of as a “garden full of weeds.”

In early 2022, the business said we needed to set up a new region in the US West to use as a disaster recovery region. And shortly after that, we would need to implement new regions in the EU—all within 12 months.

This was our opportunity… we had a strategic objective that the business needed us to achieve. This was our chance to connect our project (redesigning our infrastructure) to a strategy that the executives cared about.

We established a project called “Western Pioneer” and after an initial bootstrapping phase to PoC some early ideas, we set a target of having a second region stood up that could support failovers of basic services within six months, and a European footprint to follow in the six months after. This plan included our strategy of starting a fresh, new design that would scale in the US West region, and then reuse that for Europe.

Instead of changing the engine while flying, we were going to build a new airplane.

This gave us all three ingredients:

Clear objectives - Stand up regions in the US West and the EU.
Deadlines - We had 12 months to complete the work.
Executive buy-in - Disaster recovery and the ability to sell in the EU were two major business goals. Our redesign project was now critical as without it we couldn’t manage the complexity across all the regions.

Execution: How we went about it

We prioritized the project at the top of our plans.

With a clear target in mind, we assembled the key team to execute the project and ensured this effort was funded. These were folks who already knew the problems well, and so quickly began implementing decomposition strategies to break apart our existing infrastructure into composable components, allowing us to define a new region flexibly. Before this project, they did not have a clear goal for a decomposition effort, but with the need to create a new region, the purpose was clear.

Other members of the organization, not specifically dedicated to the architectural work, built new tooling to support automating the testing of our newly versioned components. This was a big leap forward for the organization and would open up several innovation possibilities, like blue-green deployments for our infrastructure. It was also another example of where having a clear objective helped align multiple work streams that may have otherwise focused on separate use cases.

We consistently shared our progress as we executed and reported against our timelines.

Along the way, we had to make some tough decisions and trade-offs. How much should we automate? Can we backport changes to the US East? Can we support different database vendors?

Having a deadline pushed us to be pragmatic and to focus on delivering rather than perfecting.

This is where deadlines became critical for us. As we executed, we had many discussions on what was in scope and what needed to be postponed. Like most projects, we had an imperfect plan that relied on us having to make decisions after we understood the work better. Having a deadline pushed us to be pragmatic and to focus on delivering rather than perfecting. Reporting out in our monthly newsletters also helped keep us accountable, as we publicly communicated our progress against our goals.

We broadcasted our successes to the VP.

To keep the momentum at all levels, we broadcast our wins. As the executives could see progress that directly moved the business forward, towards goals that they cared about, it was easier to make the case for the resources that were required and to push back on additional requests that we could not easily take on. This was a significant difference from where we were in the past.

Our Cloud Platform VP especially took an active interest in our project and would continue to share the progress with the rest of the executive staff. I believe this had the added benefit of not only reinforcing the executive support that we needed, but it gave us an added sense of responsibility to make sure we delivered on time and as expected.

Results

From a garden full of weeds to a magnificent, clean park!

We hit our six-month goal of getting a second region in place, and then finally landing a footprint in the EU in the four months after that. We delivered on our main goals two months ahead of schedule and substantially upgraded our ability to scale and operate our infrastructure for the business.

Conclusion

Clear objectives, deadlines, and executive buy-in were crucial to our success at HashiCorp. These elements ensured consistency, urgency, and support, allowing us to deliver ahead of schedule. In my experience at Doma and now at Mercari, this combination of ingredients has consistently led to successful projects. Without any of these elements, projects are likely to fail.

For further reading, please check out the following:

Leading Change: Why Transformation Efforts Fail - John Kotter
How Big Things Get Done - Bent Flyvbjerg and Dan Gardner

And that’s a wrap! Let’s thank Michael Galloway for his wisdom. And before you go, please let us know what you thought about the article’s length!

Cheers,

Quang, JB, and the team at Plato & Coda.

Plato Community

Discussion about this post