#4 - Unpacking Okta's Tech Architecture | Monica Bajaj - VP Developer Experience & Mark T. Voelker - VP Architecture at Okta
Whatever good things we build, end up building us.
👋 Hey, Quang here! Welcome to the fourth article in our series of in-depth, actionable guides, where we share the wisdom of engineering leaders we’ve met in the past eight years while building Plato. If you’re new here, subscribe to never miss a post!
For this article, we invited Mark T. Voelker and Monica Bajaj from Okta, to share their take on their tech architecture. Let’s start with some background👇
Monica Bajaj, VP of Engineering, Developer Experience & Mark T. Voelker, VP of Architecture and Chief Architect at Okta
I think I’ve known Monica Bajaj for more than 7 years now, ever since she joined as one of the first mentors at Plato. She’s always been an advocate for equality in tech and has recently received the Women In Tech ERG Leader of the Year silver award, alongside many other trophies over the years.
Hundreds of calls and happy mentees later, she’s still here, sharing her wisdom on stage at Elevate 2023, with her colleague Mark Voelker!
At Okta, Monica is the Vice President of Engineering, Developer Experience for Customer Identity Cloud. In her role, she focuses on building a developer-first, frictionless, extensible identity platform that is highly scalable and secure, catering to billions of monthly logins for customers and partners.
Mark T. Voelker, as VP of Architecture and Chief Architect at Okta, plays a pivotal role in scaling the company’s different business units, guaranteeing a seamless and optimized evolution, both in terms of infrastructure and people.
Okta is the front door of many applications: its digital identity service includes single sign-on and multi-factor authentication (MFA), so any interruption could impact thousands of external services. Over the last 10 years, Okta has had to adapt to new usages, such as passwordless authentication and the need for reinforced security, while scaling its tech.
In a multi-cloud environment with several different applications, multiple countries to serve and multiple laws to comply with, architecture became one of the most important problems to tackle. As the company grows, complexity increases and the right answer to these new challenges demands both technical acumen and great leadership.
Monica & Mark took the stage at Elevate to share their battle-tested advice last November, and what follows summarizes their insightful talk.
You can follow Monica Bajaj and Mark T. Voelker on LinkedIn.
How we do Architecture at Okta
The Architecture Charter
We don’t know if you’re into architecture, but if you compare the works of Frank Gehry and the achievements of Le Corbusier, they are nothing alike. Both marked an era, but their philosophies, inspirations, means of working, and influences are quite different. The same applies to tech architecture!
Some companies have a dedicated architecture team. Sometimes, architects are scattered throughout different units. And some other organizations have well-defined areas of responsibility, reporting to a sole executive.
At Okta, the architecture charter has a dual purpose:
Level up the tech stack
Level up the people stack
While most companies solely focus on scaling the tech, Okta believes that technical excellence and a strong people organization are intertwined.
Let’s try to understand why they chose this uncommon, dual-pronged charter:
Leveling up the tech stack
Okta is in the login business. As it handles billions of logins per month, it becomes the primary gateway for thousands of applications, be it medical records, government services, payment services, or Fortune 50 giants.
With so many external services relying on Okta as a SaaS provider, it is incumbent on the company to build software that is resilient, scalable, and efficient across hundreds of environments, and at a global level.
But a system is not the sum of its components. Mark Voelker framed it perfectly:
“It’s a car that takes us places, not a gear that spins reliably at various speeds and is cheap to make.”
At a scale like Okta’s, you no longer have the luxury to think in terms of individual components. The responsibility of the architecture team is thus to ensure that the interactions between all the components are seamless and that they’re building a system.
“Architecture is the study of how the components [of a system] interact.”
- Mark T. Voelker
Components taken individually are important and must be optimized for scaling and efficiency, but each is only a cog in the bigger machine. If the interactions between them are broken, the system will fail, no matter how good each component is.
Leveling up the people stack
Now, when most people think about components, they’ll list applications, APIs, tech infrastructure, load balancers, DNS servers, databases, logs, and so on.
But as Mark would say:
“There’s one component present in all of your services that you can’t forget: the people behind them.”
We think about software in a deterministic way: you give it an input, and it gives you the same output every time. But there’s one variable we often overlook: people. And particularly in a SaaS provider where the people producing the software are also operating it, their outputs can vary greatly:
“The output that you get from your engineers that are building, operating, and maintaining services changes literally depending on what side of the bed they got up from in the morning, and whether the coffee was strong enough.”
Humans are the wildcard component of the system: new people join the organization, others change roles, and some leave. The organization in itself evolves, and depending on how well it was constructed, people’s reactions can vary greatly. Building a robust long-lasting architecture is thus as much about scaling tech as it is about leveling up the people stack. Both are crucial components of a bigger system, and their interactions must be thoughtfully studied and optimized.
That was it for the theory. Let’s now dig deeper into how Okta levels up both its tech and people stacks.
The Nuts and Bolts of Okta’s Architecture
The architecture charter is like a manifesto that sets the broader principles of a smooth architecture organization. Here is how Okta uses it in practice:
Getting your hands dirty
When designing your architectural organization, the first question to answer is: what defines a successful architect?
Key aspects of an architect’s role include understanding the current state of the system, how it got to that state, where it is headed, and what the constraints are.
And to understand things perfectly, architects must work very closely with teams. For Mark Voelker, successful architects “have their sleeves rolled up, and their hands dirty”.
When we think about construction architects, we often imagine solitary people locked in their ivory towers, drawing unseen concepts and selling a new vision. But the truth is, it would be difficult for an architect to create something sustainable if they weren’t - at least a little - grounded in reality.
The same applies at Okta: architects work side by side with engineering teams. They’re embedded within those teams, go to staff meetings, understand the code that lies within a feature, and work hand in hand with project owners from the conception phase to the release in production.
This path isn’t the most common among organizations, but it gives architects some crucial knowledge about the project’s context. What is the current state of the art? How did we get there? Where are we trying to go? Which constraints do we have to work with?
“The best way to help architects get context isn’t to have a project where somebody comes and presents for two hours and goes through 40 slides…and is expected to make a good decision out of that. You get a lot more context if you’re working hand-in-hand with those teams, day-to-day.”
- Mark T. Voelker
Here is what the organization looks like at Okta. First of all, there is a core Architecture team, led by Mark, which focuses on the bigger-picture themes, such as resilience, scale, cost, or operability. The architects have an overview of the whole engineering organization and look at all the components and how they interact.
The architecture team doesn’t “own” particular codebases or services: the charter involves delivering on themes that span across the whole codebase and organization, to be more aligned with overall company or business-unit goals. For example, when Okta defined resilience as one of the major focus areas for the architecture team, it involved working with multiple teams to improve dependency handling and develop graceful degradation strategies, while also looking at data quality and trends from incident reports. Scaling strategies brought a 6x improvement in request-per-second capacity in around 6 months, while the cost-efficiency focus helped bring millions of dollars in cost reductions.
One important aspect to note is that those themes weren’t evolving in their own, separate “swim lanes”: as architects had the big picture, they could optimize for several of those focuses at a time.
Architects from Mark’s team are regularly rotated in different domains and projects, to ramp up more quickly on a broader set of issues and topics and get a broader context about the whole system.
In addition, each business unit has its domain architect: these employees do not report directly to Mark but to their domain leaders instead. They are experts in their domain and have a bit of insight into the broader picture, which helps all domain leaders understand the possible consequences of a decision they make on parts of the system they don’t work on.
Having architects work closely with teams helps them grasp technical issues and how people react to them: what can excite them, what blocks or frustrates them, etc.
Both domain architects and members of the overarching architecture team share a vision and common goals. Rather than having different agendas, all of the company’s architects align on the endgame.
The separation of responsibilities then varies from team to team, depending on the current needs. This goes back to the concept of leveling up the people stack: a seasoned team that knows what needs to be done may just need help choosing one of the different options it came up with. Some other teams may need help writing the code. Others may need a kickstart and will ask for help designing POCs, running tests, or conducting research.
Responsibilities are also naturally shared depending on context: teams are usually concerned with the piece of the system they own - they are deeply involved with the internals of that specific piece (service, components, dependencies, etc.) but may not have broader context on the rest of the system. This is where embedded architects can share their global knowledge.
There’s another positive side-effect to the embedded model. Okta believes it is crucial to foster less "formal" forums for discussion that have a really low barrier of entry. It gets ideas into circulation, even just informally. Having an architect in your team's Slack channel and team meetings means you can throw out an idea or a strawman and iterate on it. The architect you're discussing it with probably already has context on what's driving it. To that end, Okta also has offsites for their tech leads and architects to get together face to face to breed that familiarity and level of comfort for having ad hoc discussions, and Slack channels where tech leads and architects can converse together on big-picture items.
This organization tackles both goals of Okta’s architecture charter: leveling up tech, and leveling up people.
Setting Guardrails and Handholds
To avoid coming off the road, the architecture team sets guardrails and handholds. These help engineers make enlightened decisions by themselves.
The Engineering Radar is a document that describes, on the one hand, the technologies used by Okta: their use, their current status, speed, complexity, impact, and scope. On the other hand, it also includes practices and principles Okta believes are ready for production use, experimentation…or deprecation. Picture this common case: an engineer is pushing for a certain technology they like or think will be useful for Okta - but the DevOps team hasn’t created the right environment for that technology to be used yet. Or another new shiny language that isn’t mature or presents security flaws. The Engineering Radar acts as a guardrail to avoid unnecessary debates:
“The Engineering Radar helps us focus as an organization around tech stack and approach to our code that keeps it from looking like a pile of odd parts. You don’t have to wonder what message queue you should use, how you should store configuration, or what the golden rules for how to use a database are: it’s in the radar, and you’re introduced to it as soon as you join the company.”
The principles, technologies, and practices in this radar are the north stars for Okta’s engineering day-to-day projects and longer-term vision. The goal is to make decisions simpler and avoid analysis paralysis, all while staying aligned with business strategies.
Mark and Monica recognize there is no “silver bullet” for solving every single problem, but that does not prevent them from giving opinionated guidance for software development. The radar focuses on high-level topics.
Okta’s radar is divided into 3 top-level sections:
Principles (e.g. software design, coding practices, operations, etc),
Technologies (e.g. databases, messaging, testing, observability, infrastructure, storage, secret management, languages, and frameworks, etc),
Practices (e.g. golden rules for using our various databases, logging patterns, API guidelines, guidance on retries and failure domains, etc).
Within each are some subcategories, and within those we list specific items and their current state along with a very brief, easy-to-read description. For example: in the Technologies section, they have an entry for PostgreSQL. Its current state is "adopt" and they have notes about specific cloud provider implementations they use, what drivers they use for various languages, how they handle migrations, and links to useful tools like documentation and schema tools.
The radar is used by all of the Engineering team, and getting familiar with it is part of their new hire onboarding program. Anyone can suggest updates or additions, and these are reviewed by the Architecture team. Okta often involves tech leads and/or engineering leaders when making state transitions (for example, when moving technology from "adopt" to "hold" or from "trial" to "adopt").
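To make the shape of a radar entry more tangible, here is a minimal Python sketch. It is an illustration only, not Okta's actual tooling: the class and field names are invented, the description text is a placeholder, and the transition table is an assumption (the article only mentions "trial" to "adopt" and "adopt" to "hold" as examples).

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    TRIAL = "trial"
    ADOPT = "adopt"
    HOLD = "hold"


# Assumed transition rules; only trial -> adopt and adopt -> hold
# are mentioned in the article, the rest is illustrative.
ALLOWED_TRANSITIONS = {
    State.TRIAL: {State.ADOPT, State.HOLD},
    State.ADOPT: {State.HOLD},
    State.HOLD: set(),
}


@dataclass
class RadarEntry:
    section: str       # "Principles", "Technologies" or "Practices"
    subcategory: str   # e.g. "databases"
    name: str
    state: State
    description: str   # very brief, easy-to-read summary
    notes: list[str] = field(default_factory=list)

    def transition(self, new_state: State, reviewed: bool) -> None:
        """Change state, but only after the proposed change was reviewed."""
        if not reviewed:
            raise PermissionError("updates are reviewed by the Architecture team")
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state


postgres = RadarEntry(
    section="Technologies",
    subcategory="databases",
    name="PostgreSQL",
    state=State.ADOPT,
    description="placeholder description",
    notes=["cloud provider implementations", "drivers per language", "migrations"],
)
```

Keeping the state machine explicit mirrors the idea that moving an item between states is a deliberate, reviewed decision rather than a free-form edit.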
Besides the engineering radar, Okta holds periodic Architecture Roundtables: architects and engineers from diverse units get together to discuss complex topics or try to tackle difficulties in current projects.
This roundtable is a council of architects but is NOT an approval body. Remember: an architect is already embedded inside teams, and the prep work has already been done. People join in to get feedback, different perspectives, or just a bigger picture of how their work can impact other teams or pieces of infrastructure so that they make enlightened decisions by themselves.
One last thing is to leave room for POCs and experiments, which complement the guardrails and handholds by giving people a safe environment to test, learn, and share their feedback. Those POCs are usually closely linked to the rest of the process: if an experiment shows that a piece of technology is well-suited, they’ll update the engineering radar accordingly or discuss the learnings during the architecture roundtable.
Yet, implementing these guardrails would be a vain effort without another practice: writing things down.
Write. It. Down.
Auth0 (Okta’s identity unit’s former name) was created more than 10 years ago, in 2013. It has been built by hundreds of engineers from all over the world. Some of them are no longer at the company, while new engineers joined in the meantime. It’s thus vitally important to write down the decisions that were made in the past and got Okta where it is today, for current and future engineers to understand why things are built the way they are.
Writing things down is so ingrained in our Engineering culture that it’s literally printed on our architecture team t-shirts.
(Note by Quang: I need one of those t-shirts 🤩)
Some techniques used by Okta to ensure good documentation include Requests for Comments (RFC), Requests for Discussion (RFD), and RAPID decision-making. We could write an extensive post just on those 3 frameworks, but resources about them are already plentiful. We’ll thus be brief and share useful links at the end of this post if you want to learn more.
Requests for Comments are a formal process for proposing, reviewing, and refining engineering ideas, making sure each of them is collaboratively improved and generates consensus.
Requests for Discussion are more informal documents where people can write down their thoughts, ideas, or suggestions for enhancing the system in a less polished (and sometimes, more philosophical!) way, to spark an in-depth technical discussion. Those discussions will sometimes converge into particular decisions.
RAPID decision making: as coined by Bain & Company, RAPID stands for Recommend, Agree, Perform, Input, Decide. This framework helps identify the role of each stakeholder in a particular decision. The team at Okta uses RAPID every time they need to make a technical or tactical decision that involves more than just a small number of people. When the decision impacts or needs awareness from multiple stakeholders, RAPID is thus the way to go. Here is a quick summary of how RAPID is articulated:
Recommend: who defines the first recommendation for a specific decision (based on inputs or not)
Agree: who makes sure the recommendation takes some specific requirements into account
Perform: who carries out the decision, and who is accountable for its implementation
Input: who can provide expertise (and which input they bring)
Decide: who makes the decision and commits the organization to action
DACI decision-making process: another decision-making process standing for Driver, Approver, Contributor, Informed, sometimes used as an alternative.
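One lightweight way to "write down" a RAPID decision is a small record that names who holds each role. The Python sketch below is purely illustrative: the class, the field names, and the example people are hypothetical, and nothing in the article says Okta stores decisions this way.

```python
from dataclasses import dataclass, field


@dataclass
class RapidDecision:
    title: str
    recommend: str                                    # drafts the recommendation
    decide: str                                       # makes the final call
    agree: list[str] = field(default_factory=list)    # check requirements are met
    perform: list[str] = field(default_factory=list)  # implement the decision
    inputs: list[str] = field(default_factory=list)   # provide expertise

    def missing_roles(self) -> list[str]:
        """Return the RAPID roles that nobody has been assigned to yet."""
        roles = {
            "recommend": self.recommend,
            "agree": self.agree,
            "perform": self.perform,
            "inputs": self.inputs,
            "decide": self.decide,
        }
        return [name for name, holder in roles.items() if not holder]


decision = RapidDecision(
    title="Change the routing strategy",   # hypothetical decision
    recommend="domain architect",
    decide="engineering lead",
    perform=["platform team"],
)
print(decision.missing_roles())  # ['agree', 'inputs']
```

A check like `missing_roles` makes gaps visible before the discussion starts, which is the point of naming roles up front.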
Next to these frameworks, actions are documented in how-to’s and scratchpads, for all team members to understand and quickly replicate battle-tested practices.
“We have scratch pads that other teams can actually learn from and say, oh, I didn’t realize that’s how you go about troubleshooting…”
When you don’t know where to start, you start by looking at somebody else’s work - and even if it didn’t work, that at least tells you which path not to choose.
Writing things down (and reading what’s written down!) helps engineers, rookies or veterans, make enlightened moves and understand the ins and outs of past decisions.
Scaling Architecture in Practice: a Case Study about Okta’s Extensibility Platform
We approached Okta’s Architecture Charter and the Nuts and Bolts of their architecture with Mark Voelker. Let’s now bring these down to earth through a concrete case study with Monica Bajaj, VP of Engineering, Developer Experience, using a STAR approach (situation, task, action, and results).
Situation
Okta's Identity product offers an authentication flow. Given that customers have distinct needs, they often seek to adapt the product to address their specific challenges. Hence, Okta developed a product called Extensibility, that provides…extensibility points to the authentication flow, ensuring a personalized solution for every use case.
With Extensibility, developers can enrich or extend the authentication flow by adding intermediate steps with their custom JavaScript code. Every interaction happening within those extensibility points goes through an engine. This engine (let’s call it the “Extensibility engine”) fuels every interaction and has to be flexible, secure, resilient, scalable, and cost-efficient.
Okta’s main challenge was to scale the Extensibility platform, designing for 10x while building for 3x, to make sure that they could meet the needs of their customers further in the future. So, how did they do this?
Task
This is where the right balance between the Tech Stack and the People Stack comes into play… with appropriate guardrails and handholds!
Building from scratch can be enticing, but the reality is that in mature organizations, most of the codebase is already legacy or well-advanced, and touched by many hands at a time. When Monica’s team decided to take on the challenge, the first task was to understand the current state of the system.
The runtime of the Extensibility engine had been in the product for years - and as with most years-old codebases, it had naturally accumulated its fair share of changes and technical debt over time. As the business grew, traffic volume, reliability, and cost-efficiency needs became crucial topics for Okta. It became clear that this part of the service offering had to evolve to better address those needs.
However, the feature itself has multiple moving parts: there are several services, network connections, containers, or data stores involved.
The challenge was to make sure the team optimized for cost-efficiency, resilience, and scale all at once. The question hence became: “Where do we start?”
Action
That’s where the architects rolled up their sleeves: they worked closely with the domain architect, the tech lead, and the team’s engineers to understand the overall system.
Documentation was written to list the most important questions, such as:
The current status of the system and organization
Bottlenecks that prevented further scaling
The main KPIs that needed to be measured
Cost limits in terms of infrastructure
Possible solutions
etc.
Once all of these elements were identified, Okta formed a tiger team of subject matter experts (SMEs) comprising a Senior Architect along with Quality, Platform, and Site Reliability domain experts. The architect was embedded within the domain to solve this specific problem.
To do this, the team had to set the guardrails. Usually, when executing a certain program, you will have to deal with 3 components: Scope, Quality, and Time. While a majority of engineering leaders would battle to pick 2 out of 3, Okta could not compromise on any of those elements.
They thus decided not to reduce the scope of their project and kept all three of their goals: scale, resiliency, and cost-efficiency… while optimizing on time as well.
Instead of looking at those three elements as individual, siloed objectives, Okta imagined them as a Venn Diagram. It was a useful lens: it became clear to the team that the starting point was prioritizing what they could execute in the “center”.
Keeping those guardrails and teams of SMEs, the next step was to go through decision-making processes (DACI or RAPID), adhering to the architectural principles cataloged in Okta’s Engineering Radar, and starting to build POCs. All of this, while keeping a cultural penchant for writing things down, to share learnings and acquire feedback.
While Scale and Resiliency were mostly top-of-mind for engineers, Cost-efficiency was closely monitored by the FP&A (Financial Planning and Analysis) lead, to keep the team aware of costs and overall utilization.
The first immediate milestone started with this first POC.
As you can see in the above figure (1.1), there are several EC2 instances running hundreds of Docker containers. Each container is associated with one tenant (customer). Each instance also has an engine proxy, responsible for managing the life of these containers, and making sure it routes the right tasks to the right containers. It uses a round-robin algorithm to route those requests. There is also a load balancer, that helps distribute the load to the different instances.
The challenge arises when the system starts to scale as traffic increases: the machine can only run a finite number of containers and cannot scale horizontally without becoming particularly expensive.
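To make the scaling problem concrete, here is a minimal Python sketch. It assumes, since the article does not spell this out, that requests are spread round-robin across instances and that an instance starts a container for a tenant the first time it receives a request for that tenant. Tenant and instance names are made up.

```python
from itertools import cycle


def simulate_round_robin(tenants, instances, rounds):
    """Return {instance: set of tenants that ended up with a container there}."""
    containers = {instance: set() for instance in instances}
    next_instance = cycle(instances)          # round-robin over instances
    for _ in range(rounds):
        for tenant in tenants:
            instance = next(next_instance)
            # Assumption: the engine proxy starts a container on first use.
            containers[instance].add(tenant)
    return containers


placement = simulate_round_robin(
    tenants=["t1", "t2", "t3", "t4"],         # hypothetical tenants
    instances=["A", "B", "C"],                # hypothetical EC2 instances
    rounds=3,
)
total = sum(len(tenants) for tenants in placement.values())
print(total)  # 12 containers for only 4 tenants
```

Under these assumptions every tenant eventually gets a container on every instance, so container count grows with tenants times instances, which is why adding machines becomes expensive.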
Keeping the main guardrail in mind at all times (i.e., optimizing for Cost, Scale, and Resiliency together), the next step was to bring scale and cost efficiency into the mix. To do so, Okta added a new component: a Webtask router (see fig. 1.2) sitting in front of the load balancer, which kept a tally of which tenant goes to which instance, avoiding tenant duplication and bringing cost-efficiency. But this new component added resiliency issues: with only one replica of the Webtask router, if the component went down, the entire system failed.
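The "tally" idea can be sketched in a few lines of Python. This is a simplified illustration, not the real Webtask router: the class name, the least-loaded placement rule, and the tenant and instance names are all assumptions.

```python
class WebtaskRouter:
    """Remembers which instance each tenant was first sent to."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.assignments = {}                 # tenant -> instance (the tally)

    def route(self, tenant):
        if tenant not in self.assignments:
            # Assumed placement rule: pick the instance with the fewest tenants.
            load = {instance: 0 for instance in self.instances}
            for instance in self.assignments.values():
                load[instance] += 1
            self.assignments[tenant] = min(self.instances, key=load.get)
        return self.assignments[tenant]


router = WebtaskRouter(["A", "B", "C"])
placements = {
    (tenant, router.route(tenant))
    for _ in range(3)
    for tenant in ["t1", "t2", "t3", "t4"]
}
print(len(placements))  # 4: one container per tenant
```

Because all routing state lives in one object, losing that object loses the whole mapping, which is the single point of failure described above.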
In the above implementation, Monica’s team was able to meet Scale and Cost-efficiency goals, but not resiliency. The topic was thus discussed during the architecture roundtable: the foundation of the roundtable is built on trust and focus, which helped the team share some context, explain their constraints, and discuss possible solutions.
One possible solution was to add another Webtask router (See Fig. 1.3).
After preliminary tests though, new problems arose: each Webtask router has its “own view of the world”. For instance, the green router will route Tenant 1 to Instance A, while the orange router will route the same tenant to Instance B. This causes even more duplication, which nixes the cost-efficiency found in the previous solution. So, resiliency is fixed, but scalability and cost-efficiency are suboptimized.
To solve this newly-created issue, Okta added another component: an in-memory database holding the routing state of Webtask routers, to create a unified view of all tenants (fig 1.4).
The in-memory database solved the duplication issue that was caused by the previous iteration, but it brought more scalability issues: the DB has to quickly sync with the state - which may cause delays during high-traffic times. And beyond the fact scale wasn’t optimized yet, the system was becoming complex.
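The difference between routers with private state and routers sharing one store can be shown with a toy Python model. A plain dictionary stands in for the in-memory database, and the placement rule and all names are hypothetical; a real shared store would also add network latency and synchronization delays, which this sketch ignores.

```python
class Router:
    def __init__(self, name, instances, table):
        self.name = name
        self.instances = list(instances)      # this router's preferred order
        self.table = table                    # tenant -> instance

    def route(self, tenant):
        if tenant not in self.table:
            # Assumed rule: take the next instance in this router's own order.
            index = len(self.table) % len(self.instances)
            self.table[tenant] = self.instances[index]
        return self.table[tenant]


def containers_needed(routers, tenants):
    """Count distinct (tenant, instance) pairs when every router sees every tenant."""
    placements = set()
    for tenant in tenants:
        for router in routers:
            placements.add((tenant, router.route(tenant)))
    return len(placements)


tenants = ["t1", "t2", "t3"]

# Each router keeps its own view of the world.
private = [
    Router("green", ["A", "B", "C"], {}),
    Router("orange", ["B", "C", "A"], {}),
]
print(containers_needed(private, tenants))  # 6: every tenant is duplicated

# Both routers read and write the same store.
store = {}
shared = [
    Router("green", ["A", "B", "C"], store),
    Router("orange", ["B", "C", "A"], store),
]
print(containers_needed(shared, tenants))   # 3: one container per tenant
```

The shared store removes the duplication, but every routing decision now depends on that store being fast and up to date, which is the new scalability concern.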
One of Okta’s first principles is to keep the design simple, scalable, and easy to maintain. They thus relaxed their assumption that a single tenant is always routed to one specific container, and accepted that, for some period, two or more containers could be used for a given tenant, as long as there was only a marginal impact on cost.
Okta also moved from a round-robin algorithm to using consistent hashing on the front load balancer (fig 1.5). Consistent hashing is a distributed hashing technique that minimizes rehashing when nodes are added or removed. Keys and nodes are both assigned positions on a conceptual “infinite ring”, and each key is served by the next node along the ring, which helps keep the distribution of keys balanced. This enhances the overall scalability while minimizing disruptions in data distribution.
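For readers who have not seen it before, here is a compact Python sketch of a consistent hash ring. It is a generic textbook version, not Okta's load balancer configuration: the MD5-based hash, the use of 100 virtual nodes per instance, and the tenant and instance names are all choices made for this example.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a stable position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes                  # virtual nodes smooth the balance
        self._ring = []                       # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        index = bisect.bisect_left(self._ring, (_position(key),))
        if index == len(self._ring):          # wrap around the ring
            index = 0
        return self._ring[index][1]


ring = HashRing(["A", "B", "C"])
tenants = [f"tenant-{i}" for i in range(1000)]
before = {t: ring.lookup(t) for t in tenants}

ring.add("D")
after = {t: ring.lookup(t) for t in tenants}
moved = [t for t in tenants if before[t] != after[t]]
# Every tenant that moved now lives on the new node; the others stay put.
print(all(after[t] == "D" for t in moved))  # True
```

Two properties matter for the case study. Any router that knows the same list of instances computes the same answer, so no shared routing database is needed. And when an instance is added or removed, only a fraction of tenants change instance, which fits the relaxed assumption that a tenant may briefly run in more than one container.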
After multiple iterations, Okta was able to find an elegant solution that met all three of its goals: Scale, Resiliency, and Cost-Efficiency.
Results
This case study showcased the process used by Okta to successfully improve its Extensibility subsystem. It should be pointed out that the company applies this architectural philosophy across the whole engineering organization… and it bears fruit!
In just 6 months, Okta managed to increase their RPS 6.5x, which was more than originally planned.
Since then, the overall capacity has scaled to 20,000 RPS, thanks to a release that took future growth into account. All of this was possible, despite the short amount of time and multiple subsystems involved, because of Okta’s focus on big-picture objectives in addition to the short-term business needs.
Its approach to architecture meant that when the urgent need to scale arose, the team already had familiarity and rapport, a toolbelt full of principles and practices, an existing context that didn’t have to be re-explained, and a framework for figuring out what to focus on first.
While the overall goal was to release a successful solution to a technical challenge, it was also to make it sustainable people-wise. One of the first principles to keep in mind is that for such challenges, there is no turnkey solution. It usually takes some trial and error before finding the right one, and that’s exactly how Okta envisioned it! During this entire process of iteration to reach the final solution, the behaviors described above played an important role.
Conclusion
The organization defined by Okta helped grease the skids and ship faster: with a seamless collaboration, the team was able to focus on strategies that met multiple business goals while also growing their own skill for the next big project. Another side-effect was that it created mutual recognition and advocacy among different parts of the hierarchy: architects, leads, engineers, and product owners were all in the same boat and contributed together to the project’s success.
The long-term outcomes of this engagement are significant and game-changing. Here are some of the benefits:
Technical Excellence: having architects embedded in the domains strengthens the posture and partnership with domain architects, scaling architecture thinking within teams so it comes full circle in terms of best practices, technical radar, and long-term strategies. The outcome is a technology roadmap that outlines the product’s underlying technology while preparing for future scale and new trends.
Innovation and Adaptability: it fosters a proactive rather than reactive culture of innovation, and allows the teams to explore new ideas and technologies at all times. This also allows Okta to pivot when it comes to adopting new technologies and methodologies, and be ahead of the curve at all times!
Risk Mitigation: the responsibility is shared between all stakeholders, without ownership being borne by a sole team or person. Implementing proper development and design strategies is crucial for recognizing and addressing risks effectively.
Leadership: Okta will be able to create the next generation of architects, sparking the interest of current engineers with its current process; all of this, coupled with succession planning and skill development.
So remember: if you want to improve your architecture, you’ll need to level up both the tech stack and the people stack, collaborate (not give and receive orders), make sure things are written down…and roll up your sleeves!
A few external resources
More about written practices:
RAPID Decision Making - Bain & Company
DACI Decision Making - Atlassian
Request for Comments (RFC) - Triton Data Center
Requests for Discussion (RFD) - Wikipedia
Scaling Engineering Teams via RFCs: Writing Things Down - Gergely Orosz
A list of openly available RFC templates - Curated by Gergely Orosz
Engineering Radar - QE Unit
More about the tech:
Consistent Hashing - Ably
That’s a wrap - Hope you liked it!
Let us know what you think of that article. Was it too long? Too short? Did it bring you value? Do you like the format? Was it actionable enough? What can we improve? We’re all ears 👇
Thanks in advance for the feedback!
Quang, Cofounder at Plato, on behalf of the Plato team