#18 - How we handle tech debt at Kiddom | Allen Cheung, VP of Engineering
Tech debt is like a friend: it sticks with you at the worst times.
Hey friends, welcome to the 18th edition of the Plato Newsletter! We’re excited to announce that we’ve released the whole schedule for Elevate 2024 (in less than 2 weeks!), and we’re thrilled to invite Allen Cheung, VP of Engineering at Kiddom, to share more about his process for handling tech debt.
In French high schools, we often study one particular book from Voltaire, a French philosopher. It’s called Candide and tells the musings of a… candid person around the world. The book ends with that one sentence: "Excellently observed," answered Candide; "but let us cultivate our garden."
When I read Allen’s article, it made me think of this book. Your codebase is like a garden. It can blossom or perish, depending on how you cultivate it. And tech debt is like the annoying weeds and the overgrown grass you must cut and remove with effort… if you let it grow in the first place. If you want to learn more about gardening, read on. But first, a bit about Allen 👇
Allen Cheung, the best of both worlds
Education is the cornerstone of our future. This is what helps us advance as a society. And it’s something Allen Cheung understood early on.
As Kiddom’s VP of Engineering, he works on innovative solutions for teachers and learners, improving the educational experience and fostering a love for knowledge. Right before that, he was the VP of Eng & Product at Mystery.org, a platform that helps kids stay curious as they grow up. As a Plato mentor, he strives to help people grow as individuals and leaders.
His approach emphasizes context over control, impact over features, growth over a fixed mindset, feedback over opinions, and progress over perfection. Prior to Kiddom and Mystery.org, he’s had the opportunity to apply those principles at Affirm, as a Sr. Director of Engineering, and at Square and Google, where he was an Engineering Manager and Software Engineer.
You can follow Allen Cheung on LinkedIn.
How we handle Tech Debt at Kiddom
What is tech debt? When do you encounter tech debt and how can you overcome its gravity?
My name is Allen Cheung, and I’m currently the VP of Engineering at Kiddom. Previously, I held engineering leadership roles at companies like Affirm, Counsyl, Square, and Google. At Kiddom, we use technology to enable teachers and learners by providing a digital core curriculum to school districts.
What do early startups focus on?
For those unfamiliar with early-stage startups and their evolving environments, this is a simplified overview of what ideally happens in the first few years, delineated by stages of investment:
An early-stage startup begins by focusing on getting an idea off the ground, looking to prove a hypothesis about the efficacy of its product until it starts seeing market validation. In later stages, the focus shifts to growth: doubling down on scale and expanding business lines.
At Kiddom, we were starting to look at new markets via adjacent domains as well as geography. Naturally, this expansion of our product brought about tech debt.
What is tech debt?
It’s really around this phase of growth and expansion where tech debt—previously downplayed, or largely tolerated—begins to affect development.
What is “tech debt”? ProductPlan shares this definition:
Technical debt (also known as tech debt or code debt) describes what results when development teams take actions to expedite the delivery of a piece of functionality or a project which later needs to be refactored. In other words, it’s the result of prioritizing speedy delivery over perfect code.
For Kiddom, I’d say we recognized that tech debt is a consequence of prioritizing speed of iteration over technical rigor, in light of continuously changing product requirements.
Specifically:
Kiddom was built to encourage teachers to create their own curriculum. We spun out a set of complex services and apps for that use case, but over time, we iterated on the product until we found that teachers and their schools responded more strongly to high-quality content, authored by renowned curriculum authors.
From a technical standpoint, our databases were optimized for reads and writes equally for moderate amounts of data. The shift to the core curriculum—big, complex documents with thousands of nodes and pieces of embedded assets—stressed out this architecture.
In addition, the stress placed on our systems started to affect our overall stability and timeliness; we needed to be predictable with when we could deliver complex curricula to bigger school districts.
When tech debt flashes warning signs
For engineers, tech debt is forever a bane and a bundle of splinters to be yanked out… but it's hard to prioritize work based on bad vibes alone. It’s insidious by nature: projects weren’t quite delivered on time, finding and fixing issues took longer than it should have, and "code complete" was anything but.
A few of the warning signs we experienced:
Longer tests, more bugs, slower deploys
App performance took a hit
Developer velocity slowed
Employee NPS (net promoter scores) tanked, indicating low morale
Sure, some customer-facing issues were the result of product pivots and the remnants of prior objectives, but what really hurt—and drove us to take action—was our inability to move quickly to fix these problems. Unlike the sharp pain of degradation or outage, this was the boiling frog of slower development and subdued improvement.
Kiddom’s 4-step tech debt reduction process
Disclaimer: this is still very much a work in progress, and we humbly learn and iterate on what works and doesn’t work continuously.
We came up with these 4 steps:
Codifying the problem with our vocabulary
Stopping the world (for a sprint)
Normalizing the maintenance
Planning for bigger initiatives
Let’s take a closer look at each item.
Step 1: “Gardening”: Our own vocabulary
We needed an analogy to explain to non-technical folks why the work was needed. The term “tech debt” is already overused in the tech industry, and people’s preconceptions upon hearing it kept them from hearing about the real, underlying issues. As simplistic as it may sound, refreshing the vocabulary helped us get past this mental block, so we could tell our story and garner the understanding we needed to tackle these issues.
I shouldn’t claim credit for the term; that goes to one of our Senior Staff Engineers, Nelz, who put in the work to evangelize the “gardening” terminology. To further flesh out the metaphor: lawns left untended necessitate a do-over, but well-tended yards only require steady maintenance to keep flourishing. Nature, when left to its own devices, has a way of growing and withering, usually in unwanted, unexpected places. For non-technical folks, this was a way to appreciate how systems evolved organically, with changing requirements and products.
Step 2: The garden party
Once folks appreciated the “why,” that set the stage for a bigger task: stopping product development for a sprint to tackle long-standing issues, an initiative we lovingly referred to as “the garden party.”
Our sprints last two weeks, so this was a meaningful pause, but it was too short to fully tackle all the issues we identified.
As much as I was excited by this momentum, this tactic should be used sparingly, given the disruption to the business and teams’ regular operations. In our case, we recognized and accepted this tradeoff because we were worried about problems that otherwise wouldn’t be prioritized, as well as areas lacking direct ownership, where teams didn’t have enough justification to work on them in their regular sprints.
When we did this, some of the projects we tackled included: node version upgrades, removing deprecated libraries, and scrubbing PII from logging systems.
And while the pause wasn’t enough time to address larger initiatives—system migrations, major database changes—it gave us the space to start conversations and make project plans. Sometimes, just having dedicated time to work through problems that’d otherwise fall through the cracks is still valuable.
Step 3: Normalize the maintenance
Any development pause is disruptive. Stakeholders agreed to the effort, but we knew this had to be the exception; we wanted the norm to be continuous maintenance, where we in effect amortize this cost across product initiatives. Addressing issues as they come up, rather than batching them for a garden party, is also the timelier and preferred approach.
The garden party served as a template for teams to manage their tech debt. We standardized our processes to surface maintenance tasks and to incorporate them into our regular workflows. Two of our senior engineers, Dig and Jan, finalized a procedure where continual gardening would be interspersed within each team; the goal is for each team to spend roughly 20% of their time each sprint on maintenance activities. We preferred to let teams decide and prioritize tasks for their 20% gardening investments, as opposed to a more centralized queue of issues.
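To make the 20% arithmetic concrete, here is a minimal Python sketch of how a team might check its planned gardening work against that budget. It is not Kiddom's tooling: the `Task` type, the point-based capacity, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The "roughly 20%" target from the text; everything else here is hypothetical.
GARDENING_SHARE = 0.20


@dataclass
class Task:
    name: str
    points: float            # estimate, in whatever unit the team already uses
    gardening: bool = False  # True for maintenance ("gardening") work


def gardening_budget(capacity: float, share: float = GARDENING_SHARE) -> float:
    """Points a team would set aside for maintenance in one sprint."""
    return capacity * share


def gardening_report(tasks: list[Task], capacity: float) -> dict:
    """Compare the gardening work planned for a sprint against the budget."""
    budget = gardening_budget(capacity)
    planned = sum(t.points for t in tasks if t.gardening)
    return {"budget": budget, "planned": planned, "remaining": budget - planned}


if __name__ == "__main__":
    sprint = [
        Task("Remove deprecated library", 5, gardening=True),
        Task("New roadmap feature", 20),
        Task("Add tests to existing system", 3, gardening=True),
    ]
    print(gardening_report(sprint, capacity=40))
```

A report like this only surfaces whether the time was allocated; each team would still decide for itself which tasks fill the budget.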
Examples of regular maintenance tasks: replacing and consolidating web components, adding documentation and tests to existing systems, and redoing folder dependencies.
Regular maintenance tasks follow the same rubric as domain-specific bugs, but the trickier aspect is ensuring that the time is allocated and respected. That said, one-fifth of a sprint isn’t enough room to work on complex problems that would otherwise rival full-time projects. What usually happens is that someone starts a project in their 20% time but eventually gets bogged down by complexity. In these scenarios, we escalate…
Step 4: Planning for bigger initiatives
Even with regular gardening established, we still need to scope and execute more comprehensive architectural changes.
This was where we looked to our colleagues for inspiration: product management norms around specs, epics, scoping, milestones, resourcing, etc. served as a baseline for defining technical projects.
In applying the same amount of rigor to repairing tech debt as to other initiatives, we allow ourselves to compare them directly. When weighing projects that originate from product teams against those from engineering teams, we can still contrast common dimensions like effort, expected impact, and external deadlines.
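As a rough illustration of comparing product and engineering projects on shared dimensions, the Python sketch below ranks a mixed backlog. The ordering rule (deadline-bound work first, then impact per week of effort), the 1-to-5 impact scale, and all field names are assumptions chosen for the example, not the rubric we actually use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Project:
    name: str
    origin: str                      # "product" or "engineering"
    effort_weeks: float              # must be greater than zero
    expected_impact: int             # hypothetical 1-5 scale
    deadline: Optional[date] = None  # external deadline, if any


def rank(projects: list[Project]) -> list[Project]:
    """Order projects regardless of origin.

    Assumed rule: projects with an external deadline come first (earliest
    deadline first), then the rest by impact per week of effort, highest first.
    """
    def key(p: Project):
        return (
            0 if p.deadline is not None else 1,
            p.deadline or date.max,
            -(p.expected_impact / p.effort_weeks),
        )
    return sorted(projects, key=key)


if __name__ == "__main__":
    backlog = [
        Project("Feature A", "product", effort_weeks=4, expected_impact=4),
        Project("Query revamp", "engineering", effort_weeks=2, expected_impact=4),
        Project("District rollout", "product", effort_weeks=6, expected_impact=3,
                deadline=date(2024, 9, 1)),
    ]
    for p in rank(backlog):
        print(p.name, p.origin)
```

The point is less the formula than the shape of the data: once an engineering project carries the same fields as a product one, the two can sit in the same list.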
While Product Requirement Documents (PRDs) are pretty standard for product managers, as are design mocks/comps for designers, engineers’ experience and comfort levels with tech specs varied quite a bit on our team. As a result, we sometimes tried to make decisions with incomplete data, comparing projects that lacked comparable scope, detail, or depth of implementation.
Whenever possible, we prefer breaking out gardening tasks and attaching them to major features and roadmap items. Not only does this sequencing help reduce the cost of feature implementation down the line, but having a line of sight towards the current feature along with its immediate iterations provides a direction for refactors, so the architecture doesn’t have to guess where the product goes next.
Where we are today
Today, we have an 80% test coverage target and have transitioned our codebases to continuous integration and deployment. We’ve made targeted performance improvements, and we regularly report updates broadly. Though far from perfect, we’ve experienced fewer sprint hiccups and more accurate estimates, even as we continue to live with many legacy systems. Employee NPS has improved, though this tech debt strategy can’t take all the credit.
Reflecting a bit on what we’ve implemented
Words matter
Using our own words to describe problems let non-technical teams appreciate what engineering had to tackle and let us tell stories that made a stronger impact.
Be strategic about how you “stop the world”
Stopping the world is a judgment call, and it’s a heavy hammer to be wielded sparingly. Ideally, schedule it during a slower period of the year (in accordance with the business/product calendar), with the scope dependent on how much debt the organization has accumulated up to that point. I try to let the engineers and leads decide what to work on, but some amount of top-down curation is useful to keep people focused on impactful work.
Build up to the big projects
It would have been hard to get buy-in for big projects after years of rapid product development. We first had to start with quick wins and straightforward tasks, such as standardizing on faster, common modal components, or improving queries to speed up API calls. The former can lead to a bigger project standardizing all components, and the latter can be the beginning of revamping queries holistically.
This is applicable elsewhere
Other functional teams have been inspired by what they’ve seen from Engineering. They now describe their long-standing, open-ended issues as their own versions of “technical debt” and are even adopting these strategies to handle them. We’re excited to see that our approach resonates and seems to work beyond software development.
As for our team: we’ll maintain our gardens, take on major technical projects strategically, and keep the garden party event in our back pocket. Since I spoke about this at Elevate 2023, we have doubled down on gardening time by elevating its importance as an official KPI within the Engineering team, and we’ve identified and spun out a couple of projects (Epics, in JIRA parlance) that propel us to the next architectural milestone.
If you’re doing a good job, the product will evolve, and tech debt is the reward. Ironically, we might still falter: when the product requirements change faster than this process can accommodate, we’ll have to come up with a stronger approach—but that’s a great problem to have.
And that’s a wrap! Thanks to Allen for the gardening lesson! We hope that his guide will help you clear tech debt and minimize it in the future. See you soon, and feel free to share if you enjoyed it.
Cheers,
Quang & the team at Plato