What I talk about when I talk about Technical Debt.

Communicating technical debt to people other than engineers is essential to getting work on that debt prioritized and valued alongside bugs and product roadmap work, and it’s not easy at all. One key quality distinguishing a good engineering leader from a great one is the ability to bring engineers and non-engineers to agreement about technical debt and its priority. In this article, I'll talk about how we've done that at Teamworks.

For a long time we struggled because different parts of the company defined technical debt differently, and because we couldn't communicate the urgency of tackling it.

We arrived at this definition:

Technical debt is the net-difference between process or code which services a short-term need but fails to service a predictable interim or long-term need without negative impact to the organization or its customers.

This doesn’t attempt to provide a taxonomy of technical debt. It doesn’t establish a framework to determine priority (I’ll talk about how we do it a bit later). But it does establish a hard line between what is and what is not technical debt and gets everyone on the same page.

Ward Cunningham first used the debt metaphor to talk about code problems that weren’t exactly bugs but rather things that made the code harder to understand and modify. It’s a good metaphor. It describes these problems as having a cost associated with them. It provides for the idea that there’s a principal and an interest rate, even if it doesn’t define how you arrive at those things.

Venture-backed companies must embrace strategic tech debt.

Venture-backed startup companies generally share one characteristic: in order to expand into new markets, they spend money faster than they can make it. This deficit spending is a conscious choice and makes long-term sense. Most importantly, it tends to pervade every decision made about how to allocate capital in a company.

That includes how that company allocates technical capital. A venture-backed software company has to build software faster than it can refine it. Getting into new markets and getting to market fit faster than competitors require lean experimentation alongside a codebase that’s also serving a well-developed base of paying customers who count on an agreed-upon service. This leads to conscious adoption of technical debt in the service of growing the company.

In an investment-backed company, you need to strategically embrace technical debt. Just remember to understand it, document it, and budget for paying it down before it buries you.

Let's start with a concrete example.

To secure a renewal, someone promised an important customer who was already on the fence that wearable data tracking would be available by April 21. To make that happen, you had to cut some corners. A new section of config has to be created for each user for the feature to work. Without a detailed update to the profile screen that walks someone through connecting their watch to the app, one of your engineers has to do it in SQL and by making API calls with the provider.

The cost of the customer doing it themselves is $0 and throughput is basically instantaneous, but it'd require those screens to be built – including validation and failsafes and OAuth handshakes.

The cost of someone in customer support doing it with a quick-and-dirty screen is, oh let's say $15/user for their time. But also the time that it takes for CS to do a rollout for a customer can't be allowed to be a drag on CSAT, so there may be additional opportunity costs if they're updating profiles en masse and letting other work pile up. You also have to consider the throughput time. Depending on their queuing strategy and time guarantees, the time from when a user realizes they need the new feature until they can use it goes from minutes to a day or so.

The ongoing cost of an engineer doing this in SQL and API calls is:

  • CS writes the support ticket and puts it on the engineering queue to be prioritized
  • An engineer stops work on features (possibly even the screen that cures this tech debt) and modifies the SQL template for a specific ticket, costing them a few minutes to an hour (cheap) and a context switch (expensive)
  • Another engineer spends their peer review time making sure the SQL is correct
  • Someone with permissions to execute SQL against the production database and API calls against the production vendor account runs it.

So now the user opens a support ticket to get on the new feature. Support sees it and forwards it along. Engineering uses their SDLC process to accomplish it or their 2nd-tier technical support process if you have a 2nd-tier support team.

Counting all the handoffs, the cost is now in the hundreds of dollars per ticket, if not pushing four figures. The throughput is now a day or more, depending on how disciplined your engineering team is about customer-facing problems, and it cuts into your team's ability to ship new features. The tickets interrupt people. Engineers generally don't switch contexts as cheaply as other functions in your organization, so you'll lose more than the few minutes of active work it takes to service the ticket. This is technical debt, and these are its associated costs.
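To make the per-ticket arithmetic concrete, here is a minimal sketch in Python. The four steps mirror the handoffs listed above, but every rate and duration is a placeholder assumption rather than a measured figure, and the Step class and function names are hypothetical. Swap in your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One handoff in servicing a ticket."""
    who: str
    active_minutes: float          # hands-on time spent on the ticket
    context_switch_minutes: float  # focus lost to the interruption (assumed)
    hourly_rate: float             # loaded cost per hour (assumed)

    def cost(self) -> float:
        minutes = self.active_minutes + self.context_switch_minutes
        return minutes / 60 * self.hourly_rate


# Placeholder figures for illustration only.
ENGINEERING_PATH = [
    Step("CS writes and files the ticket", 15, 0, 40),
    Step("engineer edits the SQL template", 30, 60, 120),
    Step("second engineer reviews the SQL", 15, 60, 120),
    Step("privileged operator runs the SQL and API calls", 15, 60, 120),
]


def ticket_cost(steps):
    """Total cost of one ticket, including time lost to context switches."""
    return sum(step.cost() for step in steps)


def active_only_cost(steps):
    """Cost if you only count hands-on minutes (the number people tend to quote)."""
    return sum(s.active_minutes / 60 * s.hourly_rate for s in steps)


print(f"active work only: ${active_only_cost(ENGINEERING_PATH):,.0f}")  # $130
print(f"with context switches: ${ticket_cost(ENGINEERING_PATH):,.0f}")  # $490
```

With these assumed inputs, the hands-on work is the cheap part and the interruptions account for most of the bill, which is the point of the paragraph above.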

Again though, this is not a matter of bad vs. good. The above impact scenario has its place. I've done it, but when I did I knew full well what I was doing. The point is that when you take on tech debt, you're aware of its scope, you document the impacts, and you communicate the need to clean it up. Sometimes the most debtful scenario is fine long term because in actuality it amounts to a few tickets per quarter and it's not worth diverting the team to write a fully hardened, well designed screen that puts the setup in the user's hands.

Quantifying and Communicating Debt.

In his article on Technical Debt, Martin Fowler characterizes technical debt this way:

Software systems are prone to the build up of cruft — deficiencies in internal quality that make it harder than it would ideally be to modify and extend the system further. Technical Debt is a metaphor, coined by Ward Cunningham, that frames how to think about dealing with this cruft, thinking of it like a financial debt. The extra effort that it takes to add new features is the interest paid on the debt.

This definition is a good one for software development in a vacuum, but it’s not all that useful in a company setting. The challenge with tech debt in a company is getting it on the docket when there are features to develop and territory to capture. In the enterprise, technical debt has impact well beyond engineering concerns. It includes:

  • The cost of providing adequate customer support.
  • Cost of providing performant and reliable software.
  • Cost of continued scaling of the customer base.
  • Cost of ensuring regulatory compliance and security.
  • Throughput of individual support requests and their impact on customer relationships and retention.
  • Cost of hiring engineers who can make system modifications reliably in bounded time.
  • Ability to execute on high- to medium-priority items in a product roadmap in a sane amount of time.
  • The impact of customer frustration from user “toil” and confusion necessitated by engineering around existing behavior, e.g. “You have to have this permission and go to that strangely named screen, and then do your task in 8 click-and-waits because that’s the only way we could build it.”

The impact of technical debt is the sum of these costs that are themselves the result of solvable technical issues, shortcuts to market (like mechanical turking), and costly adaptations (hacks). Too much debt can drag on a company’s KPIs across the board. A pragmatic approach to planning and accounting for technical debt on the other hand allows you to achieve things in a timeframe you couldn’t otherwise.

All of these costs, importantly, are quantifiable. You can calculate the increased cost of customer support. You can calculate the cost of churn due to low customer satisfaction. You can calculate the cost of your R&D department having to re-engineer and bootstrap a project to work around the problems from poorly maintained production software. And since you can quantify that cost, you can communicate it to the CEO, COO, and CFO.

The important part of communicating to the non-technical parts of the organization is quantification. If you can make a spreadsheet or a graph of it and relate it to ARR, you can relay debt in terms that are meaningful to everyone who isn't an engineer. I say cost and not other metrics because cost is always meaningful. There's no way to make it a vanity metric. No one cares that you improved garbage collection times by 50%. Everyone cares that you can eliminate $125,000 a quarter from cloud costs with one month of work.

In the above scenario, the cost of missing your engineering deadline is losing the renewal to churn. To the company that's, say, $850,000 in ARR. It's also the cost of missing quarterly earnings numbers, dropping Gross Retention, and so on. So as long as the cost to engineering, CS, and everyone else involved is less than that number, taking on the technical debt and maintaining it without fixing it is worthwhile.

When you tell engineers why they're not just delaying the feature by another sprint or two, this is what you tell them. When you tell folks handling renewals why this feature is being prioritized over others, making their negotiations harder, this is the calculus you give them. There's a number. The math works out.
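That math can be sketched in a few lines of Python. The $850,000 is the renewal figure from the example above. The ticket volume and per-ticket cost are hypothetical inputs, and the function names are made up for illustration. This is a rough comparison under those assumptions, not a full financial model.

```python
ARR_AT_RISK = 850_000  # renewal value from the example above


def annual_carrying_cost(tickets_per_year, cost_per_ticket, other_annual_costs=0.0):
    """Yearly cost of living with the debt: ticket handling plus anything else you track."""
    return tickets_per_year * cost_per_ticket + other_annual_costs


def debt_is_worthwhile(arr_at_risk, carrying_cost):
    """The debt pays for itself while it costs less than the revenue it protects."""
    return carrying_cost < arr_at_risk


def break_even_tickets(arr_at_risk, cost_per_ticket):
    """Tickets per year at which the carrying cost equals the revenue at risk."""
    return arr_at_risk / cost_per_ticket


# Hypothetical inputs: 400 setup tickets a year at $500 each.
carrying = annual_carrying_cost(tickets_per_year=400, cost_per_ticket=500)
print(carrying)                                   # 200000
print(debt_is_worthwhile(ARR_AT_RISK, carrying))  # True
print(break_even_tickets(ARR_AT_RISK, 500))       # 1700.0
```

The break-even count is the useful number to bring to a planning meeting: it shows how much growth in ticket volume the shortcut can absorb before it stops paying for itself.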

Yes, there are impacts to this, like engineers slowing down on feature development to handle support tickets, and like CS being forced to be really precise and check their work on a new screen that makes them handle the rollout of hundreds of users person by person. But in a controlled timeline, it works. And when you tell everyone that you're delaying something else by a couple of sprints to get the screen in, even after 95% of existing users have been migrated onto it, you point out the ongoing cost to everyone of running the above processes to add all the users every time sales signs a new customer.

And thus your technical debt gets cleaned up.

Principal vs. Interest.

In my experience, the principal of the tech debt is the cost of what it will take to provide a solution that eliminates the negative impacts. The interest is the toil and drag across the organization involved in using and maintaining the debt-financed solution and the growth of that toil over time as other code and company process has to work around or incorporate the tech-debt in order to get its job done.

Take for example a function that lets you create and update a form, but there’s nothing self-service to delete it; imagine that deleting and cascading was more complicated, so you needed longer to consider how to do it right. Not having the deletion is the principal. The interest on that principal consists of the time and cost of every ticket your DBA had to take care of manually in SQL in order to delete a customer’s form. It's the number of times you had to restore a backup or otherwise fix the database when the DBA made a mistake. It’s also the cost to the company’s reputation of all the times that it took longer than its customers felt was reasonable to delete that form, or they were impacted by mistakes.
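One way to put numbers on the metaphor is to compare the principal with the interest as it accumulates. The sketch below assumes the interest (tickets, restores, workarounds) grows by a fixed percentage each period, which is a simplification. The dollar figures, the growth rate, and the function names are all hypothetical.

```python
def interest_by_period(first_period_cost, growth_rate, periods):
    """Toil cost for each period, growing by growth_rate per period."""
    return [first_period_cost * (1 + growth_rate) ** n for n in range(periods)]


def periods_until_interest_exceeds_principal(
    principal, first_period_cost, growth_rate, max_periods=40
):
    """Number of periods after which cumulative interest passes the principal.

    Returns None if that does not happen within max_periods.
    """
    total = 0.0
    for n in range(max_periods):
        total += first_period_cost * (1 + growth_rate) ** n
        if total > principal:
            return n + 1
    return None


# Hypothetical: building self-service deletion costs $60,000 (the principal).
# Manual deletions and cleanup cost $10,000 in the first quarter and grow
# 20% a quarter as the customer base grows.
print(periods_until_interest_exceeds_principal(60_000, 10_000, 0.20))  # 5
```

Under these assumptions, by the fifth quarter you have paid more in interest than the fix would have cost. If the interest stays small and flat, the function returns a large number or None, which is the case where leaving the debt alone is a reasonable choice.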

Be outcome-oriented in describing debt clearing work.

When prioritizing tech-debt in the backlog, it helps to describe consequences to be mitigated instead of the solutions you plan to use to mitigate.

Imagine a system where you’re going from a single-instance cache to a highly available, scalable cache. Your description of the work could be “Switch web caching from our managed Redis to a managed ElastiCache.” That communicates nothing about the why, and in a backlog of 1000 tickets the title tells the product owner nothing about how to prioritize it against everything else. A better ticket would be titled “Cache misses are causing users to complain about slow performance at peak times.”

Anticipate and plan for change, or “Build Technical Capital”

There is an important distinction between doing work that anticipates change vs. work in reaction to it. Work that is done as a reaction to change is often paying down principal or interest on technical debt. Work that’s done in anticipation of change expands our overall technical capital.

To illustrate the difference, consider a mobile/web shared code project. At some point in the past we began using React on web as well as React Native to build our application. In the beginning, there was very little shared code, but we anticipated that much of the code between web and mobile would be shared in the future. If we had taken that foreknowledge and applied it then to solving the problem of “how to share code between web and mobile”, prioritized, and scheduled that work, that work would not have been “technical debt.”

Why? We realized our code repo was inadequate to future needs. Why is that not Technical Debt? The answer to that is also the answer to the question “Was there realized impact or did we get ahead of it?” It’s the difference between being forced to react vs. having the advantage of the situation. If we had done it then, we could have done it without also being impacted by the negative consequences that came with waiting too long to address it.

Instead we didn’t plan the work ahead and we had to build a shared-code solution while also experiencing development drag from engineers manually keeping shared code in sync.

Prioritize proactive work by thinking about the technical debt you take on if it’s ignored.

Engineering priorities

This bucket of work is for experimentation and work that has the potential for positive disruption. It’s a bucket for work where the engineering department can be the force for innovation. Think “labs.” If, when you crafted your story, you thought “Things are pretty good, but I think they could be way better,” then you have an engineering priority.

Prerequisite and requisite work

Prerequisite and requisite work is the work that should be done before building or revising a feature, or should be done in order to make the feature complete. This is often the work needed to make new development conform to engineering standards of quality, testability, and performance. Examples of this include:

  • Providing self-service admin functionality.
  • Refactoring code somewhere else in the stack that is common to the new / revised feature, so that it can be shared between the old and new.
  • Bulk uploads.
  • Settings screens.

Sometimes prerequisite work can be skipped and a feature can still be shipped, but it will be more costly to maintain and modify than a fully complete project. This causes technical debt to accrue, and thus the priority of the work can be based on that impact.

Bugfixes and routine maintenance

The difference between a bug and an item of technical debt is obvious most of the time. The distinction is blurry when the bug doesn’t affect the correctness of output, only some aspect of importance to engineering or operations. In some cases, the distinction may be down to urgency or whether treating it as technical debt can bring a single item into the context of a wider cleanup push.

Prioritize technical debt paydown

The key concepts are Urgency and Impact. Another key activity for grooming technical debt, however, is contextualization. This is the planning activity of organizing technical debt into well-scoped refactoring plans and epics, and collapsing closely related stories. This lets us tackle more technical debt than we could by grabbing a few stories off the stack. Teams should groom technical debt carefully and, where possible, create proactive solutions like refactors instead of playing “whack-a-mole” with issues that haven’t changed since they were initially reported.
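As an illustration of grooming by Urgency and Impact with contextualization, here is a small Python sketch that collapses related stories into epics and ranks the epics. The 1-to-5 scales, the urgency-times-impact score, the "area" grouping key, and the sample scores are all assumptions made for this example. They are one plausible scheme, not a description of how any particular team scores its backlog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtStory:
    title: str
    area: str     # hypothetical grouping key, e.g. a component or subsystem
    urgency: int  # 1 (can wait) to 5 (hurting now); scale is an assumption
    impact: int   # 1 (minor) to 5 (severe); scale is an assumption


def group_into_epics(stories):
    """Collapse closely related stories into one epic per area."""
    epics = defaultdict(list)
    for story in stories:
        epics[story.area].append(story)
    return dict(epics)


def rank_epics(stories):
    """Return (area, score) pairs, highest combined urgency x impact first."""
    scored = [
        (area, sum(s.urgency * s.impact for s in group))
        for area, group in group_into_epics(stories).items()
    ]
    # Sort by score descending, then by area name so ties are stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


BACKLOG = [
    DebtStory("Cache misses slow the app at peak times", "caching", 4, 4),
    DebtStory("Cache has no failover", "caching", 2, 3),
    DebtStory("Engineers manually sync web and mobile code", "shared-code", 3, 4),
    DebtStory("No self-service way to delete a form", "forms", 3, 3),
]

for area, score in rank_epics(BACKLOG):
    print(area, score)
```

Grouping before ranking is what keeps the two caching stories from being picked off one at a time: together they outrank everything else, which argues for a single scoped refactor.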

Conclusion.

Too often, "technical debt" is a meaningless phrase in a company setting. Making it meaningful is about showing the wider impact technical debt has on the organization. Everyone is impacted by technical debt, and so everyone has to collaborate on fixing it, whether that's writing code, adjusting timelines, smoothing over bumps with customers, or yielding budget dollars to help with the paydown. To achieve that kind of collaboration, though, you have to become a great communicator of technical debt to technical people and non-technical people alike.

By characterizing the debt in terms of cost, describing the pay-down in terms of impact, and making conscientious choices about which tech debt to take on, you will manage your company's technical debt balance effectively and not let it compound until it stalls growth.