The High Cost of Technical Debt: A Case Study
Technical Debt: we all have it. Yet, this phenomenon remains poorly understood by product managers. Unlike financial debt, the costs are often hidden and difficult to measure. But the most dangerous aspect is that "Technical Debt items are contagious, causing other parts of the system to be contaminated with the same problem, which may lead to nonlinear growth of interest." [1]
Here's a case study of one such event, in which unmanaged tech debt caused interest costs to spiral catastrophically out of control.
To set the scene, we have an online shopping system with the following requirements:
- A catalog of 30,000 total products
- 500 peak concurrent users
- An index page that displays the top 24 products in a chosen category (for example, a list of Blu-ray titles).
How much would you expect to pay for the SQL cluster running this catalog? Would you believe $360,000 per year in raw IaaS expenses? How could this happen?
Ten years ago, the system's creators built the catalog on a custom data access layer. At the time, it worked pretty well, but the architecture had a fatal flaw.
As years passed, the system grew. New discount coupons, special promotions, pricing tiers, inventory management, wishlists, and fulfillment features piled on. What was once a nimble system grew to 4 million lines of source code. The fatal flaw spread, making the code nearly unmaintainable. This in turn created a feedback loop: The code was so hard to understand it could no longer be refined during regular iteration cycles, and obsolete or unused features could not be removed. The system got bigger still.
Along the way, system crashes were common. Downtime averaged about 1 day per month. Each time, the only option for bringing it back online quickly was to add more hardware. Bit by bit, the SQL cluster grew into a $360,000 monstrosity.
The total cost of this tech debt will never be calculated, but it probably runs in the millions: lost revenue, customer service calls, lost employee productivity, and tarnished brand equity. Had minor investments been made to correct this flaw early, these massive losses would have been entirely preventable.
If you're a product manager, you urgently need to track your tech debt. Maintain a backlog, and dedicate a portion of your iteration budget to paying down the debt. Make a realistic economic estimate of the costs of delaying this debt paydown. Understand how dangerous this problem can become when it becomes contagious. Don't be my next case study!
Dylan Tack, love the article! Curious whether you think businesses will be able to use blockchain tech in the future to curb these costs?
I don't think there's a simple answer. Lately I've been influenced a lot by "The Principles of Product Development Flow" by Donald Reinertsen. This book claims traditional ROI calculations are counterproductive when applied to product development, and instead advocates a scheduling metric based on the cost of delay divided by the task duration. In other words, the highest economic priority is always the job that generates the most cost-of-delay savings per unit of bottlenecked resource. Figuring out how to untangle that, and apply it to software engineering, is perhaps a topic for a future post. I think if an organization can track its debt (via backlog tasks), and make even crude guesstimates at the economic cost of delaying each task, then they are ahead of 95% of orgs out there.
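To make the metric concrete, here is a minimal Python sketch of ranking a debt backlog by cost of delay divided by duration. Everything in it is an assumption for illustration: the `DebtItem` class, the `rank` helper, the task names, and the dollar and week figures are all hypothetical, not numbers from the case study or from Reinertsen's book.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    """One tech debt task in the backlog (hypothetical structure)."""
    name: str
    cost_of_delay_per_week: float  # crude guesstimate, dollars lost per week of delay
    duration_weeks: float          # estimated time on the bottlenecked resource

    def __post_init__(self) -> None:
        # A zero or negative duration would make the ratio meaningless.
        if self.duration_weeks <= 0:
            raise ValueError("duration_weeks must be positive")

    @property
    def priority(self) -> float:
        # Scheduling metric: cost of delay divided by task duration.
        return self.cost_of_delay_per_week / self.duration_weeks


def rank(items: list[DebtItem]) -> list[DebtItem]:
    """Return items ordered from highest to lowest priority."""
    return sorted(items, key=lambda item: item.priority, reverse=True)


# Made-up backlog with made-up estimates.
backlog = [
    DebtItem("Rewrite data access layer", cost_of_delay_per_week=20_000, duration_weeks=10),
    DebtItem("Remove unused promotion code", cost_of_delay_per_week=3_000, duration_weeks=1),
    DebtItem("Cache category index queries", cost_of_delay_per_week=8_000, duration_weeks=2),
]

for item in rank(backlog):
    print(f"{item.name}: {item.priority:,.0f} per week of effort")
```

With these invented figures, the largest item by raw cost of delay (the rewrite) ranks last, because it ties up the bottlenecked resource for much longer than the two smaller jobs.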
Great work putting some (shocking) numbers to this. Do you have any ballpark recommendations for what portion of the budget to dedicate to tech debt type issues to be more sustainable?