IT Problem Solving Part 1: Understanding Your Problem

This is the first of a series I'll be writing to help codify the repeatable processes I see when I observe successful IT professionals do their jobs well. The goal is to equip readers with a toolkit they can use when making decisions, solving problems, and making IT lives better.

What's the best way to move an API to the cloud? By asking what problem you need to solve first.


I once took a multi-day course. It was called something like "Architecture with [insert cloud IaaS provider here]" and it was taught by a dude with a job title of Technology Evangelist. I and about 40 other IT veterans spent most of a week with each other. Tech leads, team leads, architects and one Technology Evangelist. Let's roll.

The course was certainly interesting and most of us walked out of it feeling like we could pass a certification exam. I didn't. Not even close. I wasn't there for a cert, so that didn't bother me overmuch. I have zero certifications and, if I can keep it that way, I'll die happy.

I was there for what was promised in the course title. How an architect can use someone else's infrastructure (you do know there's no such thing as cloud, right?) as a tool to achieve desired business outcomes. I wasn't looking for instructions on how to set up a geographically diverse content delivery network. I was looking for why and when to do it. That's what architecture is to me.

I solve problems for a living. My business and tech partners tell me what's ailing them and it's my job to prescribe the remedy. And keeping with the doctor / patient analogy, those partners often understand their symptoms. After all, they live it day after day. But when I tell my doctor I need an appendectomy, she smiles kindly and asks me questions about my tummy ache.

So when someone tells you that you need to move your API to the cloud, you need to be supportive and ask them to tell you about their downtime.

Far too often, we in IT are so quick to "add value" that we jump on a problem as stated and learn far too late that we solved the wrong one. Here's an example. I've changed the details to protect the innocent, but I left the important parts intact.

Solving the wrong problem

I had a project manager tell me that his project needed to move their app to a hot/hot architecture. Okay, why? Well, the project manager couldn't tell me much more than that the project was called "[Name of app] hot/hot". So I went to talk to the project's business sponsor.

She and I had a great relationship, so it wasn't too difficult for me to ask her what problem she was trying to solve. Turns out that a monolithic application, the one that receives transactions from point-of-sale systems and posts them to various sub-ledgers, goes down hard at some very inconvenient times. It went down during month-end processing last month and the CFO had some choice words. She didn't know why it went down, though, so I went to the app's tech owner.

This is the guy, I learned, who named the project "[Name of app] hot/hot". This is also the guy who told his senior VP that, if they'd been hot/hot, month-end wouldn't have failed. "Look, we already know what the problem is," he said to me. "I just need you to tell me how to do it." Alright, I went and talked to one of the tech leads.

Me: So why does the app crash at random times?

Tech lead: It's not random. Happens every time we get a large spike in POS transactions in an hour. Heavy sales, higher chance of crashing.

Me: Would spreading the load across data centers make that go away?

Tech lead: Maybe. I don't know.

Me: Why not?

Tech lead: Because the app was never meant to do what it's being asked to do.

I'll skip the boring details, but the real problem we were facing was that we used to have a back-end accounting app that operated on batch files and someone later bolted on an online transaction processing (OLTP) interface. The month-end processing that the CFO was so very interested in was always handled by the batch side of the app, but the front-office and analytics teams wanted data to come in more quickly. When the OLTP interface went down, it took the whole thing with it.

In the end, we split up the OLTP and accounting pieces and had the OLTP piece write to a database that the batch processors could ingest offline. We also forked the data over to the sales and analytics stacks. We did, indeed, deploy the OLTP side across data centers (that was the name of the project, after all) but the real value we accomplished was to decouple two things that should never have been coupled in the first place.
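To make the shape of that split concrete, here is a minimal sketch, not the system we actually built. It assumes a single staging table in SQLite, and every name in it (pos_staging, record_transaction, run_batch) is made up for illustration. The online side only writes rows; the batch side reads them later on its own schedule, so a spike on one side has no direct path to crash the other.

```python
import sqlite3


def create_staging(conn):
    # Hypothetical staging table shared by the online and batch sides.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pos_staging ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ledger TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )


def record_transaction(conn, ledger, amount_cents):
    """Online side: persist the POS transaction and return immediately."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pos_staging (ledger, amount_cents) VALUES (?, ?)",
            (ledger, amount_cents),
        )


def run_batch(conn):
    """Batch side: total unprocessed rows per ledger, then mark them done."""
    with conn:
        rows = conn.execute(
            "SELECT id, ledger, amount_cents FROM pos_staging "
            "WHERE processed = 0 ORDER BY id"
        ).fetchall()
        totals = {}
        for _row_id, ledger, amount_cents in rows:
            totals[ledger] = totals.get(ledger, 0) + amount_cents
        conn.executemany(
            "UPDATE pos_staging SET processed = 1 WHERE id = ?",
            [(row[0],) for row in rows],
        )
    return totals


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_staging(conn)
    record_transaction(conn, "sales", 1250)
    record_transaction(conn, "returns", -300)
    print(run_batch(conn))  # totals for rows written since the last batch run
```

The point of the sketch is the boundary, not the database: the two halves share data but not a process, so each can fail, scale, or be redeployed without dragging the other down.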

The bedrock of your problem domain

As seen above, you need to dig down to the bedrock of your problem domain before you can build on top of it. In the example, the most surface-level problem statement was the business owner telling me that her app goes down at inconvenient times. (I won't count the project manager citing the name of the project.) We had to dig through the strata to see that the OLTP piece goes down when it's over-taxed by high sales volume. Finally, we hit the true bedrock of the problem when the tech lead told me that the app was never meant to handle online data velocities.

The tools you have for understanding your problem

There's a repeatable methodology you can follow, using the tools below, to make sure you understand your problem domain:

Step 1: Ask multiple levels of why

Never take a problem at face value. You need to ask why, then ask it again, but with your question stated a little differently. Now ask it to someone a level deeper. Keep asking why and keep going deeper until you feel you know at what level your problem lives. 

Step 2: Understand the facts

Data is not the plural of anecdote. Don't let people stop at telling you vague notions. You need facts. If the server reboots "all the time," ask how often. It may be that it reboots 3 times a year or it may be 17 times an hour. Those are very different problems.

Also, people are experts in different things. Ask different people for different sides of the story. Flush out biases.

Finally, on this step, trust, yet verify. You have to do your own research and validate other people's assumptions.
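One way to turn "all the time" into a number is to count events in a known window. The sketch below assumes you can pull reboot timestamps out of a log or monitoring tool. The function name and the sample data are invented for illustration.

```python
from datetime import datetime, timedelta


def reboot_rate_per_hour(reboots, window_start, window_end):
    """Average reboots per hour inside [window_start, window_end)."""
    hours = (window_end - window_start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("window_end must be later than window_start")
    in_window = [t for t in reboots if window_start <= t < window_end]
    return len(in_window) / hours


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(hours=2)
    # Hypothetical data: one reboot every 10 minutes during the first hour.
    sample = [start + timedelta(minutes=10 * i) for i in range(6)]
    print(reboot_rate_per_hour(sample, start, end))  # 6 reboots over 2 hours
```

Run the same function over a year-long window and a one-hour window and you'll quickly see which of the two very different problems you have.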

Step 3: Root causes

Unless you understand your problem at its root, including all of its root causes, you don't understand your problem. If you only understand the surface-level symptoms, you run the risk of throwing money and time at solving the wrong problem. Worse, by solving the wrong problem, you often bury the root cause even deeper and make it even more difficult to detect for the next team wondering why end-of-month processing failed again.

Step 4: Document your problem statement

Write down the problem you actually need to solve and get stakeholders to agree on it. That way, if anyone ever wants to get a "quick win" by implementing a tactical solution, you can point out whether it still solves the real problem that we all agreed we have.

Disclaimers:

  1. Opinions expressed are solely my own and do not express the views or opinions of my employer.
  2. Any people referenced in this article are fictitious.

More articles by Dan McConkey