What's so hard about problem solving?

Day after day, we're inundated with problems. Car won't start, spouse won't help with the household chores, cat won't use the litter box, etc. Unfortunately, I can't help you with these types of problems. I'm not a mechanic, a marriage counselor, or a cat whisperer. But having spent the better part of 32 years in IT, I do know something about problem solving in that arena. And the truth is, there are some general principles that apply regardless of the domain of your problem.

The first step is to accurately diagnose what the problem is. In today's increasingly complex, interconnected, and integrated IT environment, this can prove to be a significant challenge. Connections are intermittent, systems are loosely coupled, and translation and extraction layers abound as data flows through the enterprise. If it sounds difficult, that's because it is. Or more accurately, it can be, but it doesn't have to be.

Let's go back to the car example. The car won't start. You could just replace the battery, but what if it still doesn't start? Maybe it's out of gas, so you somehow get to the gas station, put some in a can, and pour it into the car when you get home. Still doesn't start? Hmm, what else would keep a car from starting? Truth is, it could be any number of things. Unfortunately, neither of the first two things you tried worked, and now you are out that time and money and still don't have a working car. Next step: have the car towed to the shop where a mechanic can look at it. He puts it on the service computer, but that may not help either. Maybe the problem IS the computer chip in the car. Or maybe it's something else. But a quality mechanic is going to work the problem from beginning to end until the cause is found and fixed. A bad mechanic may just start replacing parts at random and, even if by chance the car ends up starting, won't know why (or worse, will have the wrong reason why), and will end up costing you a lot of time and money.

So, back to the world of IT. How do we go about diagnosing and solving complex system problems? The same way a quality mechanic or other professional does: by breaking the problem down into solvable bits and eliminating them one by one until the cause is found. The solution, well, that may be another thing entirely, but you can't focus on that yet. One of the problems I have found is that developers often have limited diagnostic skills. They're great at debugging code, with a little help from today's modern, visual development environments, but that's not really what we're after here. It may come down to that, but doing so is really starting in the middle (or at the end) of the process rather than at the beginning. And that really is the root of the issue. Whether it's due to preconceived ideas about how things should work or the "my systems never have issues" mindset, the process must start at the point of the issue and work backwards. Don't jump back three spots to look at something, don't start looking at data in an upstream system, etc., until you have started at the beginning, which is accurately stating what the problem is.

So, let's say you have a job failure. Okay, let's look at the error message. Is it helpful? Do we have a call stack? Do we know what data was being processed? We recently had an issue during a cutover to a new service provider. We were getting authentication failures calling a service. Error messaging was spectacularly unhelpful, though that's a story for another day. This is a system that runs in AWS and integrates with a third party, so it's hard to see exactly what's going on. But we did have options. The first thing we did was try to make a call directly to the service provider using SoapUI with the credentials they provided. Hmm, well, that worked. Okay, what now? Well, the provider authentication credentials are encrypted and stored in the database. Are they being returned correctly as the decrypted values? Check. Next up? It starts to get a little more ambiguous. It could be that services weren't recycled after the credentials were updated, so we could recycle all the services, as that doesn't take too long. But is that the right thing to do? You have to get other people involved to do so, and do it in an orderly fashion so that you don't take down the service for everyone who uses those same servers with a different provider. You could try calling your own service through SoapUI, but that still returns the same generic error message that you are getting in your log files. What about running the service locally, whether from a dev machine or an on-premises data center? It might take a little work, or raise some potential security issues, but from a developer perspective, this may actually be ideal. And in fact, this is what we elected to do. We ran the service locally where we could monitor what was happening. And to everyone's surprise, it worked.

So, it works in one environment but not in another. Now what? It could be a code base issue, it could be a server issue, or it could be something else. Next step: eliminate what you can. First, validate that the binaries are the same. They are, so that's not the issue. We try recycling the servers to make sure they don't have any incorrect cached credentials. Still doesn't work. But we've narrowed the list of potential issues down to two: some sort of server issue, or a whitelisting issue. The most likely scenario at this point was an IP whitelisting issue, as the AWS environment was presenting different IP addresses than corporate. As it turns out, the vendor had misconfigured the whitelisted IPs, which caused transactions to be rejected with authentication failures.
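For readers who think better in code, here is a minimal sketch of that elimination process. It is not what we actually ran. The `Check` class, the `first_failing_check` function, and the hard-coded results are all hypothetical stand-ins, assuming each hypothesis can be reduced to a small test that either passes or fails. The point is the ordering: start at the symptom, rule out one thing at a time, and stop at the first piece that fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One hypothesis to rule out. is_healthy returns True when that piece is fine."""
    name: str
    is_healthy: Callable[[], bool]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails."""
    for check in checks:
        healthy = check.is_healthy()
        print(f"{'ok  ' if healthy else 'FAIL'}  {check.name}")
        if not healthy:
            return check.name
    return None  # every hypothesis was ruled out; widen the search


# Hypothetical stand-ins for the real tests. The hard-coded results
# mirror the story above; real checks would call actual systems.
checks = [
    Check("provider accepts the credentials on a direct call", lambda: True),
    Check("stored credentials decrypt to the expected values", lambda: True),
    Check("service works when run locally", lambda: True),
    Check("deployed binaries match the local build", lambda: True),
    Check("recycled servers hold no stale cached credentials", lambda: True),
    Check("provider whitelist includes our AWS IP addresses", lambda: False),
]

culprit = first_failing_check(checks)
```

Writing the list down, even on a whiteboard, makes it harder to skip a cheap check in favor of a favorite theory.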

So it seems like we did the right things in trying to diagnose and correct. And we did. But it wasn't without issue, because we focused on the authentication failure and were predisposed to believe it was a password issue. This led the team to spend precious time trying to track down how the password could be wrong, when a simple test would have shown that it wasn't wrong in our system, and we had previously validated it was correct. Instead of taking the issue and breaking it down into smaller, manageable pieces for diagnosis, we jumped to a preconceived idea about the problem and attempted to solve something that wasn't wrong.

So where to from here? Problem solving is a skill, one that must be developed and honed, just like working with the latest cloud technologies, JavaScript frameworks, language features, or design and architecture skills. But it's probably the one skill that receives the least amount of focus. It's true that it's not as exciting as learning a new technology, and not as marketable for new positions. "I don't know Node.js or TypeScript, but I sure know how to break down a problem and get it resolved" doesn't typically bring in the job offers, especially when the job requirement says you must know Node.js or TypeScript. And it's not the be-all and end-all for developers, but it is a skill that must be honed. Whenever time to resolve is critical, having great problem-solving skills is invaluable. So, how do you ensure your teams have the problem-solving skills necessary? I'm interested to hear your thoughts.

More articles by Jonathan Schafer
