Troubleshooting (explained) in 5 mins


Troubleshooting is finding out what is happening and why it is happening (AKA the 'root cause').


Errors/Failures case

Situation: you've got a ticket/alarm about an error (or high error rate)

In general, troubleshooting can be broken down into these steps:

- Where is the issue (where to start investigation)

- What is happening exactly

- Why (what is the reason)


1. Where

What is the error/alarm source: which service (name, region), which operation, and when (time range).

Usually this information is provided in the alarm.

If not, the alarm still contains the name of the metric it is alarming on. You can search the code for that metric name (try searching for different parts of the name, since it is often assembled from several constants; e.g. it is quite common to have a prefix defined separately).
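
For illustration, here is a minimal Python sketch (all names here are made up) of how a metric name often ends up assembled from several constants, which is why searching for a fragment works better than searching for the full string:

    # Hypothetical example: the alarm shows the metric name
    # "OrderService.CreateOrder.Errors", but no single line in the code
    # contains that full string.
    METRIC_PREFIX = "OrderService"  # defined once, e.g. in a config module
    CREATE_ORDER_ERRORS = METRIC_PREFIX + ".CreateOrder.Errors"  # assembled elsewhere

    # Searching the codebase for "CreateOrder.Errors" (or just "CreateOrder")
    # is more likely to land on the emitting code than searching for the
    # full "OrderService.CreateOrder.Errors" string.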

Now, based on this information, you can find the relevant dashboard and then locate the graphs relevant to the operation.


2. What

What is it about? What is the error message, the error code?

If the ticket doesn't provide this information, it can be found in the application/service logs.

Based on this information you can often distinguish whether the error is caused by the service's own code/logic or by something external (throttling, a dependency failure, wrong input).

If the error is caused by a response from another service, get the request id and find out what happened to that request in the other service's logs; check the other service's dashboard as well. If that service is owned by a different team, find and engage the owners (e.g. via a ticket).
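
As a minimal sketch of tracing a request this way (assuming line-oriented text logs where each entry carries the request id somewhere in the line; the path and id below are hypothetical):

    # Print every log line that mentions a given request id.
    def grep_request(log_path: str, request_id: str) -> None:
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                if request_id in line:
                    print(line.rstrip())

    # Take the request id from your own service's error entry and look it up
    # in the other service's log:
    # grep_request("/var/log/service-b/application.log", "req-abc-123")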


3. Get more context

If at this stage it is still not clear what exactly the problem is, you need to gather more details.

Answering the following questions may help:

  • Does the error happen for particular inputs? Check for any commonality in the arguments of failed requests, e.g. are the failed requests coming from the same customer(s)? (See the sketch after this list.)
  • Any other commonality? E.g. do failures happen on a particular set of hosts, only in some ADs, or during particular time periods?
  • Any correlation? Do errors spike at the same time as a spike in request rate, CPU/memory utilization, GC pauses, etc.?
  • What percentage of requests is failing? Do other operations on the service have issues as well?
  • Was there a deployment or security patching around the time the errors appeared?
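
One way to check for that kind of commonality, sketched in Python (the CSV file and column names are hypothetical; substitute whatever your log/metrics tooling actually exports):

    import csv
    from collections import Counter

    def top_values(path: str, column: str, n: int = 5):
        # Count the most frequent values of one column among failed requests.
        with open(path, newline="", encoding="utf-8") as f:
            return Counter(row[column] for row in csv.DictReader(f)).most_common(n)

    # If one customer or one host dominates the failures, that is a strong hint:
    # print(top_values("failed_requests.csv", "customer_id"))
    # print(top_values("failed_requests.csv", "host"))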


4. Why (Root causing)

If the root cause is still unclear after going through the steps described above, further troubleshooting doesn't have a generic solution; it has to rely on experience, knowledge, creativity, persistence/grit, and sometimes on teamwork.


Some advice:

  • Check old/resolved tickets generated by this alarm: maybe it was raised for the same or a similar issue before and the previous oncaller documented the root cause and resolution in the ticket.

  • Always document/log all discovered information in the ticket as you work on it

  • It may be worth checking if other teams/services are having sev1s or sev2s at the same time. E.g. a network issue may impact many services in a non-obvious way.

  • It is easy to get 'tunnel vision' or go down sidetracks, so from time to time stop, step back for a minute, and review the big picture: how the issue manifests, what is known so far, what you are trying to check next, and whether there is anything else to consider. Ask yourself: what kind of problem can manifest like that?

  • If you've spent a non-trivial amount of time and are stuck or not sure what to do, don't hesitate to ask teammates for advice/help/an opinion. If you are not sure whom to ask, ask your manager, who can help find the right person.

  • If you are engaging another person in troubleshooting, always provide context. Bad: "That service doesn't work right". Good: "Service A has failures with error message B on API C in region D".

  • If you have access to the service's code, take a look at it; the code is the main source of truth for answering "how it works". You can find the relevant place in the code by searching for a class name or fragments of the error message. It is also a good idea to double-check that you are looking at the code revision that matches what is deployed (the most recent tip of the branch may differ from what is actually running).


And remember... The world is not perfect

  • Documentation is not perfect
  • Not all error messages (or log entries) are helpful
  • Tools may not work sometimes


Don't be easily defeated by imperfection: try to work around it, find a way through, use your problem-solving skills.

E.g. you may stumble on a dead link in the documentation: take a look at it, maybe you can still extract relevant information from it. If it is a dashboard link, you may be able to guess the service name from it and search for dashboards that start with the same prefix or name.

The error message is unclear? Check the source code; it may give a better understanding of what is going on. If the error doesn't come from the service's code, google it; you may find how others solved or troubleshot similar problems. Etc.
