Troubleshooting (explained) in 5 mins
Troubleshooting is finding out what is happening and why it is happening (the 'root cause').
Errors/Failures case
Situation: you've got a ticket/alarm about an error (or a high error rate).
In general, troubleshooting can be broken down into these steps:
- Where is the issue (where to start the investigation)
- What is happening exactly
- Why (what is the reason)
1. Where
What is the source of the error/alarm: which service (name, region), which operation, and when (time range)?
Usually this information is provided in the alarm.
If not, the alarm still contains the name of the metric it is alarming on. You can search the code for that metric name. Try searching for different parts of the name, since it may be assembled from several constants (e.g. it is quite common to have a prefix defined separately).
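The search by parts of the metric name can be sketched in a few lines of Python. This is only an illustration: the function names, the separators used to split the name, and the list of file extensions are assumptions, and in practice a plain grep or your code search tool does the same job.

```python
import re
from pathlib import Path


def metric_name_fragments(metric_name, min_length=4):
    """Return the full metric name plus its longer parts, split on common separators."""
    parts = re.split(r"[._:/\-]+", metric_name)
    fragments = [metric_name] + [p for p in parts if len(p) >= min_length]
    # Keep the original order and drop duplicates.
    return list(dict.fromkeys(fragments))


def search_sources(root, fragments, extensions=(".py", ".java", ".go", ".ts")):
    """Return (path, line number, line text, matched fragment) for every matching line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            for fragment in fragments:
                if fragment in line:
                    hits.append((str(path), number, line.strip(), fragment))
                    break  # one hit per line is enough
    return hits
```

For a hypothetical metric named OrderService.PlaceOrder.Errors, the fragments would be the full name plus OrderService, PlaceOrder and Errors, so a prefix constant defined in a separate file is still found.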
Now, based on this information, you can find the relevant dashboard and, in that dashboard, locate the graphs relevant to the operation.
2. What
What is the error about? What are the error message and the error code?
If the ticket doesn't provide this information, it can be found in the application/service logs.
Based on this information, you can often tell whether the error is caused by the service's own code/logic or by something external (throttling, dependency failure, wrong input).
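As a rough illustration of that first triage, error messages pulled from the logs can be bucketed by simple text markers. The marker strings and category names below are made up for the example, so adjust them to what your services actually emit. Anything that matches no marker is only "possibly internal" and still needs a human look.

```python
from collections import Counter

# Hypothetical markers, not taken from any real service.
EXTERNAL_MARKERS = {
    "throttling": ("throttl", "rate exceeded", "too many requests"),
    "dependency failure": ("timed out", "timeout", "connection refused", "service unavailable"),
    "wrong input": ("validation", "invalid", "bad request"),
}


def classify_error(message):
    """Return the first external category whose marker appears in the message."""
    lowered = message.lower()
    for category, markers in EXTERNAL_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return category
    return "possibly internal"


def summarize(messages):
    """Count how many messages fall into each category."""
    return Counter(classify_error(m) for m in messages)
```

Running summarize over the error lines for the alarm's time range gives a quick picture of whether most failures point outside the service or at its own code.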
If the error is caused by a response from another service, get the request id and find out what happened to that request in the other service's logs, and check the other service's dashboard. If that service is owned by a different team, find and engage the owners (e.g. via a ticket).
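Following one request across services could look like the sketch below. It assumes that every service writes the same request id into its log lines and that each line starts with an ISO 8601 UTC timestamp followed by a space. Both are assumptions about your logging setup, and the function name is made up for the example.

```python
def trace_request(logs_by_service, request_id):
    """Collect lines mentioning request_id from every service, ordered by timestamp.

    logs_by_service maps a service name to an iterable of log lines.
    Lines are assumed to start with an ISO 8601 UTC timestamp, so sorting
    the timestamp strings also sorts the lines by time.
    """
    matches = []
    for service, lines in logs_by_service.items():
        for line in lines:
            if request_id in line:
                timestamp, _, rest = line.partition(" ")
                matches.append((timestamp, service, rest.strip()))
    return sorted(matches)
```

The result is a single timeline of (timestamp, service, message) entries, which makes it easier to see in which service the request first went wrong.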
3. Get more context
If at this stage it is still not clear what exactly the problem is, you need to gather more details.
Answering the following questions may help:
4. Why (Root causing)
If the root cause is still unclear after going through the steps described above, further troubleshooting has no generic solution. It has to rely on experience, knowledge, creativity, persistence/grit, and sometimes teamwork.
Some advice:
And remember... The world is not perfect
Don't be easily defeated by imperfection: try to work around it, find a way through, use your problem-solving skills.
E.g. you may stumble on a dead link in the documentation. Take a closer look at the link itself, it may still contain useful information. If it is a dashboard link, you may be able to infer the service name from it and search for dashboards that start with the same prefix or name.
Is the error message unclear? Check the source code, it may give you a better understanding of what is going on. If the error doesn't come from the service code, google it: you may find how others have solved or troubleshot similar problems. And so on.