Troubleshooting (explained) in 5 mins


Troubleshooting is finding out what is happening and why it is happening (AKA the 'root cause').


Errors/Failures case

Situation: you've got a ticket/alarm about an error (or high error rate)

In general, troubleshooting can be broken down into these steps:

- Where is the issue (where to start investigation)

- What is happening exactly

- Why (what is the reason)


1. Where

What is the error/alarm source: which service (name, region), which operation, and when (time range).

Usually this information is provided in the alarm.

If not, the alarm still contains the name of the metric it is alarming on. You can search the code for that metric name (try searching for different parts of the name, since it is often assembled from several constants; e.g. it is quite common to have a prefix defined separately).
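
For illustration, here is a minimal Python sketch (all names here are made up) of how a metric name often ends up assembled from several constants, which is why searching for a fragment works better than searching for the full string:

    # Hypothetical example: the alarm shows the metric name
    # "OrderService.CreateOrder.Errors", but no single line in the code
    # contains that full string.
    METRIC_PREFIX = "OrderService"  # defined once, e.g. in a config module
    CREATE_ORDER_ERRORS = METRIC_PREFIX + ".CreateOrder.Errors"  # assembled elsewhere

    # Searching the codebase for "CreateOrder.Errors" (or just "CreateOrder")
    # is more likely to land on the emitting code than searching for the
    # full "OrderService.CreateOrder.Errors" string.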

Now, based on this information, you can find the relevant dashboard and then locate the graphs relevant to the operation.


2. What

What is it about? What is the error message, the error code?

If the ticket doesn't provide this information, it can be found in the application/service logs.

Based on this information you can often distinguish whether the error is caused by the service's own code/logic or by something external (throttling, a dependency failure, wrong input).

If the error is caused by a response from another service, get the request id and find out what happened to that request in the other service's logs; check the other service's dashboard as well. If that service is owned by a different team, find and engage the owners (e.g. via a ticket).
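
As a minimal sketch of tracing a request this way (assuming line-oriented text logs where each entry carries the request id somewhere in the line; the path and id below are hypothetical):

    # Print every log line that mentions a given request id.
    def grep_request(log_path: str, request_id: str) -> None:
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                if request_id in line:
                    print(line.rstrip())

    # Take the request id from your own service's error entry and look it up
    # in the other service's log:
    # grep_request("/var/log/service-b/application.log", "req-abc-123")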


3. Get more context

If at this stage it is still not clear what exactly the problem is, you need to gather more details.

Answering the following questions may help:

  • Does the error happen for particular inputs? Check for any commonality in the arguments of failed requests, e.g. are the failed requests coming from the same customer(s)? (See the sketch after this list.)
  • Any other commonality? E.g. do failures happen on a particular set of hosts, only in some ADs, or during particular time periods?
  • Any correlation? Do errors spike at the same time as a spike in request rate, CPU/memory utilization, GC pauses, etc.?
  • What percentage of requests is failing? Do other operations on the service have issues as well?
  • Was there a deployment or security patching around the time the errors appeared?
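
One way to check for that kind of commonality, sketched in Python (the CSV file and column names are hypothetical; substitute whatever your log/metrics tooling actually exports):

    import csv
    from collections import Counter

    def top_values(path: str, column: str, n: int = 5):
        # Count the most frequent values of one column among failed requests.
        with open(path, newline="", encoding="utf-8") as f:
            return Counter(row[column] for row in csv.DictReader(f)).most_common(n)

    # If one customer or one host dominates the failures, that is a strong hint:
    # print(top_values("failed_requests.csv", "customer_id"))
    # print(top_values("failed_requests.csv", "host"))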


4. Why (Root causing)

If the root cause is still unclear after going through the steps described above, further troubleshooting doesn't have a generic solution; it has to rely on experience, knowledge, creativity, persistence/grit, and sometimes on teamwork.


Some advice:

  • Check old/resolved tickets generated by this alarm: maybe it was raised for the same or a similar issue before and the previous oncaller documented the root cause and resolution in the ticket.

  • Always document/log all discovered information in the ticket as you work on it

  • It may be worth checking if other teams/services are having sev1s or sev2s at the same time. E.g. a network issue may impact many services in a non-obvious way.

  • It is easy to get 'tunnel vision' or go down sidetracks, so from time to time stop, step back for a minute, and review the big picture: how the issue manifests, what is known so far, what you are trying to check next, and whether there is anything else to consider. Ask yourself: what kind of problem can manifest like that?

  • If you've spent a non-trivial amount of time and are stuck or not sure what to do, don't hesitate to ask teammates for advice/help/an opinion. If you are not sure whom to ask, ask your manager, who can help find the right person.

  • If you are engaging another person in troubleshooting, always provide context. Bad: "That service doesn't work right". Good: "Service A has failures with error message B on API C in region D".

  • If you have access to the service's code, take a look at it; the code is the main source of truth for answering "how it works". You can find the relevant place in the code by searching for a class name or fragments of the error message. It is also a good idea to double-check that you are looking at the code revision that matches what is deployed (the most recent tip of the branch may differ from what is actually running).


And remember... The world is not perfect

  • Documentation is not perfect
  • Not all error messages (or log entries) are helpful
  • Tools may not work sometimes


Don't be easily defeated by imperfection: try to work around it, find a way through, use your problem-solving skills.

E.g. you may stumble on a dead link in the documentation: take a look at it, maybe you can still extract relevant information from it. If it is a dashboard link, you may be able to guess the service name from it and search for dashboards that start with the same prefix or name.

The error message is unclear? Check the source code; it may give a better understanding of what is going on. If the error doesn't come from the service's code, google it; you may find how others solved or troubleshot similar problems. Etc.
