Five Steps to Follow When Troubleshooting, for Site Reliability, DevOps, and Cloud Engineers

Benjamin R Chiro

Published Mar 8, 2020


Content Covered

i) Collecting Information

ii) Identifying a Fault

iii) Locating a Fault

iv) Clearing a Fault

v) Documenting the Troubleshooting Process

Let's dive deep into the five steps to follow while troubleshooting.

1 Collecting Information

Collect primary information. It helps to locate faults.

i) Information to Be Collected

The following information must be collected:

§ Fault symptom

§ Fault occurrence time and frequency

§ Operations performed when or before the fault occurs

§ Message tracing when the fault occurs

§ Related output information when the fault occurs

§ Alarms when the fault occurs

§ Logs when the fault occurs

§ Operations performed after the fault occurs
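The items above can be kept together in one record so that nothing is forgotten during an incident. The following is only a minimal sketch: the class name `FaultReport`, its field names, and the sample values are illustrative assumptions, not part of any standard tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class FaultReport:
    """Primary information gathered for one fault (field names are illustrative)."""
    symptom: str
    occurred_at: datetime
    frequency: str = "unknown"  # e.g. "once" or "every 5 minutes"
    operations_before: List[str] = field(default_factory=list)
    message_traces: List[str] = field(default_factory=list)
    related_output: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    operations_after: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of the list-valued items that are still empty."""
        return [name for name, value in vars(self).items()
                if isinstance(value, list) and not value]


# Hypothetical example: only the symptom, time, and one alarm are known so far.
report = FaultReport(
    symptom="Checkout page returns HTTP 502",
    occurred_at=datetime(2020, 3, 8, 14, 30),
    alarms=["web-tier: upstream unavailable"],
)
print(report.missing_items())
```

Calling `missing_items()` shows which pieces of information still need to be collected before moving on to fault identification.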

ii) Source of Information

Faults are often detected from the fault information that may be obtained from:

§ Complaints from users or the customer service center

§ Notices from neighbouring offices

§ Alarms reported by the alarm system

§ Information collected through various tools, such as the network management system and a protocol analyser (e.g. Wireshark)

§ Information collected through device operations such as querying the device status, querying logs, and tracing messages

§ Abnormality detected in routine or preventive maintenance
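When logs are one of the sources, it usually helps to narrow them to the period around the fault occurrence time. The sketch below assumes each log line starts with a timestamp in the form `YYYY-MM-DD HH:MM:SS`; the function name `logs_around` and the sample lines are hypothetical.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed log timestamp layout
TIMESTAMP_LENGTH = 19                   # length of "YYYY-MM-DD HH:MM:SS"


def logs_around(lines, fault_time, window=timedelta(minutes=5)):
    """Return the log lines stamped within `window` of `fault_time`."""
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - fault_time) <= window:
            selected.append(line)
    return selected


# Hypothetical log excerpt
sample_lines = [
    "2020-03-08 14:20:00 INFO request served",
    "2020-03-08 14:28:12 WARN upstream slow",
    "2020-03-08 14:30:01 ERROR upstream unavailable",
    "stack trace continuation line",
    "2020-03-08 14:40:00 INFO request served",
]
relevant = logs_around(sample_lines, datetime(2020, 3, 8, 14, 30, 0))
for line in relevant:
    print(line)
```

The five-minute window is an arbitrary default; widen it when the fault is intermittent or its start time is uncertain.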

iii) Necessity

Collecting the relevant information is very important when faults occur, for the following reasons:

§ Fault location becomes more difficult as networks expand and networking environments grow more complex. Therefore, efficient collection of fault information is the key to quickly locating and clearing faults.

§ In most cases, fault information obtained by telephone alone is not detailed enough to identify fault causes.

§ A brief description of the symptom alone cannot meet the requirements of fault analysis. Therefore, the related information, such as logs, alarms, and message traces, must also be collected to locate and clear faults quickly.

iv) Maintenance Suggestions

The following provides the maintenance staff with some tips:

§ Gather information when a fault occurs. When a fault, especially a severe one, occurs, learn about it carefully before taking the next steps, rather than handling it in haste.

§ Study the related information, especially the system principles and the signalling-related information, to identify fault causes.

§ Ask questions that cover various aspects of the fault when receiving a complaint by phone.

§ Ensure that you have good communication with the maintenance staff of other offices or departments.

2-Identifying A Fault

After the fault information is gathered, you must identify the fault.

i)-Determining the Fault Scope

The fault scope refers to where the fault has occurred. For example, in a three-stack application it can be in a functional module of the web stack, the application stack, or the database stack, depending on the module design of the platform.
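A minimal sketch of narrowing the fault scope in a three-stack application is shown below, assuming each stack can be probed with some health check. The function name `failing_stacks` is hypothetical, and the lambda checks are stand-ins that you would replace with real probes (an HTTP request, a database ping, and so on).

```python
from typing import Callable, Dict, List


def failing_stacks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run one health check per stack and return the stacks that failed."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a probe that raises is treated as a failure
        if not healthy:
            failed.append(name)
    return failed


# Stand-in checks for illustration only
checks = {
    "web": lambda: True,
    "application": lambda: True,
    "database": lambda: False,
}
print(failing_stacks(checks))
```

A failed check only points to where to look first. A failure in one stack can still be a symptom of a fault in another, so the result is a starting point rather than a conclusion.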

ii)-Determining the Fault Type

When you analyse and categorize a fault, take into consideration the service flows and functional modules of the platform that you are managing.

iii)-Determining the Fault Severity

Fault severity can be minor, major, or critical.
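The three severity levels can be written down as an ordered type so that faults can be compared and sorted. The sketch below is only an illustration: the `classify` rule and its thresholds (50% and 5% of users affected) are placeholder assumptions, and each team should substitute its own criteria.

```python
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def classify(service_down: bool, affected_users_pct: float) -> Severity:
    """Hypothetical rule: the thresholds are placeholders, not a standard."""
    if service_down or affected_users_pct >= 50:
        return Severity.CRITICAL
    if affected_users_pct >= 5:
        return Severity.MAJOR
    return Severity.MINOR


# Illustrative faults: (description, service down?, percent of users affected)
faults = [
    ("slow report export", False, 1.0),
    ("login failures", False, 20.0),
    ("payment service outage", True, 100.0),
]
# Handle the most severe faults first
ranked = sorted(faults, key=lambda f: classify(f[1], f[2]), reverse=True)
for description, down, pct in ranked:
    print(classify(down, pct).name, description)
```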

3-Locating A Fault

A fault occurs in a unique way at a specific point of time. This determines the basic ways for locating the fault.

During the fault location, all possible causes of a fault are analysed. The relevant causes are then analysed to identify the actual cause of the fault.
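This narrowing from all possible causes to the relevant ones can be sketched as a simple elimination, assuming you can list the symptoms each candidate cause would produce. The function name `remaining_causes`, the candidate causes, and the symptoms below are invented for illustration.

```python
from typing import Dict, List, Set


def remaining_causes(candidates: Dict[str, Set[str]], observed: Set[str]) -> List[str]:
    """Keep only the causes whose expected symptoms were all observed."""
    return [cause for cause, expected in candidates.items()
            if expected <= observed]  # subset test


# Each possible cause mapped to the symptoms it would produce (illustrative)
candidates = {
    "database connection pool exhausted": {"app timeouts", "db connection errors"},
    "web server misconfiguration": {"http 502", "config change before fault"},
    "network link down": {"packet loss", "app timeouts"},
}
# Symptoms taken from the collected fault information
observed = {"app timeouts", "db connection errors", "http 502"}

print(remaining_causes(candidates, observed))
```

Causes that survive this filter are the ones worth analysing further. A cause is dropped as soon as one of its expected symptoms is absent from the collected information, which is why complete information collection matters.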

Locating faults efficiently helps to troubleshoot them on time and avoids accidents caused by incorrect operations on the system.

Troubleshooting measures can then be decided based on the fault location results.

4-Clearing A Fault

After determining the fault cause, you can perform troubleshooting.

To resume normal system operation, relevant measures must be taken to clear faults. The measures include checking lines, replacing hardware, modifying data configuration, performing system switch-over, and resetting modules.
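One way to apply such measures is one at a time, checking after each whether the system has recovered, so that no unnecessary change is made. The sketch below assumes the measures are supplied in the order you want to try them. The function name `clear_fault`, the `state` dictionary, and the lambda actions are stand-ins for real operations.

```python
from typing import Callable, List, Tuple


def clear_fault(measures: List[Tuple[str, Callable[[], None]]],
                is_healthy: Callable[[], bool]) -> Tuple[List[str], bool]:
    """Apply measures in order until the health check passes.

    Returns the names of the measures applied and the final health state.
    """
    applied = []
    for name, action in measures:
        if is_healthy():
            break  # stop as soon as the system has recovered
        action()
        applied.append(name)
    return applied, is_healthy()


# Simulated system state for illustration only
state = {"config_ok": False}
measures = [
    ("check lines", lambda: None),
    ("modify data configuration", lambda: state.update(config_ok=True)),
    ("reset module", lambda: None),
]
applied, healthy = clear_fault(measures, lambda: state["config_ok"])
print(applied, healthy)
```

The list of applied measures is also useful later, when the troubleshooting process is documented.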

5-Documenting the Troubleshooting Process

After clearing a fault, the troubleshooting process must be documented. Documentation is necessary for the following reasons:

· The experience of troubleshooting the fault is documented to serve as a key troubleshooting reference for similar faults.

· Modifications to network parameters are recorded and serve as a reference for future fault information collection.

The following should be documented:

· Fault symptom and collected information

· Network topology

· List of devices and media applied in the network

· List of protocols and applications adopted in the network

· Possible fault causes

· Solution and implementation result for each cause

· Experience obtained from the troubleshooting process

· Other information, such as references used in the troubleshooting process.
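The items above can be captured in a structured record that is easy to store and search later. This is a minimal sketch: the class name `TroubleshootingRecord`, its field names, and the sample values are assumptions made for illustration, and JSON is just one possible storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TroubleshootingRecord:
    """One documented troubleshooting case (field names are illustrative)."""
    symptom: str
    collected_information: List[str]
    network_topology: str
    devices_and_media: List[str]
    protocols_and_applications: List[str]
    # Maps each possible cause to the solution tried and its result
    causes_and_results: Dict[str, str] = field(default_factory=dict)
    experience: str = ""
    references: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record so it can be stored or shared."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example record
record = TroubleshootingRecord(
    symptom="Checkout page returns HTTP 502",
    collected_information=["web-tier alarm", "application logs 14:25 to 14:35"],
    network_topology="load balancer -> web stack -> application stack -> database stack",
    devices_and_media=["load balancer", "2 web servers", "1 database server"],
    protocols_and_applications=["HTTP", "SQL"],
    causes_and_results={
        "database connection pool exhausted": "raised pool size; fault cleared",
    },
    experience="Check connection pool metrics before restarting services.",
)
print(record.to_json())
```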

Thank you for taking the time to go through this troubleshooting process article, which I shared with my LinkedIn connections to celebrate my birthday. Feel free to add any action items that I missed in the comments section below.

Once again, if you enjoyed this article, I’d be very grateful if you’d help by sharing it with other connections as well. You can check my wall for other Knowledge-in-Tech posts that I share every week. Thank you!

Yours, Benjamin R Chiro, SRE/DevOps/BSS Consultant Engineer

