Five Steps to Follow When Troubleshooting, for Site Reliability, DevOps and Cloud Engineers
Content Covered
i) Collecting Information
ii) Identifying a Fault
iii) Locating a Fault
iv) Clearing a Fault
v) Documenting the Troubleshooting Process
Let's dive deep into the five steps to follow while troubleshooting.
1-Collecting Information
Collect the primary (first-hand) information about the fault first. It is what you will use to locate the fault.
i) Information to Be Collected
The following information must be collected:
§ Fault symptom
§ Fault occurrence time and frequency
§ Operations performed when or before the fault occurs
§ Message tracing when the fault occurs
§ Related output information when the fault occurs
§ Alarms when the fault occurs
§ Logs when the fault occurs
§ Operations performed after the fault occurs
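The checklist above can be kept as a small record so that nothing is forgotten during an incident. The following is only a minimal sketch: the class name `FaultInfo`, its field names and the sample values are made up for illustration and are not part of any standard or tool.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class FaultInfo:
    """One record per fault. Field names are illustrative assumptions."""
    symptom: str = ""
    occurred_at: Optional[datetime] = None
    frequency: str = ""  # e.g. "once" or "every 5 minutes"
    operations_before: List[str] = field(default_factory=list)
    message_traces: List[str] = field(default_factory=list)
    related_output: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    operations_after: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical example values, only to show the usage.
info = FaultInfo(
    symptom="HTTP 502 on checkout",
    occurred_at=datetime(2024, 1, 5, 10, 30),
    alarms=["load balancer: backend down"],
)
print(info.missing_items())
```

Calling `missing_items()` before moving on to the next step shows which pieces of information still have to be gathered.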
ii) Source of Information
Faults are usually detected from fault information, which may be obtained from the following sources:
§ Complaints from users or the customer service center
§ Notices from neighbouring offices
§ Alarms reported by the alarm system
§ Information collected through various tools, such as the network management system or a protocol analyser (e.g. Wireshark)
§ Information collected through device operations such as querying the device status, querying logs, and tracing messages
§ Abnormalities detected during routine or preventive maintenance
iii) Necessity
Collecting the relevant information is very important when faults occur, for the following reasons:
§ Fault location becomes more difficult as networks expand and networking environments grow more complex. Therefore, efficient collection of fault information is the key to quickly locating and clearing faults.
§ In most cases, fault information obtained by telephone alone is not enough to identify the fault cause.
§ A simple description of the fault cannot meet the requirements of fault analysis. Therefore, the related primary information, such as message traces, alarms and logs, must also be collected to locate and clear faults quickly.
iv) Maintenance Suggestions
The following tips are for maintenance staff:
§ Gather information as soon as a fault occurs. When a fault occurs, especially a severe one, understand it thoroughly before taking the next steps rather than handling it in haste.
§ Study the related information, especially the system principles and the signalling-related information, to identify fault causes.
§ When receiving a complaint by phone, ask questions that cover the fault from several angles.
§ Maintain good communication with the maintenance staff of other offices or departments.
2-Identifying A Fault
After the fault information is gathered, you must identify the fault.
i) Determining the Fault Scope
The fault scope refers to where the fault has occurred. For example, in a three-stack application it can be in a functional module of the web stack, the application stack or the database stack, depending on the module design of the platform.
ii) Determining the Fault Type
When you analyse and categorize a fault, it is recommended that you take into consideration the service flows and functional modules of the platform that you are managing.
iii) Determining the Fault Severity
The fault severity can be minor, major or critical.
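The three identification results (scope, type and severity) can be recorded together. The sketch below is one possible way to do that. The article does not define severity criteria, so the thresholds in `classify_severity` are purely hypothetical and should be replaced by your own platform's rules; all names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    WEB = "web"
    APPLICATION = "application"
    DATABASE = "database"


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class FaultClassification:
    scope: Scope
    fault_type: str  # free text, e.g. named after the affected service flow
    severity: Severity


def classify_severity(service_down: bool, affected_users_pct: float) -> Severity:
    """Map impact to a severity level. Thresholds are assumptions."""
    if service_down or affected_users_pct >= 50:
        return Severity.CRITICAL
    if affected_users_pct >= 5:
        return Severity.MAJOR
    return Severity.MINOR


# Hypothetical example: a database fault affecting 12% of users.
fault = FaultClassification(
    scope=Scope.DATABASE,
    fault_type="order placement flow",
    severity=classify_severity(service_down=False, affected_users_pct=12.0),
)
print(fault.scope.value, fault.severity.name)  # database MAJOR
```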
3-Locating A Fault
Each fault occurs in a specific way at a specific point in time, and this determines the basic approach to locating it.
During fault location, all possible causes of the fault are listed first. The relevant causes are then analysed and ruled out one by one to identify the actual cause of the fault.
Locating faults efficiently helps to clear them promptly, and avoids accidents caused by incorrect operations on the system.
The measures for clearing the fault are derived from the fault location results.
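Working through the possible causes can be sketched as a loop over candidate causes, each paired with a check. This is only an illustration under stated assumptions: `locate_fault` is a hypothetical helper, and the checks are stubbed with fixed results where a real setup would probe the web, application and database stacks.

```python
from typing import Callable, Dict, List


def locate_fault(candidates: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each check and return the causes whose check reports a problem."""
    suspects = []
    for cause, check_fails in candidates.items():
        try:
            if check_fails():
                suspects.append(cause)
        except Exception as exc:
            # A check that cannot run is inconclusive, so keep it as a suspect.
            suspects.append(f"{cause} (check error: {exc})")
    return suspects


# Stubbed checks: True means "this check found a problem".
candidates = {
    "web stack unreachable": lambda: False,
    "application stack overloaded": lambda: False,
    "database connection refused": lambda: True,
}
print(locate_fault(candidates))  # ['database connection refused']
```

Keeping every candidate in one place also gives you the list of possible causes that is needed later when documenting the process.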
4-Clearing A Fault
After determining the fault cause, you can clear the fault.
To resume normal system operation, relevant measures must be taken to clear the fault. The measures include checking lines, replacing hardware, modifying data configuration, performing a system switchover, and resetting modules.
5-Documenting the Troubleshooting Process
After clearing a fault, the troubleshooting process must be documented. Documentation is necessary for the following reasons:
· The experience of troubleshooting the fault is documented to serve as a key troubleshooting reference for similar faults.
· Any modification of network parameters is recorded and serves as a reference when collecting fault information in the future.
The following should be documented:
· Fault symptom and collected information
· Network topology
· List of devices and media applied in the network
· List of protocols and applications adopted in the network
· Possible fault causes
· Solution and implementation result for each cause
· Experience obtained from the troubleshooting process
· Other information, such as references used in the troubleshooting process
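The items listed above can be captured in one record and rendered as plain text for a ticket or wiki page. The following is a minimal sketch, assuming a free-text format. The class names, field names and sample values are illustrative, not a prescribed template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CauseOutcome:
    cause: str
    solution: str
    result: str


@dataclass
class TroubleshootingRecord:
    symptom: str
    collected_info: List[str] = field(default_factory=list)
    topology: str = ""
    devices_and_media: List[str] = field(default_factory=list)
    protocols_and_apps: List[str] = field(default_factory=list)
    outcomes: List[CauseOutcome] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the record as plain text, one line per documented item."""
        def joined(items: List[str]) -> str:
            return ", ".join(items) if items else "(none recorded)"

        lines = [
            f"Fault symptom: {self.symptom}",
            f"Collected information: {joined(self.collected_info)}",
            f"Network topology: {self.topology or '(none recorded)'}",
            f"Devices and media: {joined(self.devices_and_media)}",
            f"Protocols and applications: {joined(self.protocols_and_apps)}",
            "Possible causes, solutions and results:",
        ]
        for o in self.outcomes:
            lines.append(f"  - {o.cause}: {o.solution} -> {o.result}")
        lines.append(f"Experience gained: {joined(self.lessons)}")
        lines.append(f"References: {joined(self.references)}")
        return "\n".join(lines)


# Hypothetical example values, only to show the usage.
record = TroubleshootingRecord(
    symptom="HTTP 502 on checkout",
    outcomes=[CauseOutcome("database connection refused",
                           "restarted the listener", "cleared")],
)
print(record.to_text())
```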
Thank you for taking the time to read this article on the troubleshooting process, which I shared with my LinkedIn connections to celebrate my birthday. Feel free to add any action items that I missed in the comments section below.
Once again, if you enjoyed this article, I'd be very grateful if you would share it with your other connections. You can also check my wall for the other Knowledge-in-Tech posts that I share every week. Thank you!
Yours, Benjamin R Chiro, SRE/DevOps/BSS Consultant Engineer