Code Blue at 4:07am...
Setting the scene:
CIX has a policy of monitoring the service supplied to every customer. Our Network Operations Centre (NOC) responds to alerts from this monitoring 24 hours a day, 365 days a year. Between 10pm and 6am, our standard operating procedure requires us to wait fifteen minutes if the equipment is customer owned, in case the alert was generated by a server rebooting while security patches are applied. In all other cases we react immediately. Each customer has supplied us with a unique escalation plan for responding to an alert. As a result of this process, more than 80% of service-affecting issues are reported to the customer by our NOC before the customer is aware of a problem.
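For readers who prefer to see the waiting rule spelled out, here is a minimal sketch of it in Python. It is an illustration only, not the NOC's actual tooling: the function and constant names are hypothetical, and treating 10pm as inside the window and 6am as outside it is an assumption, since the procedure above does not say how the boundaries are handled.

```python
from datetime import time, timedelta

# Values taken from the procedure described above.
NIGHT_START = time(22, 0)            # 10pm
NIGHT_END = time(6, 0)               # 6am
PATCH_GRACE = timedelta(minutes=15)  # wait in case of a patch reboot


def in_night_window(t: time) -> bool:
    """True if t falls in the overnight window, which wraps past midnight.

    Assumption: 10pm is included and 6am is excluded.
    """
    return t >= NIGHT_START or t < NIGHT_END


def response_delay(alert_time: time, customer_owned: bool) -> timedelta:
    """How long to wait before reacting to an alert (hypothetical helper)."""
    if customer_owned and in_night_window(alert_time):
        return PATCH_GRACE
    # In all other cases the reaction is immediate.
    return timedelta(0)


if __name__ == "__main__":
    # An overnight alert on customer-owned equipment waits fifteen minutes.
    print(response_delay(time(4, 7), customer_owned=True))
    # The same alert on other equipment is acted on at once.
    print(response_delay(time(4, 7), customer_owned=False))
```

Under these assumptions, a 4:07am alert on customer-owned equipment would be held until 4:22am before anyone starts working through the escalation plan.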
On Sunday 4th June the following events unfolded as a result of this process:
4:07am, the CIX Network Operations Centre (NOC) was alerted that two servers monitored for a customer had stopped responding to pings, indicating that the customer's services might be offline. A visual check of the customer's rack did not show any obvious problem.
4:25am, the NOC Commander began following the customer's escalation plan. Each phone number and email address was tried in sequence, but the customer could not be reached. Voicemails were left, but only on office phones. The final entry in the plan was an 'out of hours' mobile number, but it was no longer in service. As it was a bank holiday weekend here in Ireland, we feared these messages might not be heard until Tuesday.
9:00am, with no response from anybody on the escalation plan contact list, we trawled through email communications with our customer and eventually found a personal mobile phone number of a senior executive. We also left voicemail on that number.
11:00am, we made contact with the senior executive and she advised us that her company was now fully aware of the issue and was mobilising repair technicians to come on site.
2:15pm, two repair technicians arrived on site and identified that the 'top of rack' switch was not functioning. A power cycle recovered the failed switch.
3:00pm, full services were restored. The escalation plan was amended in agreement with the customer to include up-to-date contact details.
On Tuesday, 6th June, CIX received a thank-you message from the company CEO. He explained that they offer an Enterprise SaaS solution predominantly to UK customers. Monday was a bank holiday in Ireland but not in the UK. He was delighted that the problem was rectified before it affected customers on Monday morning. He also asked us to add his personal mobile phone number to the end of our escalation plan as a final point of contact if all other contacts failed to respond.
Another happy customer!
Great article Jerry. CIX have always been great at communicating outages (both planned and unexpected!) with your customers. I really appreciated that as a small customer of yours. Continued success!
Well done and nicely written
Nice job.