Web stack system breach incident report (Postmortem)
May 02, 2022
The following is the incident report for the Web stack system breach that occurred on April 04, 2022.
Issue Summary
At 2:00 AM CAT, the on-duty engineer noticed that the web stack was misbehaving: the server returned 500 error responses whenever he tried to access data. In addition, some users could not gain access to their accounts, and attempts to log in as the superuser through the privileged account failed. The root cause of these failures was an attacker who managed to breach the system.
Timeline (all times CAT)
- 2:00 AM – On-duty engineer notices 500 error responses and failed superuser logins.
- 2:10 AM – Root session left exposed after a machine enters hibernation without a logout; the attacker gains access.
- 2:16 AM – Monitoring systems alert the on-duty engineer, who escalates the issue.
- 2:54 AM – Audit of every user account's authentication configuration begins.
- 4:15 AM – User authentication keys renewed and new passwords configured.
- 4:19 AM – All servers manually rebooted; roughly 60% of the lost data recovered.
- 4:58 AM – System back online with completely new authentication credentials.
Root Cause
At 2:10 AM, the root user was left vulnerable after one of the machines was put into hibernation without the user logging out first. The root session kept running in the background with all of its privileges exposed, which made it easy for the attacker to use its authentication key to reach the main server. After gaining full access, the attacker changed the authentication keys of several user accounts, making it impossible for those users to log in remotely. In addition, 40% of the company's data was erased from the database through an SQL injection carried out during the attack. The servers then began shutting themselves down to prevent the attacker from going any further.
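To illustrate the class of vulnerability involved, here is a minimal sketch (not our actual application code) of how building SQL statements from raw user input allows injection, and how a parameterized query prevents it. The table, column, and input values are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

    # Hypothetical malicious input of the kind used in SQL injection.
    user_input = "alice'; DROP TABLE users; --"

    # Vulnerable pattern: the input is concatenated into the statement,
    # so a crafted value can change the meaning of the query itself.
    unsafe_query = "SELECT email FROM users WHERE name = '%s'" % user_input

    # Safe pattern: the driver treats the value strictly as data.
    safe_query = "SELECT email FROM users WHERE name = ?"
    rows = conn.execute(safe_query, (user_input,)).fetchall()
    print(rows)  # [] -- the malicious string matches no user; no table is dropped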
Resolution and Recovery
At 2:16 AM CAT, the monitoring systems alerted our on-duty engineer, who investigated and quickly escalated the issue. The incident response team identified the breach, removed the attacker from the system, and cleaned out the virus.
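As an illustration of the kind of check behind such an alert, here is a minimal health-probe sketch, assuming a Python environment with the requests library; the URL and timeout are hypothetical, not our actual monitoring configuration.

    import sys
    import requests

    HEALTH_URL = "https://example.com/health"  # hypothetical endpoint

    def probe(url: str) -> bool:
        """Return True if the endpoint answers with a non-5xx status."""
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException as exc:
            print(f"ALERT: {url} unreachable: {exc}")
            return False
        if response.status_code >= 500:
            print(f"ALERT: {url} returned {response.status_code}")
            return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if probe(HEALTH_URL) else 1)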
At 2:54 AM, we started checking every user account to verify that its authentication configuration still worked. This was only partly successful, since some private keys had been changed during the attack. The user authentication keys were renewed and new passwords were configured by 4:15 AM.
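A minimal sketch of what such a per-account check could look like, assuming key-based SSH access from a recovery host; the host name, user list, and accounts are hypothetical.

    import subprocess

    SERVER = "web-01.example.com"   # hypothetical host
    USERS = ["alice", "bob"]        # hypothetical account list

    def key_auth_works(user: str, host: str) -> bool:
        """Attempt a non-interactive SSH login. BatchMode forbids password
        prompts, so this fails unless the account's key still authenticates."""
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
             f"{user}@{host}", "true"],
            capture_output=True,
        )
        return result.returncode == 0

    for user in USERS:
        status = "ok" if key_auth_works(user, SERVER) else "NEEDS NEW KEY"
        print(f"{user}@{SERVER}: {status}")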
To help with the recovery, we turned off some of our monitoring systems that were triggering the virus. At 4:19 AM, all the servers were rebooted once more, manually this time. Some of the data was recovered, roughly 60% of the lost archives. By 4:58 AM, we managed to bring the system back online with completely new authentication credentials.
Corrective and Preventive Measures
Over the past three weeks, we have conducted a full system review and analysis of the breach. The following measures have been put in place to address the underlying causes of the attack, prevent recurrence, and improve response times:
- Mandatory logout, with automatic session termination, before any machine enters hibernation, so privileged sessions are never left running unattended.
- Rotation of all server and user authentication keys, with stronger password requirements.
- Parameterized database queries and input validation across the web stack to close the SQL injection vector.
- Review of the monitoring rules so that alerts fire on the first anomalous 500 responses without interfering with recovery work.
The response team is committed to continually and quickly improving the system infrastructure and operational processes to prevent outages.