It all went wrong overnight

While DevOps is a thing, and everyone wants to "learn it", it's an all too familiar pattern on the Ops half of DevOps that it's deemed quicker and easier to hop onto a server and make a change by hand rather than code it, test it, peer review it, create a change request and deploy it.

Why jump through all those hoops when I know what I'm doing and can have it done in a few minutes with a copy and paste in bash?

Well, here's why:

echo "100.133.132.11:/media/exthdd /media/exthdd nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" > /etc/fstab && mkdir -p /media/exthdd && mount -        

This command appeared to work flawlessly: it put the new entry into /etc/fstab and mounted the new NFS mount point there and then.

/media/exthdd was mounted fine.

A week later all the servers were rebooted as part of the weekly patching cycle, in which they get the latest kernel applied to them, and at this point it all went terribly wrong.

The real reasons this caused a problem were:

  1. I figured it was easier to do what I needed with a quick command-line hack.
  2. I didn't test what I was doing first.
  3. I didn't check with a second set of eyes that there wasn't a glaring mistake in the code.

Because I took the quick and easy route, I missed this:


I had used a single > to redirect into /etc/fstab, which effectively overwrote the file so it contained only my NFS mount, and not the UUID entry for the root (/) mount as well.

What you all know, and I now know, is that I should have used >> to append my line to the end of the file.
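
For completeness, here's a sketch of what the safer version of that one-liner would have looked like (same example server and mount point; the backup and the extra checks are my own habit, not part of the original change). It appends rather than overwrites, and sanity-checks fstab before anyone walks away:

# Back up first, then APPEND (>>) rather than overwrite (>)
cp /etc/fstab /etc/fstab.bak
echo "100.133.132.11:/media/exthdd /media/exthdd nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab
mkdir -p /media/exthdd

# Sanity checks before calling it done
grep -E '[[:space:]]/[[:space:]]' /etc/fstab   # the root (/) entry should still be there
findmnt --verify                               # validates fstab entries (recent util-linux)
mount -a                                       # mounts everything listed in fstab

Even then it's still a hand edit on a live box, which is exactly the sort of thing the rest of this post argues against.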

The reality of all this is a simple one: I did it on my home network, where I do run Git, I do run Ansible, and I could have deployed this as code using Rundeck. Instead I took out 5 servers, and was able to recover fairly quickly by booting off a Linux ISO, mounting each disk and editing the fstab file with the information from blkid.

This took out my database, web server, chat server and Rundeck instance (I'm too tight to pay for HA on Vultr).

Even on just these 4 servers, it took an hour or so to restore everything manually.
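
For anyone who hasn't had to do it, the manual recovery looked roughly like this on each box (a sketch from memory; the device name /dev/vda1 is an assumption and will vary, as will the live ISO you boot from):

# Boot the affected server from a live Linux ISO, then:
blkid                          # find the UUID and filesystem type of the root partition
mount /dev/vda1 /mnt           # assumed device name; use whatever blkid reports
vi /mnt/etc/fstab              # restore the root entry, e.g.
                               #   UUID=<uuid-from-blkid>  /  ext4  defaults  0  1
umount /mnt
reboot                         # boot back into the repaired system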

Imagine what would have happened in an enterprise environment across multiple sites, where someone had made this change weeks before, forgotten about it, and then needed to reboot hundreds of servers.

A simple server reboot would have taken out a site, service or everything depending on the setup.

And don't for a minute think that because you have dev and integration environments, Puppet, Chef, Ansible or whatever pipeline tools you use, this won't happen to you.

Because the strongest weapon against this type of mistake isn't a code pipeline, it's actually people: the peer review, the change request, the human gate between testing and production. That gate is also the one place in many companies I've seen where a tick gets put in a box because people are busy and there are no repercussions for doing so.

The advice I give is that when a person writes a change, they should include enough detail that you could walk out into the street, pick anyone, and that person could run the change and regress it if need be. (What I actually say is "write it so I could do it".)

Random errors are also put into changes on purpose, to see whether the peer review stage is being taken seriously, because that review is the human safeguard.

As someone once said to me: automation is great, it helps you deploy mistakes quicker.

Just a few Sunday afternoon thoughts, and a hands-up to making a stupid mistake.




