It all went wrong overnight

While DevOps is a thing, and everyone wants to "learn it", it's an all too familiar pattern on the Ops half of DevOps that it's deemed quicker and easier to hop onto a server and make a change by hand rather than code it, test it, peer review it, create a change request and deploy it.

Why jump through all those hoops when I know what I'm doing and can have it done in a few minutes with a copy and paste in bash?

Well, here's why:

echo "100.133.132.11:/media/exthdd /media/exthdd nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" > /etc/fstab && mkdir -p /media/exthdd && mount -        

This command appeared to work flawlessly: it put the new entry into /etc/fstab and mounted the new NFS mount point there and then.

/media/exthdd was mounted fine.

A week later all the servers were rebooted as part of the weekly patching cycle, in which they get the latest kernel applied to them, and at this point it all went terribly wrong.

The real reasons this caused a problem were:

  1. I figured it was easier to do what I needed with a quick command-line hack.
  2. I didn't test what I was doing first.
  3. I didn't check with a second set of eyes that there wasn't a glaring mistake in the code.

Because I took the quick and easy route, I missed this:


I had used a single > to redirect into /etc/fstab, which effectively overwrote the file so it contained only my NFS mount, and not the UUID entry for the root (/) mount as well.

What you all know, and I now know, is that I should have used >> to append my line to the end of the file.
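
For completeness, here's a sketch of what the safer version of that one-liner would have looked like (same example server and mount point; the backup and the extra checks are my own habit, not part of the original change). It appends rather than overwrites, and sanity-checks fstab before anyone walks away:

# Back up first, then APPEND (>>) rather than overwrite (>)
cp /etc/fstab /etc/fstab.bak
echo "100.133.132.11:/media/exthdd /media/exthdd nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab
mkdir -p /media/exthdd

# Sanity checks before calling it done
grep -E '[[:space:]]/[[:space:]]' /etc/fstab   # the root (/) entry should still be there
findmnt --verify                               # validates fstab entries (recent util-linux)
mount -a                                       # mounts everything listed in fstab

Even then it's still a hand edit on a live box, which is exactly the sort of thing the rest of this post argues against.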

The reality of all this is a simple one: I did it on my home network, where I do run Git, I do run Ansible, and I could have deployed this as code using Rundeck. Instead I took out 5 servers, and was able to recover fairly quickly by booting off a Linux ISO, mounting each disk and editing the fstab file with the information from blkid.

This took out my database, web server, chat server and Rundeck instance (I'm too tight to pay for HA on Vultr).

Even on just these 4 servers, it took an hour or so to restore everything manually.
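
For anyone who hasn't had to do it, the manual recovery looked roughly like this on each box (a sketch from memory; the device name /dev/vda1 is an assumption and will vary, as will the live ISO you boot from):

# Boot the affected server from a live Linux ISO, then:
blkid                          # find the UUID and filesystem type of the root partition
mount /dev/vda1 /mnt           # assumed device name; use whatever blkid reports
vi /mnt/etc/fstab              # restore the root entry, e.g.
                               #   UUID=<uuid-from-blkid>  /  ext4  defaults  0  1
umount /mnt
reboot                         # boot back into the repaired system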

Imagine what would have happened in an enterprise environment across multiple sites, where someone had made this change weeks before, forgotten about it, and then needed to reboot hundreds of servers.

A simple server reboot would have taken out a site, service or everything depending on the setup.

And don't for a minute think that because you have dev and integration environments, Puppet, Chef, Ansible or whatever pipeline tools you use, this won't happen to you.

Because the strongest weapon against this type of mistake isn't a code pipeline, it's actually people: the peer review, the change request, the human gate between testing and production. That gate is also the one place in many companies I've seen where a tick gets put in a box because people are busy and there are no repercussions for doing so.

The advice I give is that when a person writes a change, they should include enough detail that you could walk out into the street, pick anyone, and that person could run the change and regress it if need be. (What I actually say is "write it so I could do it".)

Random errors are also put into changes on purpose, to see whether the peer review stage is being taken seriously, because that review is the human safeguard.

As someone once said to me: automation is great, it helps you deploy mistakes quicker.

Just a few Sunday afternoon thoughts, and a hands-up to making a stupid mistake.




