Mastering Linux Server Patching: From Manual Processes to Ansible Automation
As Linux administrators, we've all been there—staring at a spreadsheet of servers that need patching, wondering how to minimize downtime while maximizing reliability. After years of managing patch cycles across diverse environments, I've learned that successful patching isn't just about running yum update. It's about process, documentation, and increasingly, automation.
Let me share a comprehensive approach that works whether you're managing 5 servers or 500.
The Foundation: A Proven Manual Process
Before we automate anything, we need a solid process. Here's the framework I've refined over the years:
Phase 1: Planning & Governance (Don't Skip This!)
Why it matters: I've seen patches rolled back at 3 AM because someone forgot to notify the database team. Proper planning prevents painful nights.
Key activities:
- Raise the change record and get approvals before the window opens
- Notify every affected team (application, database, on-call) of the schedule
- Agree the maintenance window, success criteria, and rollback plan up front
Pro tip: For environments with 10+ servers, create a patching calendar. I maintain one showing which server groups get patched which week. It's saved countless conflicts.
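If you keep that calendar next to your automation, the weekly groups can live straight in the inventory. A minimal sketch (the group names and week assignments here are illustrative):

# inventory/patch_calendar (illustrative; pair with --limit at run time)
[patch_week_1]
web01.prod.example.com
web02.prod.example.com

[patch_week_2]
db01.prod.example.com
db02.prod.example.com

This pays off in the Ansible section below: patching week one's servers becomes a simple --limit patch_week_1.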
Phase 2: Pre-Patching Intelligence Gathering
This is where junior admins often rush, and senior admins slow down. Collecting the right data before patching can mean the difference between a smooth operation and a career-defining incident.
Essential checks:
# System baseline
uptime
hostname
uname -a
# Storage health (critical for preventing boot issues)
df -h
cat /etc/fstab
lsblk
vgdisplay
lvdisplay
# Network configuration
ip a
ip route # route -n also works, but needs the legacy net-tools package
cat /etc/resolv.conf # Back this up! DNS often breaks post-reboot
# For clustered environments
pcs status # Pacemaker/Corosync
hastatus # VCS clusters
The backup imperative: Before you patch anything, copy aside /etc/passwd, /etc/group, /etc/resolv.conf, and /etc/fstab (the same files the Ansible playbook later in this article backs up), and take a VM snapshot where the hypervisor allows it. The backup you skip is always the one you end up needing.
Package exclusions: Always confirm with application teams. Common exclusions include:
- kernel* where uptime commitments rule out a reboot this cycle
- Database packages such as mysql*, which DBAs typically patch on their own schedule
- Application-managed runtimes and servers such as java* and httpd*, where versions are pinned by the app
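If an exclusion needs to outlive a single yum run, the versionlock plugin (from yum-plugin-versionlock, assuming it's installed) pins packages until you explicitly release them:

# Pin and later release packages (requires yum-plugin-versionlock)
yum versionlock add 'kernel*' 'mysql*'
yum versionlock list
yum versionlock delete 'mysql*'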
Phase 3: Execution (The Main Event)
Once you're in the maintenance window, work systematically:
# Clean and check
yum clean all && yum check-update
# Apply patches (with exclusions if needed)
yum update -y --exclude=kernel* --exclude=httpd*
# Reboot
reboot
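That reboot doesn't have to be unconditional. On RHEL-family systems, a quick hedge before pulling the trigger, assuming yum-utils is installed:

# needs-restarting ships with yum-utils; exit code 1 means a reboot is required
needs-restarting -r || reboot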
Real-world wisdom: I always keep an out-of-band console session open (iDRAC, iLO, or KVM) during reboots. SSH is no help when a server hangs halfway through boot.
Phase 4: Post-Patching Validation
This phase determines if you sleep peacefully or get paged at midnight.
# Verify kernel version
uname -r
# Health checks
top # CPU and load
free -m # Memory usage
df -h # Disk space
uptime # System load
# Service validation
systemctl status <your-critical-services>
# Log analysis
tail -100 /var/log/messages
dmesg | grep -i error
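On systemd hosts I also pull errors from the journal; it catches boot-time failures that /var/log/messages can miss:

# Errors and worse since the current boot
journalctl -p err -b
# Same view for the previous boot (needs a persistent journal), handy for pre/post comparison
journalctl -p err -b -1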
Checklist approach: Run the same checks in the same order every time: kernel version, critical services, mounted filesystems, log errors, network and DNS. Tick each item off; memory is unreliable at 2 AM, a checklist isn't.
Phase 5: Closure & Documentation
The paperwork matters. Future you (or your replacement) will need this.
Level Up: Automation with Ansible
Once you've mastered the manual process, automation becomes your force multiplier. Here's how I've implemented Ansible for patch management across 100+ servers.
The Ansible Architecture
Directory structure:
linux-patching/
├── inventory/
│ ├── production
│ └── development
├── group_vars/
│ ├── all.yml
│ └── production.yml
├── playbooks/
│ ├── pre-patch-checks.yml
│ ├── patch-servers.yml
│ ├── post-patch-validation.yml
│ └── rollback.yml
├── roles/
│ ├── pre_checks/
│ ├── patching/
│ └── validation/
└── ansible.cfg
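The tree lists an ansible.cfg the article never shows; here is a minimal sketch of what mine looks like (the values are assumptions to adapt to your environment):

# ansible.cfg (minimal sketch; adjust paths and forks to taste)
[defaults]
inventory = inventory/production
roles_path = roles
forks = 10
host_key_checking = False

[privilege_escalation]
become = True
become_method = sudo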
Playbook 1: Pre-Patching Intelligence
This playbook captures everything we did manually, but across dozens of servers simultaneously.
---
# playbooks/pre-patch-checks.yml
- name: Pre-Patching System Checks
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  gather_facts: yes
  vars:
    report_dir: "/var/log/patching/pre-patch-{{ ansible_date_time.date }}"

  tasks:
    - name: Create report directory
      file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'

    - name: Gather system information
      shell: |
        echo "=== System Info ===" > {{ report_dir }}/system_info.txt
        uptime >> {{ report_dir }}/system_info.txt
        hostname >> {{ report_dir }}/system_info.txt
        uname -a >> {{ report_dir }}/system_info.txt

    - name: Collect storage information
      shell: |
        echo "=== Storage Info ===" > {{ report_dir }}/storage_info.txt
        df -h >> {{ report_dir }}/storage_info.txt
        echo -e "\n=== Block Devices ===" >> {{ report_dir }}/storage_info.txt
        lsblk >> {{ report_dir }}/storage_info.txt
        echo -e "\n=== Volume Groups ===" >> {{ report_dir }}/storage_info.txt
        vgdisplay >> {{ report_dir }}/storage_info.txt 2>/dev/null || echo "LVM not configured"

    - name: Backup critical files
      copy:
        src: "{{ item }}"
        dest: "{{ report_dir }}/{{ item | basename }}.backup"
        remote_src: yes
      loop:
        - /etc/passwd
        - /etc/group
        - /etc/resolv.conf
        - /etc/fstab
      ignore_errors: yes

    - name: Check available updates
      shell: yum check-update
      register: available_updates
      failed_when: false
      changed_when: false

    - name: Save available updates
      copy:
        content: "{{ available_updates.stdout }}"
        dest: "{{ report_dir }}/available_updates.txt"

    - name: Check cluster status (if applicable)
      shell: |
        if command -v pcs &> /dev/null; then
          pcs status > {{ report_dir }}/cluster_status.txt
        elif command -v hastatus &> /dev/null; then
          hastatus -sum > {{ report_dir }}/cluster_status.txt
        else
          echo "No cluster detected" > {{ report_dir }}/cluster_status.txt
        fi
      ignore_errors: yes

    - name: Generate pre-patch report summary
      shell: |
        cat << EOF > {{ report_dir }}/summary.txt
        Pre-Patch Report: {{ ansible_hostname }}
        Date: {{ ansible_date_time.iso8601 }}
        Kernel: $(uname -r)
        Uptime: $(uptime)
        Disk Usage: $(df -h / | tail -1 | awk '{print $5}')
        Updates Available: $(yum check-update 2>/dev/null | grep -c '^[a-zA-Z]' || true)
        EOF

    - name: Fetch reports to control node
      fetch:
        src: "{{ report_dir }}/summary.txt"
        dest: "./reports/{{ ansible_hostname }}_pre_patch_{{ ansible_date_time.date }}.txt"
        flat: yes
Playbook 2: Intelligent Patching
This handles the actual patching with safety checks and exclusions.
---
# playbooks/patch-servers.yml
- name: Linux Server Patching
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  serial: "{{ batch_size | default(5) }}"  # Patch in batches
  vars:
    exclude_packages: "{{ package_exclusions | default([]) }}"
    reboot_required: "{{ auto_reboot | default(true) }}"
    pre_patch_snapshot: "{{ create_snapshot | default(false) }}"

  tasks:
    - name: Check if server is in maintenance mode
      fail:
        msg: "Server not in maintenance mode. Set maintenance_mode=true in inventory."
      when: maintenance_mode is not defined or not maintenance_mode

    - name: Clean yum cache
      command: yum clean all
      changed_when: true

    - name: Check for available updates
      shell: yum check-update
      register: updates_check
      failed_when: false
      changed_when: false

    - name: Display available updates
      debug:
        msg: "{{ updates_check.stdout_lines }}"
      when: updates_check.stdout != ""

    - name: Build exclusion string
      set_fact:
        exclusion_string: "{{ exclude_packages | map('regex_replace', '^(.*)$', '--exclude=\\1') | join(' ') }}"
      when: exclude_packages | length > 0

    - name: Apply system updates (with exclusions)
      shell: "yum update -y {{ exclusion_string | default('') }}"
      register: yum_update
      when: updates_check.rc == 100  # yum check-update exits 100 when updates are available

    - name: Check if reboot is required
      # The oft-copied /var/run/reboot-required marker is a Debian convention;
      # needs-restarting -r (from yum-utils) is the RHEL equivalent: exit code 1 = reboot needed
      command: needs-restarting -r
      register: reboot_check
      failed_when: false
      changed_when: false

    - name: Determine if kernel was updated
      shell: |
        CURRENT_KERNEL=$(uname -r)
        LATEST_KERNEL=$(rpm -q kernel --last | head -1 | awk '{print $1}' | sed 's/kernel-//')
        if [ "$CURRENT_KERNEL" != "$LATEST_KERNEL" ]; then
          echo "REBOOT_NEEDED"
        else
          echo "NO_REBOOT"
        fi
      register: kernel_check
      changed_when: false

    - name: Reboot server if required
      reboot:
        msg: "Rebooting for kernel updates"
        pre_reboot_delay: 5
        post_reboot_delay: 30
        reboot_timeout: 600
      when:
        - reboot_required
        - kernel_check.stdout == "REBOOT_NEEDED" or reboot_check.rc == 1

    - name: Wait for server to come back online
      wait_for_connection:
        delay: 10
        timeout: 300
      when: reboot_required
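One more safety net worth adding to the play header above: serial already limits the blast radius to one batch, and max_fail_percentage tells Ansible to abort the remaining batches when a batch goes bad instead of marching on. A sketch of the relevant keys:

# Sketch: play-level keys so one bad batch halts the whole rollout
- name: Linux Server Patching
  hosts: "{{ target_hosts | default('all') }}"
  serial: "{{ batch_size | default(5) }}"
  max_fail_percentage: 0  # any failed host in a batch aborts the remaining batches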
Playbook 3: Post-Patch Validation
Automated verification ensures nothing broke during patching.
---
# playbooks/post-patch-validation.yml
- name: Post-Patching Validation
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  gather_facts: yes
  vars:
    report_dir: "/var/log/patching/post-patch-{{ ansible_date_time.date }}"
    critical_services: "{{ services_to_check | default(['sshd', 'crond']) }}"

  tasks:
    - name: Create post-patch report directory
      file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'

    - name: Verify kernel version
      shell: uname -r
      register: current_kernel
      changed_when: false

    - name: Check system health metrics
      shell: |
        echo "=== System Health ===" > {{ report_dir }}/health_check.txt
        echo "Uptime: $(uptime)" >> {{ report_dir }}/health_check.txt
        echo -e "\n=== Memory ===" >> {{ report_dir }}/health_check.txt
        free -m >> {{ report_dir }}/health_check.txt
        echo -e "\n=== CPU Load ===" >> {{ report_dir }}/health_check.txt
        top -bn1 | head -20 >> {{ report_dir }}/health_check.txt
        echo -e "\n=== Disk Usage ===" >> {{ report_dir }}/health_check.txt
        df -h >> {{ report_dir }}/health_check.txt

    - name: Verify all filesystems mounted
      shell: |
        mount | grep -v tmpfs > {{ report_dir }}/mounts.txt
        # Informational only: the formats differ, so treat the diff as a prompt to eyeball, not pass/fail
        diff -u /etc/fstab <(mount | awk '{print $1, $3, $5}') || true
      args:
        executable: /bin/bash
      register: mount_check
      failed_when: false

    - name: Check critical services
      service_facts:

    - name: Verify critical services are running
      assert:
        that:
          - ansible_facts.services[item + '.service'].state == 'running'
        fail_msg: "Service {{ item }} is not running!"
        success_msg: "Service {{ item }} is running"
      loop: "{{ critical_services }}"
      when: ansible_facts.services[item + '.service'] is defined

    - name: Check for errors in system logs
      shell: |
        echo "=== Recent Errors ===" > {{ report_dir }}/errors.txt
        tail -200 /var/log/messages | grep -i error >> {{ report_dir }}/errors.txt || echo "No errors found"
        echo -e "\n=== dmesg Errors ===" >> {{ report_dir }}/errors.txt
        dmesg | grep -i error | tail -50 >> {{ report_dir }}/errors.txt || echo "No errors found"

    - name: Verify network connectivity
      shell: |
        echo "=== Network Interfaces ===" > {{ report_dir }}/network.txt
        ip a >> {{ report_dir }}/network.txt
        echo -e "\n=== Routing ===" >> {{ report_dir }}/network.txt
        ip route >> {{ report_dir }}/network.txt
        echo -e "\n=== DNS Resolution ===" >> {{ report_dir }}/network.txt
        nslookup google.com >> {{ report_dir }}/network.txt 2>&1 || echo "DNS resolution failed"

    - name: Check cluster status (if applicable)
      shell: |
        if command -v pcs &> /dev/null; then
          pcs status > {{ report_dir }}/cluster_status_post.txt
        elif command -v hastatus &> /dev/null; then
          hastatus -sum > {{ report_dir }}/cluster_status_post.txt
        fi
      ignore_errors: yes

    - name: Generate validation report
      shell: |
        cat << EOF > {{ report_dir }}/validation_summary.txt
        Post-Patch Validation: {{ ansible_hostname }}
        Date: {{ ansible_date_time.iso8601 }}
        Running Kernel: {{ current_kernel.stdout }}
        Uptime: $(uptime)
        Critical Services Status:
        $(systemctl is-active {{ critical_services | join(' ') }})
        Disk Usage:
        $(df -h / | tail -1)
        Memory Usage:
        $(free -h | grep Mem)
        EOF

    - name: Fetch validation reports
      fetch:
        src: "{{ report_dir }}/validation_summary.txt"
        dest: "./reports/{{ ansible_hostname }}_post_patch_{{ ansible_date_time.date }}.txt"
        flat: yes

    - name: Mark patching as successful
      lineinfile:
        path: /var/log/patching/patching_history.log
        # lineinfile does not run a shell, so use the registered fact rather than $(uname -r)
        line: "{{ ansible_date_time.iso8601 }} - Patching completed successfully - Kernel: {{ current_kernel.stdout }}"
        create: yes
Inventory Configuration
Organize your servers logically for targeted patching:
# inventory/production
[web_servers]
web01.prod.example.com maintenance_mode=true
web02.prod.example.com maintenance_mode=true
[db_servers]
db01.prod.example.com maintenance_mode=true package_exclusions=['kernel*','mysql*']
db02.prod.example.com maintenance_mode=true package_exclusions=['kernel*','mysql*']
[app_servers]
app01.prod.example.com maintenance_mode=true
app02.prod.example.com maintenance_mode=true package_exclusions=['java*']
[production:children]
web_servers
db_servers
app_servers
[production:vars]
ansible_user=ansible
ansible_become_method=sudo
services_to_check=['sshd','crond','firewalld']
auto_reboot=true
batch_size=2
Group Variables for Different Environments
# group_vars/production.yml
---
# Patching configuration
package_exclusions: []
auto_reboot: true
create_snapshot: true
batch_size: 3
# Email notifications
email_notifications: true
notification_recipients:
- devops@example.com
- sysadmin@example.com
# Service validation
services_to_check:
- sshd
- crond
- firewalld
- rsyslog
# Backup settings
backup_critical_configs: true
backup_retention_days: 30
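Note the email settings above aren't consumed by the playbooks as written; here's a hedged sketch of how to wire them up with the community.general.mail module (the relay host is an assumption):

# Sketch: append to post-patch-validation.yml; smtp.example.com is a placeholder relay
- name: Email the validation summary
  community.general.mail:
    host: smtp.example.com
    port: 25
    to: "{{ notification_recipients }}"
    subject: "Patching complete: {{ ansible_hostname }}"
    body: "{{ lookup('file', './reports/' ~ ansible_hostname ~ '_post_patch_' ~ ansible_date_time.date ~ '.txt') }}"
  delegate_to: localhost
  when: email_notifications | default(false)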
Execution Workflow
Here's how I run a complete patching cycle:
# Step 1: Pre-patch checks (generates reports, no changes)
ansible-playbook -i inventory/production playbooks/pre-patch-checks.yml \
--limit web_servers
# Step 2: Review the reports in ./reports/
# Verify no issues before proceeding
# Step 3: Execute patching (in batches)
ansible-playbook -i inventory/production playbooks/patch-servers.yml \
--limit web_servers \
--extra-vars "batch_size=2"
# Step 4: Post-patch validation
ansible-playbook -i inventory/production playbooks/post-patch-validation.yml \
--limit web_servers
# Step 5: Generate combined report
./generate_patching_report.sh web_servers
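generate_patching_report.sh isn't part of the playbooks; a minimal sketch of what it could be, assuming reports were fetched into ./reports/ as the playbooks above do:

#!/bin/bash
# Sketch: stitch today's fetched post-patch summaries into one file for the change record.
# $1 only labels the output; filtering by group would need an inventory lookup.
GROUP="${1:-all}"
TODAY=$(date +%F)   # matches ansible_date_time.date (YYYY-MM-DD)
OUT="patching_report_${GROUP}_${TODAY}.txt"
{
  echo "Patching report: ${GROUP} (${TODAY})"
  for f in ./reports/*_post_patch_"${TODAY}".txt; do
    echo "----------------------------------------"
    cat "$f"
  done
} > "$OUT"
echo "Wrote ${OUT}"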
Advanced: Rollback Playbook
Because things don't always go as planned:
---
# playbooks/rollback.yml
- name: Rollback Patching (Emergency Use)
  hosts: "{{ target_hosts }}"
  become: yes

  tasks:
    - name: Confirm rollback intent
      pause:
        prompt: "WARNING: This restores pre-patch config backups and the previous kernel. Type 'YES' to continue"
      register: rollback_confirmation

    - name: Abort if not confirmed
      fail:
        msg: "Rollback cancelled by user"
      when: rollback_confirmation.user_input != "YES"

    - name: Restore critical configuration files
      # Assumes the rollback runs the same day as patching; point at the right dated directory otherwise
      copy:
        src: "/var/log/patching/pre-patch-{{ ansible_date_time.date }}/{{ item | basename }}.backup"
        dest: "{{ item }}"
        remote_src: yes
      loop:
        - /etc/resolv.conf
        - /etc/fstab
      ignore_errors: yes

    - name: Downgrade to previous kernel (if kernel updated)
      shell: |
        CURRENT_KERNEL=$(uname -r)
        PREVIOUS_KERNEL=$(rpm -q kernel --last | sed -n '2p' | awk '{print $1}' | sed 's/kernel-//')
        if [ "$CURRENT_KERNEL" != "$PREVIOUS_KERNEL" ]; then
          grubby --set-default=/boot/vmlinuz-$PREVIOUS_KERNEL
          echo "Kernel rollback configured. Reboot required."
        fi
      register: kernel_rollback

    - name: Reboot to previous kernel
      reboot:
        msg: "Rebooting to previous kernel version"
      when: "'Kernel rollback configured' in kernel_rollback.stdout"
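Unlike the other playbooks, target_hosts has no default here, and that's deliberate: a rollback should never accidentally hit 'all'. Invocation looks like:

# Scope the rollback explicitly to the affected hosts
ansible-playbook -i inventory/production playbooks/rollback.yml \
  --extra-vars "target_hosts=web01.prod.example.com"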
Real-World Lessons from the Trenches
For those with 3-5 years experience:
For mid-level engineers (5-10 years):
For senior engineers (10-15 years):
Measuring Success
Track these metrics to demonstrate value:
In my current environment, we've achieved:
Final Thoughts
Patching isn't glamorous work, but it's foundational to security and stability. Whether you're manually patching your first server or automating hundreds, the principles remain the same: plan thoroughly, document everything, verify religiously.
The manual process teaches you what can go wrong. Ansible ensures it doesn't go wrong at scale.
What's your patching horror story? Drop it in the comments—we've all got one. And if you're implementing Ansible for patching, I'd love to hear what challenges you're facing.
Did you find this helpful? Share it with your team and follow me for more deep dives into Linux systems engineering and automation. :)