Mastering Linux Server Patching: From Manual Processes to Ansible Automation
As Linux administrators, we've all been there—staring at a spreadsheet of servers that need patching, wondering how to minimize downtime while maximizing reliability. After years of managing patch cycles across diverse environments, I've learned that successful patching isn't just about running yum update. It's about process, documentation, and increasingly, automation.
Let me share a comprehensive approach that works whether you're managing 5 servers or 500.
The Foundation: A Proven Manual Process
Before we automate anything, we need a solid process. Here's the framework I've refined over the years:
Phase 1: Planning & Governance (Don't Skip This!)
Why it matters: I've seen patches rolled back at 3 AM because someone forgot to notify the database team. Proper planning prevents painful nights.
Key activities:
- Raise the change record and get approvals before the window opens
- Notify every affected team (application, database, on-call) of the schedule
- Agree the maintenance window, success criteria, and rollback plan up front
Pro tip: For environments with 10+ servers, create a patching calendar. I maintain one showing which server groups get patched which week. It's saved countless conflicts.
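If you keep that calendar next to your automation, the weekly groups can live straight in the inventory. A minimal sketch (the group names and week assignments here are illustrative):

# inventory/patch_calendar (illustrative; pair with --limit at run time)
[patch_week_1]
web01.prod.example.com
web02.prod.example.com

[patch_week_2]
db01.prod.example.com
db02.prod.example.com

This pays off in the Ansible section below: patching week one's servers becomes a simple --limit patch_week_1.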
Phase 2: Pre-Patching Intelligence Gathering
This is where junior admins often rush, and senior admins slow down. Collecting the right data before patching can mean the difference between a smooth operation and a career-defining incident.
Essential checks:
# System baseline
uptime
hostname
uname -a
# Storage health (critical for preventing boot issues)
df -h
cat /etc/fstab
lsblk
vgdisplay
lvdisplay
# Network configuration
ip a
ip route # route -n also works, but needs the legacy net-tools package
cat /etc/resolv.conf # Back this up! DNS often breaks post-reboot
# For clustered environments
pcs status # Pacemaker/Corosync
hastatus # VCS clusters
The backup imperative: Before you patch anything, copy aside /etc/passwd, /etc/group, /etc/resolv.conf, and /etc/fstab (the same files the Ansible playbook later in this article backs up), and take a VM snapshot where the hypervisor allows it. The backup you skip is always the one you end up needing.
Package exclusions: Always confirm with application teams. Common exclusions include:
- kernel* where uptime commitments rule out a reboot this cycle
- Database packages such as mysql*, which DBAs typically patch on their own schedule
- Application-managed runtimes and servers such as java* and httpd*, where versions are pinned by the app
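If an exclusion needs to outlive a single yum run, the versionlock plugin (from yum-plugin-versionlock, assuming it's installed) pins packages until you explicitly release them:

# Pin and later release packages (requires yum-plugin-versionlock)
yum versionlock add 'kernel*' 'mysql*'
yum versionlock list
yum versionlock delete 'mysql*'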
Phase 3: Execution (The Main Event)
Once you're in the maintenance window, work systematically:
# Clean and check
yum clean all && yum check-update
# Apply patches (with exclusions if needed)
yum update -y --exclude=kernel* --exclude=httpd*
# Reboot
reboot
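That reboot doesn't have to be unconditional. On RHEL-family systems, a quick hedge before pulling the trigger, assuming yum-utils is installed:

# needs-restarting ships with yum-utils; exit code 1 means a reboot is required
needs-restarting -r || reboot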
Real-world wisdom: I always keep an out-of-band console session open (iDRAC, iLO, or KVM) during reboots. SSH is no help when a server hangs halfway through boot.
Phase 4: Post-Patching Validation
This phase determines if you sleep peacefully or get paged at midnight.
# Verify kernel version
uname -r
# Health checks
top # CPU and load
free -m # Memory usage
df -h # Disk space
uptime # System load
# Service validation
systemctl status <your-critical-services>
# Log analysis
tail -100 /var/log/messages
dmesg | grep -i error
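On systemd hosts I also pull errors from the journal; it catches boot-time failures that /var/log/messages can miss:

# Errors and worse since the current boot
journalctl -p err -b
# Same view for the previous boot (needs a persistent journal), handy for pre/post comparison
journalctl -p err -b -1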
Checklist approach: Run the same checks in the same order every time: kernel version, critical services, mounted filesystems, log errors, network and DNS. Tick each item off; memory is unreliable at 2 AM, a checklist isn't.
Phase 5: Closure & Documentation
The paperwork matters. Future you (or your replacement) will need this.
Level Up: Automation with Ansible
Once you've mastered the manual process, automation becomes your force multiplier. Here's how I've implemented Ansible for patch management across 100+ servers.
The Ansible Architecture
Directory structure:
linux-patching/
├── inventory/
│ ├── production
│ └── development
├── group_vars/
│ ├── all.yml
│ └── production.yml
├── playbooks/
│ ├── pre-patch-checks.yml
│ ├── patch-servers.yml
│ ├── post-patch-validation.yml
│ └── rollback.yml
├── roles/
│ ├── pre_checks/
│ ├── patching/
│ └── validation/
└── ansible.cfg
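The tree lists an ansible.cfg the article never shows; here is a minimal sketch of what mine looks like (the values are assumptions to adapt to your environment):

# ansible.cfg (minimal sketch; adjust paths and forks to taste)
[defaults]
inventory = inventory/production
roles_path = roles
forks = 10
host_key_checking = False

[privilege_escalation]
become = True
become_method = sudo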
Playbook 1: Pre-Patching Intelligence
This playbook captures everything we did manually, but across dozens of servers simultaneously.
---
# playbooks/pre-patch-checks.yml
- name: Pre-Patching System Checks
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  gather_facts: yes
  vars:
    report_dir: "/var/log/patching/pre-patch-{{ ansible_date_time.date }}"

  tasks:
    - name: Create report directory
      file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'

    - name: Gather system information
      shell: |
        echo "=== System Info ===" > {{ report_dir }}/system_info.txt
        uptime >> {{ report_dir }}/system_info.txt
        hostname >> {{ report_dir }}/system_info.txt
        uname -a >> {{ report_dir }}/system_info.txt

    - name: Collect storage information
      shell: |
        echo "=== Storage Info ===" > {{ report_dir }}/storage_info.txt
        df -h >> {{ report_dir }}/storage_info.txt
        echo -e "\n=== Block Devices ===" >> {{ report_dir }}/storage_info.txt
        lsblk >> {{ report_dir }}/storage_info.txt
        echo -e "\n=== Volume Groups ===" >> {{ report_dir }}/storage_info.txt
        vgdisplay >> {{ report_dir }}/storage_info.txt 2>/dev/null || echo "LVM not configured"

    - name: Backup critical files
      copy:
        src: "{{ item }}"
        dest: "{{ report_dir }}/{{ item | basename }}.backup"
        remote_src: yes
      loop:
        - /etc/passwd
        - /etc/group
        - /etc/resolv.conf
        - /etc/fstab
      ignore_errors: yes

    - name: Check available updates
      shell: yum check-update
      register: available_updates
      failed_when: false
      changed_when: false

    - name: Save available updates
      copy:
        content: "{{ available_updates.stdout }}"
        dest: "{{ report_dir }}/available_updates.txt"

    - name: Check cluster status (if applicable)
      shell: |
        if command -v pcs &> /dev/null; then
          pcs status > {{ report_dir }}/cluster_status.txt
        elif command -v hastatus &> /dev/null; then
          hastatus -sum > {{ report_dir }}/cluster_status.txt
        else
          echo "No cluster detected" > {{ report_dir }}/cluster_status.txt
        fi
      ignore_errors: yes

    - name: Generate pre-patch report summary
      shell: |
        cat << EOF > {{ report_dir }}/summary.txt
        Pre-Patch Report: {{ ansible_hostname }}
        Date: {{ ansible_date_time.iso8601 }}
        Kernel: $(uname -r)
        Uptime: $(uptime)
        Disk Usage: $(df -h / | tail -1 | awk '{print $5}')
        Updates Available: $(yum check-update 2>/dev/null | grep -c '^[a-zA-Z]' || true)
        EOF

    - name: Fetch reports to control node
      fetch:
        src: "{{ report_dir }}/summary.txt"
        dest: "./reports/{{ ansible_hostname }}_pre_patch_{{ ansible_date_time.date }}.txt"
        flat: yes
Playbook 2: Intelligent Patching
This handles the actual patching with safety checks and exclusions.
---
# playbooks/patch-servers.yml
- name: Linux Server Patching
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  serial: "{{ batch_size | default(5) }}"  # Patch in batches
  vars:
    exclude_packages: "{{ package_exclusions | default([]) }}"
    reboot_required: "{{ auto_reboot | default(true) }}"
    pre_patch_snapshot: "{{ create_snapshot | default(false) }}"

  tasks:
    - name: Check if server is in maintenance mode
      fail:
        msg: "Server not in maintenance mode. Set maintenance_mode=true in inventory."
      when: maintenance_mode is not defined or not maintenance_mode

    - name: Clean yum cache
      command: yum clean all
      changed_when: true

    - name: Check for available updates
      shell: yum check-update
      register: updates_check
      failed_when: false
      changed_when: false

    - name: Display available updates
      debug:
        msg: "{{ updates_check.stdout_lines }}"
      when: updates_check.stdout != ""

    - name: Build exclusion string
      set_fact:
        exclusion_string: "{{ exclude_packages | map('regex_replace', '^(.*)$', '--exclude=\\1') | join(' ') }}"
      when: exclude_packages | length > 0

    - name: Apply system updates (with exclusions)
      shell: "yum update -y {{ exclusion_string | default('') }}"
      register: yum_update
      when: updates_check.rc == 100  # yum check-update exits 100 when updates are available

    - name: Check if reboot is required
      # The oft-copied /var/run/reboot-required marker is a Debian convention;
      # needs-restarting -r (from yum-utils) is the RHEL equivalent: exit code 1 = reboot needed
      command: needs-restarting -r
      register: reboot_check
      failed_when: false
      changed_when: false

    - name: Determine if kernel was updated
      shell: |
        CURRENT_KERNEL=$(uname -r)
        LATEST_KERNEL=$(rpm -q kernel --last | head -1 | awk '{print $1}' | sed 's/kernel-//')
        if [ "$CURRENT_KERNEL" != "$LATEST_KERNEL" ]; then
          echo "REBOOT_NEEDED"
        else
          echo "NO_REBOOT"
        fi
      register: kernel_check
      changed_when: false

    - name: Reboot server if required
      reboot:
        msg: "Rebooting for kernel updates"
        pre_reboot_delay: 5
        post_reboot_delay: 30
        reboot_timeout: 600
      when:
        - reboot_required
        - kernel_check.stdout == "REBOOT_NEEDED" or reboot_check.rc == 1

    - name: Wait for server to come back online
      wait_for_connection:
        delay: 10
        timeout: 300
      when: reboot_required
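One more safety net worth adding to the play header above: serial already limits the blast radius to one batch, and max_fail_percentage tells Ansible to abort the remaining batches when a batch goes bad instead of marching on. A sketch of the relevant keys:

# Sketch: play-level keys so one bad batch halts the whole rollout
- name: Linux Server Patching
  hosts: "{{ target_hosts | default('all') }}"
  serial: "{{ batch_size | default(5) }}"
  max_fail_percentage: 0  # any failed host in a batch aborts the remaining batches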
Playbook 3: Post-Patch Validation
Automated verification ensures nothing broke during patching.
---
# playbooks/post-patch-validation.yml
- name: Post-Patching Validation
  hosts: "{{ target_hosts | default('all') }}"
  become: yes
  gather_facts: yes
  vars:
    report_dir: "/var/log/patching/post-patch-{{ ansible_date_time.date }}"
    critical_services: "{{ services_to_check | default(['sshd', 'crond']) }}"

  tasks:
    - name: Create post-patch report directory
      file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'

    - name: Verify kernel version
      shell: uname -r
      register: current_kernel
      changed_when: false

    - name: Check system health metrics
      shell: |
        echo "=== System Health ===" > {{ report_dir }}/health_check.txt
        echo "Uptime: $(uptime)" >> {{ report_dir }}/health_check.txt
        echo -e "\n=== Memory ===" >> {{ report_dir }}/health_check.txt
        free -m >> {{ report_dir }}/health_check.txt
        echo -e "\n=== CPU Load ===" >> {{ report_dir }}/health_check.txt
        top -bn1 | head -20 >> {{ report_dir }}/health_check.txt
        echo -e "\n=== Disk Usage ===" >> {{ report_dir }}/health_check.txt
        df -h >> {{ report_dir }}/health_check.txt

    - name: Verify all filesystems mounted
      shell: |
        mount | grep -v tmpfs > {{ report_dir }}/mounts.txt
        # Informational only: the formats differ, so treat the diff as a prompt to eyeball, not pass/fail
        diff -u /etc/fstab <(mount | awk '{print $1, $3, $5}') || true
      args:
        executable: /bin/bash
      register: mount_check
      failed_when: false

    - name: Check critical services
      service_facts:

    - name: Verify critical services are running
      assert:
        that:
          - ansible_facts.services[item + '.service'].state == 'running'
        fail_msg: "Service {{ item }} is not running!"
        success_msg: "Service {{ item }} is running"
      loop: "{{ critical_services }}"
      when: ansible_facts.services[item + '.service'] is defined

    - name: Check for errors in system logs
      shell: |
        echo "=== Recent Errors ===" > {{ report_dir }}/errors.txt
        tail -200 /var/log/messages | grep -i error >> {{ report_dir }}/errors.txt || echo "No errors found"
        echo -e "\n=== dmesg Errors ===" >> {{ report_dir }}/errors.txt
        dmesg | grep -i error | tail -50 >> {{ report_dir }}/errors.txt || echo "No errors found"

    - name: Verify network connectivity
      shell: |
        echo "=== Network Interfaces ===" > {{ report_dir }}/network.txt
        ip a >> {{ report_dir }}/network.txt
        echo -e "\n=== Routing ===" >> {{ report_dir }}/network.txt
        ip route >> {{ report_dir }}/network.txt
        echo -e "\n=== DNS Resolution ===" >> {{ report_dir }}/network.txt
        nslookup google.com >> {{ report_dir }}/network.txt 2>&1 || echo "DNS resolution failed"

    - name: Check cluster status (if applicable)
      shell: |
        if command -v pcs &> /dev/null; then
          pcs status > {{ report_dir }}/cluster_status_post.txt
        elif command -v hastatus &> /dev/null; then
          hastatus -sum > {{ report_dir }}/cluster_status_post.txt
        fi
      ignore_errors: yes

    - name: Generate validation report
      shell: |
        cat << EOF > {{ report_dir }}/validation_summary.txt
        Post-Patch Validation: {{ ansible_hostname }}
        Date: {{ ansible_date_time.iso8601 }}
        Running Kernel: {{ current_kernel.stdout }}
        Uptime: $(uptime)
        Critical Services Status:
        $(systemctl is-active {{ critical_services | join(' ') }})
        Disk Usage:
        $(df -h / | tail -1)
        Memory Usage:
        $(free -h | grep Mem)
        EOF

    - name: Fetch validation reports
      fetch:
        src: "{{ report_dir }}/validation_summary.txt"
        dest: "./reports/{{ ansible_hostname }}_post_patch_{{ ansible_date_time.date }}.txt"
        flat: yes

    - name: Mark patching as successful
      lineinfile:
        path: /var/log/patching/patching_history.log
        # lineinfile does not run a shell, so use the registered fact rather than $(uname -r)
        line: "{{ ansible_date_time.iso8601 }} - Patching completed successfully - Kernel: {{ current_kernel.stdout }}"
        create: yes
Inventory Configuration
Organize your servers logically for targeted patching:
# inventory/production
[web_servers]
web01.prod.example.com maintenance_mode=true
web02.prod.example.com maintenance_mode=true
[db_servers]
db01.prod.example.com maintenance_mode=true package_exclusions=['kernel*','mysql*']
db02.prod.example.com maintenance_mode=true package_exclusions=['kernel*','mysql*']
[app_servers]
app01.prod.example.com maintenance_mode=true
app02.prod.example.com maintenance_mode=true package_exclusions=['java*']
[production:children]
web_servers
db_servers
app_servers
[production:vars]
ansible_user=ansible
ansible_become_method=sudo
services_to_check=['sshd','crond','firewalld']
auto_reboot=true
batch_size=2
Group Variables for Different Environments
# group_vars/production.yml
---
# Patching configuration
package_exclusions: []
auto_reboot: true
create_snapshot: true
batch_size: 3
# Email notifications
email_notifications: true
notification_recipients:
- devops@example.com
- sysadmin@example.com
# Service validation
services_to_check:
- sshd
- crond
- firewalld
- rsyslog
# Backup settings
backup_critical_configs: true
backup_retention_days: 30
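Note the email settings above aren't consumed by the playbooks as written; here's a hedged sketch of how to wire them up with the community.general.mail module (the relay host is an assumption):

# Sketch: append to post-patch-validation.yml; smtp.example.com is a placeholder relay
- name: Email the validation summary
  community.general.mail:
    host: smtp.example.com
    port: 25
    to: "{{ notification_recipients }}"
    subject: "Patching complete: {{ ansible_hostname }}"
    body: "{{ lookup('file', './reports/' ~ ansible_hostname ~ '_post_patch_' ~ ansible_date_time.date ~ '.txt') }}"
  delegate_to: localhost
  when: email_notifications | default(false)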
Execution Workflow
Here's how I run a complete patching cycle:
# Step 1: Pre-patch checks (generates reports, no changes)
ansible-playbook -i inventory/production playbooks/pre-patch-checks.yml \
--limit web_servers
# Step 2: Review the reports in ./reports/
# Verify no issues before proceeding
# Step 3: Execute patching (in batches)
ansible-playbook -i inventory/production playbooks/patch-servers.yml \
--limit web_servers \
--extra-vars "batch_size=2"
# Step 4: Post-patch validation
ansible-playbook -i inventory/production playbooks/post-patch-validation.yml \
--limit web_servers
# Step 5: Generate combined report
./generate_patching_report.sh web_servers
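generate_patching_report.sh isn't part of the playbooks; a minimal sketch of what it could be, assuming reports were fetched into ./reports/ as the playbooks above do:

#!/bin/bash
# Sketch: stitch today's fetched post-patch summaries into one file for the change record.
# $1 only labels the output; filtering by group would need an inventory lookup.
GROUP="${1:-all}"
TODAY=$(date +%F)   # matches ansible_date_time.date (YYYY-MM-DD)
OUT="patching_report_${GROUP}_${TODAY}.txt"
{
  echo "Patching report: ${GROUP} (${TODAY})"
  for f in ./reports/*_post_patch_"${TODAY}".txt; do
    echo "----------------------------------------"
    cat "$f"
  done
} > "$OUT"
echo "Wrote ${OUT}"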
Advanced: Rollback Playbook
Because things don't always go as planned:
---
# playbooks/rollback.yml
- name: Rollback Patching (Emergency Use)
  hosts: "{{ target_hosts }}"
  become: yes

  tasks:
    - name: Confirm rollback intent
      pause:
        prompt: "WARNING: This restores pre-patch config backups and the previous kernel. Type 'YES' to continue"
      register: rollback_confirmation

    - name: Abort if not confirmed
      fail:
        msg: "Rollback cancelled by user"
      when: rollback_confirmation.user_input != "YES"

    - name: Restore critical configuration files
      # Assumes the rollback runs the same day as patching; point at the right dated directory otherwise
      copy:
        src: "/var/log/patching/pre-patch-{{ ansible_date_time.date }}/{{ item | basename }}.backup"
        dest: "{{ item }}"
        remote_src: yes
      loop:
        - /etc/resolv.conf
        - /etc/fstab
      ignore_errors: yes

    - name: Downgrade to previous kernel (if kernel updated)
      shell: |
        CURRENT_KERNEL=$(uname -r)
        PREVIOUS_KERNEL=$(rpm -q kernel --last | sed -n '2p' | awk '{print $1}' | sed 's/kernel-//')
        if [ "$CURRENT_KERNEL" != "$PREVIOUS_KERNEL" ]; then
          grubby --set-default=/boot/vmlinuz-$PREVIOUS_KERNEL
          echo "Kernel rollback configured. Reboot required."
        fi
      register: kernel_rollback

    - name: Reboot to previous kernel
      reboot:
        msg: "Rebooting to previous kernel version"
      when: "'Kernel rollback configured' in kernel_rollback.stdout"
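Unlike the other playbooks, target_hosts has no default here, and that's deliberate: a rollback should never accidentally hit 'all'. Invocation looks like:

# Scope the rollback explicitly to the affected hosts
ansible-playbook -i inventory/production playbooks/rollback.yml \
  --extra-vars "target_hosts=web01.prod.example.com"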
Real-World Lessons from the Trenches
For those with 3-5 years experience:
For mid-level engineers (5-10 years):
For senior engineers (10-15 years):
Measuring Success
Track these metrics to demonstrate value:
In my current environment, we've achieved:
Final Thoughts
Patching isn't glamorous work, but it's foundational to security and stability. Whether you're manually patching your first server or automating hundreds, the principles remain the same: plan thoroughly, document everything, verify religiously.
The manual process teaches you what can go wrong. Ansible ensures it doesn't go wrong at scale.
What's your patching horror story? Drop it in the comments—we've all got one. And if you're implementing Ansible for patching, I'd love to hear what challenges you're facing.
Did you find this helpful? Share it with your team and follow me for more deep dives into Linux systems engineering and automation. :)