10 Scenario-Based AWS DevOps Interview Questions with Answers

As a hiring manager or a seasoned DevOps professional, you know that the real test of a candidate isn't in their ability to recite AWS service definitions. It's in their capacity to apply that knowledge to complex, real-world problems. Scenario-based questions cut through the buzzwords and reveal a candidate's problem-solving methodology, understanding of core principles, and hands-on experience.

This article presents 10 critical scenario-based interview questions for AWS DevOps roles. The answers are designed not just to be "correct," but to demonstrate the depth of thinking and architectural best practices that separate good engineers from great ones.


1. The Midnight Page: High CPU on Production Web Servers

Scenario: You get paged at 2 AM because the CPU utilization on your EC2-based web servers in the production Auto Scaling Group (ASG) is consistently above 90%. The application is slow, and users are reporting timeouts. What is your immediate step-by-step approach to diagnose and resolve this issue?

Answer:

My first action is to stabilize the system to stop the user impact. I would immediately increase the desired capacity of the Auto Scaling Group to spin up new instances, absorbing the load while I investigate. This is a tactical fix, not a root-cause solution. Simultaneously, I would dive into Amazon CloudWatch. I’d check the CPUUtilization metric and, crucially, the StatusCheckFailed metrics to see if the instances are failing health checks. I would then use AWS Systems Manager Session Manager to securely connect to one of the affected instances without needing SSH keys.

Once inside, I would use standard Linux troubleshooting commands like top, htop, or vmstat to identify the process consuming excessive CPU. Is it the application server (e.g., a Java process), a database connection leak, or something else like a crypto-miner from a security breach? I would also check the application logs (e.g., in Amazon CloudWatch Logs) for any recent errors or a spike in traffic that correlates with the event. The resolution depends on the finding: it could be terminating a bad process, rolling back a recent deployment if it was a bad code push, or scaling the underlying database if it's a bottleneck. The key is to restore service first, then find and fix the root cause.
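As a rough illustration of that stabilize-then-investigate sequence, the CLI commands below use hypothetical resource names (prod-web-asg, the instance ID) and placeholder capacity and time values; the exact numbers depend on your environment.

# Tactical fix: raise the ASG's desired capacity to absorb load (placeholder name and value)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name prod-web-asg \
  --desired-capacity 8 \
  --honor-cooldown

# Pull the last hour of average CPU for the ASG from CloudWatch (placeholder times)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=prod-web-asg \
  --statistics Average --period 300 \
  --start-time 2024-06-02T01:00:00Z --end-time 2024-06-02T02:00:00Z

# Connect to an affected instance without SSH keys, then run top / vmstat inside the session
aws ssm start-session --target i-0123456789abcdef0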


2. The Cost Anomaly: A Sudden Spike in the AWS Bill

Scenario: Your finance team alerts you to a 300% spike in your AWS bill compared to last month. You are asked to lead the investigation. Where do you start, and how do you identify the culprit?

Answer:

I would start with AWS Cost Explorer, filtering the cost data by the current month and comparing it to the previous month. The most powerful first step is to "Group by" the Service dimension to see which AWS service is responsible for the majority of the increase. Often, it's one of the usual suspects: EC2, Data Transfer, or S3.

If EC2 is the culprit, I would check if there are forgotten instances running (e.g., old development environments), or if an Auto Scaling Group's scaling policy is too aggressive, leading to over-provisioning. If S3 costs are high, I would investigate whether there are millions of small objects (inflating API request costs) or whether lifecycle policies have failed to move infrequently accessed data to cheaper storage tiers like S3 Glacier. A critical check is for unintended data transfer, especially cross-region or out to the public internet. I would use VPC Flow Logs, queried with Athena, to find large data transfers. The goal is to pinpoint the specific resource, tag it correctly for future cost allocation, and set up AWS Budgets and Cost Anomaly Detection alerts so that future spikes trigger notifications automatically.
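As a sketch, a single Cost Explorer API call from the CLI can surface the top offenders; the dates below are placeholders for the two billing months being compared.

# Month-over-month cost grouped by service; the largest amounts point to the culprit
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-07-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -k2 -nr | head -20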


3. The Broken Pipeline: CodeDeploy Deployment Failing

Scenario: Your AWS CodePipeline, which uses CodeDeploy to deploy to an EC2 Auto Scaling Group, has started failing during the ApplicationStop lifecycle event. The error is vague: "Script at specified location: scripts/application_stop.sh failed with exit code 1." How do you debug this?

Answer:

This is a classic case where the devil is in the execution details. My first step is to navigate to the AWS CodeDeploy console, find the specific deployment ID that failed, and examine the execution logs. CodeDeploy provides detailed logs for each lifecycle event, which will show the exact output and error from the application_stop.sh script.

The issue is almost certainly within that script. Common problems include insufficient permissions for the user the script runs as, or for the instance profile if the script calls AWS APIs (e.g., to stop a Docker container); the script referencing a file path that doesn't exist on the new instances; or a syntax error in the shell script itself. I would use AWS Systems Manager to run the script (or parts of it) manually on one of the target instances, tracing execution with bash -x and observing the output. The fix would involve correcting the script, ensuring the instance's IAM role has the necessary permissions (e.g., ec2:DescribeInstances, or ecs:UpdateService if managing ECS), and testing in a staging environment before re-triggering the pipeline.
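A sketch of that debugging loop with the CLI, using hypothetical deployment and instance IDs; the deployment-root path shown is the agent's typical on-instance location for the deployed bundle and may differ in your setup.

# Pull the failed deployment and its per-instance lifecycle event details (placeholder IDs)
aws deploy get-deployment --deployment-id d-EXAMPLE123
aws deploy get-deployment-instance --deployment-id d-EXAMPLE123 --instance-id i-0123456789abcdef0

# Re-run the failing hook manually on a target instance via Systems Manager, with tracing
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=instanceids,Values=i-0123456789abcdef0" \
  --parameters 'commands=["bash -x /opt/codedeploy-agent/deployment-root/*/*/deployment-archive/scripts/application_stop.sh"]'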


4. The Secret Leak: AWS Credentials Found in a Public GitHub Repo

Scenario: A developer accidentally pushed a configuration file containing an IAM Access Key and Secret to a public GitHub repository. The keys have been exposed for 6 hours. What is your emergency response plan?

Answer:

This is a security incident that requires immediate and decisive action. Step one is to revoke the compromised credentials. I would immediately navigate to the IAM console, find the user associated with the access key, and deactivate the key. Even better, I would delete the key entirely. If the exposed credentials were temporary credentials issued by an IAM role, I would instead revoke the role's active sessions (IAM's "Revoke active sessions" option, which applies a deny policy to tokens issued before that point).

Step two is assessment. I would use AWS CloudTrail to investigate the key's usage during the 6-hour exposure window: which API calls were made with that key, from which IP addresses, and what actions were performed. This is critical to understand the blast radius. Did an attacker use it to spin up expensive EC2 instances, access sensitive S3 data, or delete resources? Based on the findings, I would then execute step three: remediation. This could involve rotating any secrets (e.g., database passwords) that the key had access to, terminating unauthorized resources, and notifying stakeholders. Finally, I would implement preventive measures such as storing credentials in AWS Secrets Manager and adding git-secrets pre-commit hooks so keys never reach the repository in the first place.
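A minimal sketch of steps one and two from the CLI, using AWS's documented example key ID and a hypothetical user name and exposure window.

# Step one: kill the exposed key immediately (placeholder user and key ID)
aws iam update-access-key --user-name ci-deployer --access-key-id AKIAIOSFODNN7EXAMPLE --status Inactive
aws iam delete-access-key --user-name ci-deployer --access-key-id AKIAIOSFODNN7EXAMPLE

# Step two: audit what the key did during the exposure window
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAIOSFODNN7EXAMPLE \
  --start-time 2024-06-01T20:00:00Z --end-time 2024-06-02T02:00:00Z \
  --max-items 50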


5. The State File Lock: Terraform Plan is Blocked

Scenario: Your Terraform CI/CD pipeline is failing because a terraform plan command is unable to acquire a lock on the state file, which is stored in an S3 backend with DynamoDB locking. The error states another process is already holding the lock. How do you resolve this without causing corruption?

Answer:

A state lock is a safety feature to prevent concurrent modifications that could corrupt the state. My first step is to investigate who or what is holding the lock. I would check the DynamoDB table used for state locking. The table item for the lock will contain a LockID and Info attribute, which often includes the machine or CI/CD job ID that acquired it.

I would then check the status of our CI/CD pipelines (e.g., Jenkins jobs or CodePipeline executions) to see if another pipeline run is stuck or still running. If I find a job that has clearly failed or been terminated without releasing the lock, I can proceed with force-unlocking. The command would be terraform force-unlock -force <LOCK_ID>. Crucially, I must be 100% certain that the process holding the lock is no longer active. Running this command while a Terraform apply is in progress will almost certainly lead to a corrupted state. If the lock is held by a teammate, I would coordinate with them first. The long-term fix is to ensure pipelines are designed to run sequentially or to use workspaces to isolate state.
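A minimal sketch, assuming a hypothetical lock table named terraform-state-lock; the <LOCK_ID> comes from the error message or the DynamoDB item itself.

# Inspect who holds the lock (placeholder DynamoDB table name)
aws dynamodb scan --table-name terraform-state-lock \
  --query 'Items[].{LockID: LockID.S, Info: Info.S}'

# Only after confirming the owning process is dead, release the lock
terraform force-unlock -force <LOCK_ID>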


6. The Canary in the Coal Mine: Blue/Green Deployment Gone Wrong

Scenario: You initiate a blue/green deployment using AWS CodeDeploy. The new instances (green environment) pass their health checks, and traffic is shifted. However, minutes later, you see a rise in 5xx errors and the new instances' CPU is high. How do you perform a rollback?

Answer:

The beauty of a well-executed blue/green deployment is the ease of rollback. Since the original "blue" environment is still running, my immediate action is to initiate a rollback in CodeDeploy. This will re-route traffic back to the healthy blue instances, minimizing user impact within minutes. This is a "stop-the-bleeding" maneuver.

Once traffic is stabilized, I can investigate the root cause of the green environment's failure. I would analyze the logs from the failed green instances in CloudWatch Logs, looking for application errors, dependency failures (e.g., a new API call to an external service that is failing), or misconfiguration. I would also check if the health checks defined in the deployment group were too simplistic (e.g., only checking if the web server is up, not if the application is functional). A more robust health check, perhaps a custom one that hits a /health endpoint verifying all internal dependencies, could have caught this before the full traffic shift. The goal is to fix the application or configuration issue in the code and then re-try the deployment.
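A sketch of triggering the rollback from the CLI, with a placeholder deployment ID; in practice the deployment group would usually also have automatic rollback on CloudWatch alarms configured.

# Stop the in-flight deployment and roll traffic back to the blue fleet
aws deploy stop-deployment --deployment-id d-EXAMPLE456 --auto-rollback-enabled

# Confirm the deployment status and rollback details
aws deploy get-deployment --deployment-id d-EXAMPLE456 \
  --query 'deploymentInfo.{status: status, rollback: rollbackInfo}'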


7. The Mystery of the Disappearing Logs

Scenario: Your application running on AWS Fargate is supposed to send logs to CloudWatch Logs. The logs were flowing correctly, but suddenly they have stopped appearing. The application itself seems healthy. How do you troubleshoot this?

Answer:

Since the application is healthy, the issue is likely with the logging configuration or the Fargate task itself. My first step is to verify the most common point of failure: the IAM role. The ECS Task Execution Role must have the logs:CreateLogStream and logs:PutLogEvents permissions. I would check the IAM policy attached to this role to ensure it hasn't been accidentally modified.

If the permissions are correct, I would then check the log configuration in the ECS task definition. Is the awslogs log driver specified, and is the awslogs-group name spelled correctly and present in the right Region? Next, I would check the CloudWatch Logs service quotas: have we hit the account-level limit on log groups or the data ingestion rate? If the Fargate task is failing immediately, it might never reach the logging stage at all. I would check the ECS service events in the console, which often provide clues like "ResourceInitializationError: unable to pull secrets" or "failed to configure logging," pointing to a different, but related, configuration issue in the task definition.
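A few CLI checks covering those points, with hypothetical cluster, service, task definition, and role names.

# Verify the awslogs configuration in the task definition (placeholder family name)
aws ecs describe-task-definition --task-definition orders-service \
  --query 'taskDefinition.containerDefinitions[].logConfiguration'

# Look at recent service events for logging or secret initialization errors
aws ecs describe-services --cluster prod --services orders-service \
  --query 'services[0].events[:10]'

# Confirm the task execution role still carries CloudWatch Logs permissions
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole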


8. The Slow Database Query in RDS

Scenario: Your application, which uses an Amazon RDS (PostgreSQL) instance, has become progressively slower. Monitoring shows high CPU and read latency on the database. How do you identify and address the performance bottleneck?

Answer:

High CPU and read latency typically point to inefficient queries or a lack of appropriate indexing. My first stop is Amazon RDS Performance Insights. This tool provides a visual dashboard of the database load, showing which SQL queries are consuming the most resources. I would look for queries with the highest "Average Active Sessions" or longest execution time.

Once I identify the problematic query, I would use the EXPLAIN ANALYZE command on it to see the query execution plan. This will often reveal a full table scan (Seq Scan) where an index should be used. The solution might be to create a new index on the columns involved in the WHERE or JOIN clauses. However, I must be cautious not to over-index, as indexes slow down write operations. Another possibility is that the instance is simply under-provisioned. If the workload has grown, it might be time to scale up the instance class or move to a read replica architecture to offload read traffic. Enabling the Slow Query Log in RDS can also help capture these queries automatically in the future.
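A short sketch of both steps, with placeholder connection details, query, and parameter group name; note that for PostgreSQL the "slow query log" is controlled by the log_min_duration_statement parameter.

# Inspect the execution plan of a suspect query (placeholder endpoint and query)
psql -h mydb.example.us-east-1.rds.amazonaws.com -U app -d shop \
  -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"

# Log every statement slower than 500 ms (PostgreSQL's equivalent of a slow query log)
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-postgres-params \
  --parameters "ParameterName=log_min_duration_statement,ParameterValue=500,ApplyMethod=immediate"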


9. The Security Group Labyrinth

Scenario: A new microservice (Service A) running in a private subnet cannot connect to another microservice (Service B) in a different private subnet. Both are ECS tasks. You've confirmed the tasks are running. How do you methodically diagnose the connectivity issue?

Answer:

Network connectivity issues in AWS almost always boil down to four things: Security Groups, Network ACLs, Route Tables, or the target application itself. I would start with the simplest and most common culprit: Security Groups.

First, I would identify the Security Group (SG) attached to Service B (the target). This SG must have an inbound rule that allows traffic on the specific port from the SG of Service A (the source), or from the CIDR of its subnet. A best practice is to reference Security Group IDs, not IP ranges. Next, I would check the Security Group for Service A; it needs an outbound rule allowing traffic to Service B's port. If SGs are correct, I would check the Network ACLs for both subnets to ensure they allow the ephemeral port range (1024-65535) for return traffic. Finally, I would use VPC Reachability Analyzer to create a network path from the ENI of Service A to the ENI of Service B, which can programmatically pinpoint where the connection is being blocked.
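A sketch of the Security Group check and a Reachability Analyzer run, with placeholder ENI and SG IDs and an assumed service port of 8080.

# Inspect Service B's inbound rules: is Service A's SG referenced on the right port?
aws ec2 describe-security-groups --group-ids sg-0bbbb2222bbbb2222 \
  --query 'SecurityGroups[0].IpPermissions'

# Let Reachability Analyzer pinpoint the blocking hop between the two task ENIs
aws ec2 create-network-insights-path \
  --source eni-0aaaa1111aaaa1111 \
  --destination eni-0bbbb2222bbbb2222 \
  --protocol tcp --destination-port 8080
aws ec2 start-network-insights-analysis --network-insights-path-id nip-0123456789abcdef0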


10. Designing for Disaster: The Multi-Region Strategy

Scenario: The business has mandated a Recovery Time Objective (RTO) of 15 minutes and a Recovery Point Objective (RPO) of 5 minutes for your primary e-commerce application. How would you architect a multi-region disaster recovery solution on AWS?

Answer:

Given the stringent 15-minute RTO and 5-minute RPO (maximum tolerable data loss), a backup-and-restore or "pilot light" model would be insufficient. I would recommend a multi-Region architecture with a fully scaled standby Region, fronted by Amazon Route 53 with a failover routing policy backed by health checks. The full application stack would be deployed and running in both Regions, with read and write traffic served by the primary Region under normal conditions.

To meet the RPO, data replication is key. For the RDS database, I would use a cross-Region read replica in the secondary Region and promote it to a standalone primary during failover. For static assets, Amazon S3 Cross-Region Replication keeps objects copied to the secondary Region. For dynamic session data, I would use Amazon DynamoDB Global Tables, whose asynchronous replication is typically sub-second and comfortably within the 5-minute RPO. The CI/CD pipeline would deploy to both Regions simultaneously so they never drift. The failover process would be automated: Route 53 health checks, supplemented by CloudWatch alarms triggering a Lambda-driven runbook (promoting the replica, verifying the DNS switch), would detect a primary-Region outage and shift all traffic to the secondary Region within the 15-minute RTO.
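A sketch of the Route 53 failover plumbing, with placeholder domain, hosted zone, and load balancer values; a matching SECONDARY record would point at the standby Region, and the alias HostedZoneId would be the ALB's canonical zone ID.

# Health check against the primary Region's /health endpoint (placeholder domain)
aws route53 create-health-check \
  --caller-reference "primary-$(date +%s)" \
  --health-check-config '{"Type":"HTTPS","FullyQualifiedDomainName":"primary.shop.example.com","ResourcePath":"/health","RequestInterval":10,"FailureThreshold":3}'

# PRIMARY failover alias record for the public hostname (placeholder zone and ALB values)
aws route53 change-resource-record-sets --hosted-zone-id ZEXAMPLE12345 \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"shop.example.com","Type":"A","SetIdentifier":"primary","Failover":"PRIMARY","AliasTarget":{"HostedZoneId":"ZALBZONEEXAMPLE","DNSName":"primary-alb-123.us-east-1.elb.amazonaws.com","EvaluateTargetHealth":true}}}]}'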


Conclusion

Mastering AWS DevOps is more than memorizing service names; it's about developing a systematic approach to troubleshooting, a security-first mindset, and a deep understanding of how cloud-native services interact under real-world pressures. These scenarios are designed to probe that understanding. For interviewers, listen for the candidate's methodical process. For candidates, focus on demonstrating your logical reasoning, prioritization skills, and knowledge of AWS best practices. The best DevOps engineers are the ones who can not only build elegant systems but also expertly navigate them when things go wrong.


What other challenging AWS DevOps scenarios have you faced in interviews or on the job? Share them in the comments below!
