Nexus Repository Setup Troubleshooting on EC2

Have you ever spent hours debugging something that looks fine on the surface but refuses to work no matter what you try? That was me with a Nexus Repository setup on EC2 — everything was “successful”… until it wasn’t.

I ran into a situation where Nexus would start and immediately shut down, Java errors kept pointing in different directions, and logs were either missing or misleading. For a while, it looked like:
- a Java issue
- a memory issue
- even a permissions problem

But none of that was the real problem.

What was actually going on?
The Nexus tarball was being extracted into /tmp, which in this environment is a RAM-backed filesystem with limited space. The archive was ~400MB, and during extraction it silently got cut off halfway. So what I ended up with was a partially installed, corrupted Nexus setup.

That’s why:
- Java failed instantly
- services kept stopping
- logs never properly generated
- everything felt “randomly broken”

The fix
Once I spotted the real issue, the solution was simple:
- Moved extraction from /tmp to /opt (actual disk storage)
- Cleaned up corrupted files and stale PID data
- Reinstalled Nexus properly

Boom 💥 — everything came up clean on first run.

Lesson learned
In DevOps, not every complex-looking failure is actually complex. Sometimes the real issue is just: “Wrong place. Wrong assumption. Wrong storage.”

Always validate:
- where your files are extracted
- available disk space (df -h)
- the integrity of downloaded archives
(a small sanity-check sketch follows at the end of this post)

Small oversights can look like big system failures. If you’ve ever chased a “ghost bug” that turned out to be something simple in disguise, you know the feeling.

This one really taught me to slow down and question the basics first before diving into “complex” assumptions. That /tmp detail cost me way more time than I’d like to admit, but it’s definitely sticking with me now.

#DevOps #Linux #AWS #Jenkins #Nexus #CloudComputing #SystemAdministration #CI_CD #Troubleshooting #DevOpsEngineer #LearningInPublic
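A minimal pre-install sanity check along those lines. This is a sketch, not the exact commands from the incident: the archive name and install path are illustrative, and it assumes the vendor publishes a .sha256 file next to the tarball.

#!/bin/bash
set -euo pipefail

ARCHIVE="nexus-unix.tar.gz"        # illustrative archive name
INSTALL_DIR="/opt/nexus"           # real disk, not the RAM-backed /tmp

# 1. Confirm free space on the target filesystem (the df -h habit, scripted)
avail_kb=$(df --output=avail -k /opt | tail -1)
if [ "$avail_kb" -lt $((2 * 1024 * 1024)) ]; then   # require ~2 GB headroom
    echo "Not enough disk space on /opt" >&2
    exit 1
fi

# 2. Verify the download wasn't truncated or corrupted before extracting
sha256sum -c "${ARCHIVE}.sha256"

# 3. Extract straight to real disk storage
sudo mkdir -p "$INSTALL_DIR"
sudo tar -xzf "$ARCHIVE" -C "$INSTALL_DIR"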
More Relevant Posts
-
Real Microservices Lesson: When One Service Goes Down, Everything Can Crash

While working with microservices, I encountered an issue that many systems face but often only realize after it breaks things.

Let’s say there are two microservices: MS1 and MS2. MS1 depends on MS2 for fetching some data. Everything works fine… until MS2 goes down. Now here’s where things get interesting 👇

Even after MS2 was stopped, MS1 kept sending requests to it. Those requests kept waiting for a response that would never come.

💥 Problem: The application’s thread pool started getting exhausted because threads were stuck waiting on a non-responsive service.
📉 Impact: Eventually, the entire application crashed with multiple 500 Internal Server Errors.

🛠️ Solution: Circuit Breaker
To fix this, I implemented a Circuit Breaker pattern. Think of it as a safety switch for microservices:
-> When a dependent service fails repeatedly, the circuit breaker trips (opens).
-> It stops further calls to that failing service.
-> Instead of waiting, the system returns a fallback response.
-> This gives the failing service time to recover and keeps the failure from crashing the whole system.

⚡ Why it matters:
- Prevents cascading failures
- Avoids thread exhaustion
- Enables graceful degradation
- Improves overall system resilience

💡 Key takeaway: In distributed systems, failure is inevitable. What matters is how gracefully your system handles it.

👉 In the next post, I’ll explain the states of a circuit breaker and share a code example where I’ve applied it. (A minimal shell sketch of the idea follows below.)

#Microservices #Java #SpringBoot #SystemDesign #BackendDevelopment #Resilience #CircuitBreaker
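Ahead of that follow-up, here is a minimal sketch of the state machine in plain shell, just to make the idea concrete. The endpoint, thresholds, and state-file path are hypothetical, and in a Spring Boot service you would typically reach for a library such as Resilience4j instead of hand-rolling this.

#!/bin/bash
# Tiny circuit breaker around an HTTP call: count consecutive failures,
# fail fast while "open", serve a fallback instead of hanging a caller.
FAILURE_THRESHOLD=3                      # failures before the breaker opens
OPEN_SECONDS=30                          # cool-down before trying MS2 again
STATE_FILE="${HOME}/.ms2-breaker"        # hypothetical state location
MS2_URL="http://ms2.internal/api/data"   # hypothetical dependency endpoint

read -r failures opened_at < "$STATE_FILE" 2>/dev/null || { failures=0; opened_at=0; }
now=$(date +%s)

# Open state: skip the call entirely until the cool-down elapses
if [ "$failures" -ge "$FAILURE_THRESHOLD" ] && [ $((now - opened_at)) -lt "$OPEN_SECONDS" ]; then
    echo '{"source":"fallback","data":null}'
    exit 0
fi

if curl --silent --fail --max-time 2 "$MS2_URL"; then
    echo "0 0" > "$STATE_FILE"           # success closes the breaker
else
    failures=$((failures + 1))
    [ "$failures" -ge "$FAILURE_THRESHOLD" ] && opened_at=$now
    echo "$failures $opened_at" > "$STATE_FILE"
    echo '{"source":"fallback","data":null}'   # fallback instead of waiting
fi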
-
🐧 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧 𝐈𝐬𝐬𝐮𝐞 𝐚𝐭 𝟐:𝟏𝟑 𝐀𝐌… 𝐅𝐢𝐱𝐞𝐝 𝐰𝐢𝐭𝐡 𝟏 𝐁𝐚𝐬𝐡 𝐒𝐜𝐫𝐢𝐩𝐭

2:13 AM - PagerDuty alert. CPU spiking. Disk almost full. APIs timing out. Not the kind of notification you want in the middle of the night.

🔹 I logged into the server and saw this 👇
- Disk usage: 95%
- /var/log was flooded with logs
- One service was generating GBs of logs every hour

Classic issue. But production was already impacted.

🔹 Quick options:
- Restart service? (temporary fix ❌)
- Scale infra? (slow + costly ❌)
- Fix root cause? (takes time ❌)
👉 Needed an immediate workaround.

🔹 I wrote a quick Bash script:

#!/bin/bash
LOG_DIR="/var/log/myapp"
MAX_SIZE=100M

find "$LOG_DIR" -type f -name "*.log" -size +"$MAX_SIZE" | while read -r file; do
    echo "Truncating $file"
    truncate -s 0 "$file"
done

🔹 Ran it, and within seconds:
- Disk usage dropped from 95% → 40% ✔
- API latency normalized ✔
- Alerts stopped ✔
Production stabilized.

🔹 Next steps (after fire-fighting):
- Fixed log rotation properly (sketch below)
- Added monitoring alerts
- Set up limits to prevent recurrence

🔹 Lesson learned: In DevOps, tools matter… but problem-solving under pressure matters more.

🔹 Sometimes:
👉 A simple Bash script > a complex architecture discussion

#DevOps #ProductionIssue #Bash #SRE #OnCall #Cloud #IncidentManagement #Linux #AWS
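For the "fixed log rotation properly" part, the durable fix is usually a logrotate rule rather than a cron'd truncate. A sketch with illustrative path, sizes, and retention, dropped into /etc/logrotate.d/myapp:

/var/log/myapp/*.log {
    size 100M          # rotate once a file passes 100 MB
    rotate 5           # keep at most five rotated copies
    compress
    missingok
    notifempty
    copytruncate       # truncate in place so the service keeps its open file handle
}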
-
🚀 New blog post: Writing a Kubernetes CSI Driver from Scratch
👉 https://lnkd.in/d7xGEK5h

This is the first in a series on the essential Kubernetes integrations that make a cluster truly cloud-native. Today: storage.

Writing a CSI driver is one of those things that looks intimidating from the outside but turns out to be genuinely fun once you understand the moving parts. After building one end-to-end in production, I wrote down everything I wished I'd had in one place.

The article covers:
- How CSI actually works — gRPC services, sidecar containers, Unix sockets
- The full volume lifecycle: provision → attach → stage → mount (and the reverse)
- Implementing IdentityService, ControllerService, and NodeService in Go
- Snapshots, volume expansion, node registration
- The pitfalls: idempotency, async polling, xfs UUID collisions, mount propagation

P.S. Let me know if the site and diagrams are not pleasant to look at. I'm still experimenting with getting Mermaid diagrams rendered right via Hugo.

#Kubernetes #Go #CloudNative #Storage #CSI #Platform #DevOps
-
I recently got to know about a useful method to access a node’s host filesystem in Kubernetes — kubectl debug.

I am currently troubleshooting a Redpanda deployment where one of the pods is stuck in the Pending state due to a storage provisioning issue. In our setup, Persistent Volumes (PVs) are dynamically provisioned using an LVM-based local storage provisioner. Since the pod is waiting for its Persistent Volume Claim (PVC) to be fulfilled, part of the investigation involves verifying disk usage and checking whether the expected directories and storage resources are present on the node.

The challenge: SSH access was not possible. No SSH keys were created during node provisioning, inbound SSH is not allowed in the Security Group / firewall, and the nodes run in private subnets without public access. Other node access methods exist, but they involve multiple manual steps and approvals, and some are restricted at the organization level.

This is where kubectl debug node helped:

kubectl debug node/<node-name> -it --image=ubuntu -- bash

This command deploys a temporary debug pod directly onto the node. The container runs in the host namespaces, and the node’s filesystem is mounted inside the container at /host, allowing inspection of disk usage and directories without requiring SSH access.

Some of the checks performed during troubleshooting:

df -h /host
du -h /host
lsblk
ls /host/<expected-directory>

One practical detail I noticed: not all system tools are available by default inside the debug container. For example, LVM utilities such as pvs, vgs, and lvs may not be present unless the required packages are installed. So the available commands depend on the image used for debugging. (A small sketch of two workarounds follows below.)

The investigation is still ongoing, but this method has already been very useful for validating node-level storage configuration in a restricted environment. Still learning, still debugging.

#Kubernetes #DevOps #SRE #Storage #PlatformEngineering #K8s #AWS #EKS
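For the missing-tools detail, two options, sketched under assumptions: the node can reach a package mirror, and the debug pod has enough privilege to query block devices (newer kubectl releases can grant this via --profile=sysadmin).

# Inside the Ubuntu debug container started above:
apt-get update && apt-get install -y lvm2    # provides pvs, vgs, lvs
pvs && vgs && lvs

# Alternative: chroot into the host mount and reuse the node's own tooling
chroot /host vgs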
-
Don't use a single .env file for all services in production.

Using one environment file for both stateless applications and stateful databases in docker-compose creates unnecessary risks and configuration drift.

Unintended Restarts
Docker Compose tracks the state of the env_file. If you modify a variable intended only for the backend, Compose detects a configuration change for every service referencing that file. This triggers a recreation of your database containers even when no database changes were made.

The Risk: Authentication Mismatch
For services like PostgreSQL, environment variables like POSTGRES_PASSWORD are typically used only during the initial volume initialization. If a shared .env is updated with a new password, the container restarts with the new variable. The actual database (persisted in the volume) continues to use the old password. This results in an immediate authentication failure between the application and the database.

The Solution: Configuration Isolation
Each service should only have access to the variables it strictly requires (a minimal compose sketch follows below):

.env.backend
.env.db

#backend #devops #docker #infrastructure
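A minimal compose sketch of that isolation, with illustrative service names and images: each service loads only its own env file, so editing backend settings no longer recreates the database container.

services:
  backend:
    image: myorg/backend:latest   # hypothetical image
    env_file:
      - .env.backend              # app-only settings: API keys, flags, DB URL
  db:
    image: postgres:16
    env_file:
      - .env.db                   # POSTGRES_PASSWORD etc., changed only deliberately
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata: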
-
🔮 Production deployment of a #springboot microservices app on a k8s #HA cluster

The guide now includes:
- Hybrid PostgreSQL HA: Step-by-step instructions for deploying the PostgreSQL primary on a dedicated standalone VM for maximum isolation and performance, while managing replicas within the Kubernetes cluster using the CloudNativePG operator.
- Secure Connectivity: Configuration for replication between the VM and the cluster, including `pg_hba.conf` settings and Kubernetes `ExternalName` services for seamless application access (a small illustrative manifest follows below).
- Harbor Registry: Standalone installation on a dedicated VM for private image management.
- HashiCorp Vault: Centralized secret management for banking-grade security.
- Loki/Prometheus/Grafana: Full observability stack for metrics and logs.
- GitOps CI/CD with GitHub Actions and ArgoCD.

This comprehensive PDF now provides a complete, production-ready blueprint using an example of banking microservices, balancing the flexibility of Kubernetes with the stability of standalone database hosting. I’ve also re-attached the architectural diagrams for reference.

#java #springboot #k8s #docker #monitoring
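A hedged illustration of the ExternalName piece (the names and the VM's DNS address are hypothetical, not taken from the PDF): in-cluster applications keep a stable in-cluster service name while the primary actually lives on the standalone VM.

apiVersion: v1
kind: Service
metadata:
  name: postgres-primary
  namespace: banking
spec:
  type: ExternalName
  externalName: pg-primary.internal.example.com   # the standalone VM's DNS name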
-
Bus factor belongs in DevSecOps because software supply chain risk starts before deployment. This matters because CI/CD pipelines can move fragile dependencies into production faster than teams can assess them.

The article closes on a practical theme: dependency risk is not only about licenses or vulnerability counts, but also the sustainability of the people behind the code. It points to auditing signals such as organizational diversity, labor investment, and documentation quality. Those are useful controls for operators because they help distinguish a widely used package from a resiliently maintained one.

For Linux and platform teams, this affects build pipelines, base images, internal package mirrors, and release approvals. If a critical upstream dependency is maintained by too few people, weakly documented, or overly concentrated in one employer, the downstream effect can be slower fixes, weaker review, and harder incident response when something changes unexpectedly. That is especially relevant for packages embedded in containerized workloads and automated deployment paths. Many CI pipelines verify whether dependencies are current, but not whether the projects behind them are operationally durable.

For Linux administrators and infrastructure teams, in practical terms it is a good time to review:
- whether dependency gates include criticality and maintainer-health signals
- SBOM coverage for build-time and runtime packages
- provenance and signing checks for promoted artifacts
- exception handling for unmaintained or thinly maintained dependencies
- image rebuild frequency for inherited package updates
- escalation paths when a core open-source component shows governance or maintainer stress

Article: https://lnkd.in/g5i2HX9u

#DevSecOps #LinuxSecurity #OpenSourceSecurity #SupplyChainSecurity
-
2,073 days.

That's the uptime on a t2.medium I found while auditing load balancers for a FinOps sweep. Five years, eight months, never rebooted. Running a Java app on JRE 1.8.0_25 — released October 2014, Obama's second term.

I went looking because of an ALB with 150 requests in three months. Background noise. But I wanted to know what was making it. One of its target groups had exactly one registered instance. Unhealthy. Failing health checks. Nobody listening.

CPU was at 27% average. Network pushing 500 MB a month in both directions. This wasn't idle infrastructure — something was burning cycles.

I SSHed in and ran ss. Hundreds of concurrent connections on port 9000. Scrapers, bots, scanners hammering the TensorFlow inference endpoint directly via the public IP, bypassing the load balancer entirely. The ALB's own health checks? Failing. The thing the load balancer was trying to help was getting served by strangers instead.

A Ruby 2.0 CodeDeploy agent was running alongside the Java app, polling every few seconds for a deployment that would never come. It had been waiting since 2021.

$142 a month. $1,700 a year. Serving inference results to internet scrapers on a Java runtime from the Obama administration.

The tools optimize what runs. They don't tell you what doesn't. A load balancer with zero requests is free signal: follow it to the instance, follow the instance to the process, follow the process to the connections (the command path is sketched below). The archaeology isn't in the dashboard. It's in ss and ps aux and the 2,073-day uptime counter that nobody ever looked at.

Link in first comment.

#platformengineering #finops
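The command path, reconstructed as a sketch. Port 9000 is from the story above; the flags are standard iproute2/procps:

uptime                                                  # the 2,073-day counter nobody looked at
ss -tnp '( sport = :9000 )'                             # who is connected, and which process owns the socket
ss -tn state established '( sport = :9000 )' | wc -l    # how many concurrent connections
ps aux --sort=-%cpu | head -15                          # what is actually burning the 27% CPU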
-
🚀 Backend Learning | Load Balancing for Scalable Systems

While working on backend systems, I recently explored how traffic is distributed across multiple servers using load balancing.

🔹 The Problem:
• Single server getting overloaded under high traffic
• Increased latency and system downtime
• Need for high availability and scalability

🔹 What I Learned:
• A load balancer distributes incoming requests across multiple servers
• It improves performance and ensures system reliability

🔹 Common Strategies:
• Round Robin: Requests distributed sequentially
• Least Connections: Sends traffic to the server with the fewest active connections
(example configs follow below)

🔹 Key Insights:
• Round Robin works well for equal-capacity servers
• Least Connections is better for uneven loads
• Both help achieve high availability and fault tolerance

🔹 Outcome:
• Better traffic distribution
• Reduced server overload
• Improved system scalability

Scalable systems are not built on a single server — they are built on smart traffic distribution. 🚀

#Java #SpringBoot #SystemDesign #BackendDevelopment #LoadBalancing #Microservices #LearningInPublic
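As a concrete illustration (not from the original post), the two strategies map directly onto an nginx upstream block; the server names are hypothetical:

upstream app_round_robin {
    server app1.internal:8080;   # Round Robin is nginx's default:
    server app2.internal:8080;   # requests rotate through the servers in turn
}

upstream app_least_conn {
    least_conn;                  # route each request to the server with the
    server app1.internal:8080;   # fewest active connections right now
    server app2.internal:8080;
}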
-
This week, I completely refactored the infrastructure architecture of my home lab and completed a massive DevSecOps migration.

I recently transitioned my full-stack environment (Nebula Forge) off a heavy, monolithic Ubuntu VM—which had been natively hosting monitoring tools like Grafana and Prometheus alongside my applications—and re-engineered the entire pipeline to run on a highly optimized Proxmox LXC container acting as a centralized Docker host.

Moving from traditional package installations to an isolated, containerized microservice architecture brought several massive advantages to the environment:

📉 𝐑𝐞𝐬𝐨𝐮𝐫𝐜𝐞 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧: Swapping a thick Ubuntu VM for a minimalistic Debian LXC eliminated the resource contention between the hypervisor and the VM. The compute and memory footprint has been drastically reduced, freeing up valuable hardware resources for future scaling.

🔒 𝐙𝐞𝐫𝐨-𝐓𝐫𝐮𝐬𝐭 𝐒𝐞𝐜𝐮𝐫𝐢𝐭𝐲 & 𝐍𝐞𝐭𝐰𝐨𝐫𝐤 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧: By utilizing Docker networks and redirecting Cloudflare Zero Trust Tunnels, I completely bypassed traditional pfSense NAT port forwarding. The internal applications are deeply segmented, and the public perimeter is locked down.

🧩 𝐂𝐞𝐧𝐭𝐫𝐚𝐥𝐢𝐳𝐞𝐝 𝐎𝐫𝐜𝐡𝐞𝐬𝐭𝐫𝐚𝐭𝐢𝐨𝐧: Managing a multi-database environment (MySQL and MongoDB), a Spring Boot backend, a Go API Gateway, and high-availability frontends is now centralized through Portainer, providing distinct container isolation without the overhead.

💾 𝐒𝐭𝐫𝐞𝐚𝐦𝐥𝐢𝐧𝐞𝐝 𝐃𝐢𝐬𝐚𝐬𝐭𝐞𝐫 𝐑𝐞𝐜𝐨𝐯𝐞𝐫𝐲: The old VM setup was a massive data hog. Containerizing the apps and mapping persistent volumes allows for highly efficient snapshotting and makes adhering to strict 3-2-1 backup procedures significantly easier and faster.

During the migration, I also successfully untangled hardcoded port conflicts, implemented a "cold standby" high-availability frontend, and navigated live database credential rotations via CLI to bring the Spring Boot environment fully online with zero data loss.

There is nothing quite like the satisfaction of watching a complex transaction flow securely from the public internet, through a Cloudflare tunnel, into a containerized Java backend, and commit perfectly across both relational and NoSQL databases.

On to the next challenge!

#DevSecOps #DevOps #PlatformEngineering #CloudSecurity #SRE #SiteReliabilityEngineering #Proxmox #Docker #Cloudflare #InfrastructureAsCode #CyberSecurity #CI_CD #TechHomeLab