2,073 days. That's the uptime on a t2.medium I found while auditing load balancers for a finops sweep. Five years, eight months, never rebooted. Running a Java app on JRE 1.8.0_25 — released October 2014, Obama's second term.

I went looking because of an ALB with 150 requests in three months. Background noise. But I wanted to know what was making it. One of its target groups had exactly one registered instance. Unhealthy. Failing health checks. Nobody listening.

CPU was at 27% average. Network pushing 500 MB a month in both directions. This wasn't idle infrastructure — something was burning cycles.

I SSHed in and ran ss. Hundreds of concurrent connections on port 9000. Scrapers, bots, scanners hammering the TensorFlow inference endpoint directly via the public IP, bypassing the load balancer entirely. The ALB's own health checks? Failing. The thing the load balancer was trying to help was getting served by strangers instead.

A Ruby 2.0 CodeDeploy agent was running alongside the Java app, polling every few seconds for a deployment that would never come. It had been waiting since 2021.

$142 a month. $1,700 a year. Serving inference results to internet scrapers on a Java runtime from the Obama administration.

The tools optimize what runs. They don't tell you what doesn't. A load balancer with zero requests is free signal — follow it to the instance, follow the instance to the process, follow the process to the connections. The archaeology isn't in the dashboard. It's in ss and ps aux and the 2,073-day uptime counter that nobody ever looked at.

Link in first comment. #platformengineering #finops
2,073 Days of Unused Java App Uptime
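The "zero-request load balancer" sweep described above is easy to automate. Below is a minimal sketch, assuming the AWS SDK for Java v2 and a placeholder load balancer dimension value: it sums 90 days of RequestCount for one ALB, and near-zero totals are the candidates worth following down to the instance.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Datapoint;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

class IdleAlbSweep {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            GetMetricStatisticsRequest req = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/ApplicationELB")
                    .metricName("RequestCount")
                    .dimensions(Dimension.builder()
                            .name("LoadBalancer")
                            .value("app/my-alb/0123456789abcdef") // placeholder dimension value
                            .build())
                    .startTime(Instant.now().minus(90, ChronoUnit.DAYS))
                    .endTime(Instant.now())
                    .period(86_400)            // one datapoint per day
                    .statistics(Statistic.SUM)
                    .build();

            double total = cw.getMetricStatistics(req).datapoints().stream()
                    .mapToDouble(Datapoint::sum)
                    .sum();
            // Near-zero totals are the free signal: follow them to the target
            // groups, the instances, and finally the processes
            System.out.printf("Requests in last 90 days: %.0f%n", total);
        }
    }
}
```

In practice you would list all load balancers first (e.g. via the ElasticLoadBalancingV2 client) and loop this check over each one.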
More Relevant Posts
Not every backend problem should be solved at the framework level. Senior engineers know when to stop hacking the framework and move the problem to where it belongs.

5 common mistakes:

1. Scheduled jobs on every pod
→ @Scheduled + 5 replicas = the job runs 5 times. Duplicate emails. Duplicate charges.
→ Fix: ShedLock for coordination (sketch below), OR move scheduling to a Kubernetes CronJob.

2. Large file uploads in Spring Boot
→ @RequestParam MultipartFile for 10GB files. Pods OOMKilled.
→ Fix: Pre-signed URLs. The client uploads directly to S3/GCS/MinIO. Your server only authorizes.

3. Search using SQL LIKE
→ WHERE name LIKE '%phone%'. Works in dev. 10M rows in prod = 5-second queries. The DB falls over.
→ Fix: Elasticsearch, OpenSearch, or Postgres full-text search. Transactional DBs weren't built for search.

4. Background jobs on REST APIs
→ A controller processes a 30-second report synchronously. Timeouts. Duplicate retries. Thread pool exhausted.
→ Fix: Publish to Kafka/RabbitMQ/SQS. Return 202 Accepted. A worker handles it async.

5. Rate limiting in controllers
→ A HashMap counter on 5 pods = the user gets 5x the rate limit. The hacker wins.
→ Fix: Move it to the API gateway. Kong, Nginx, AWS API Gateway. Enforced before the request reaches your app.

The lesson: When you're hacking the framework to solve a problem — pause. Ask: is this the right LEVEL to solve this? Your framework is for business logic, not for solving every infrastructure problem inside your application code.

What's one career lesson you've learned the hard way? Drop it in the comments 👇

#backend
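For mistake #1, here is what the ShedLock fix can look like: a minimal sketch assuming ShedLock 4+ with the JDBC provider and an existing shedlock table; names like "nightly-billing" are illustrative.

```java
import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import javax.sql.DataSource;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "PT10M") // safety net if a pod dies mid-run
class SchedulingConfig {
    @Bean
    LockProvider lockProvider(DataSource dataSource) {
        // All 5 replicas coordinate through one row in the shared database
        return new JdbcTemplateLockProvider(new JdbcTemplate(dataSource));
    }
}

@Component
class BillingJobs {
    @Scheduled(cron = "0 0 2 * * *")
    @SchedulerLock(name = "nightly-billing", lockAtMostFor = "PT30M", lockAtLeastFor = "PT1M")
    void runNightlyBilling() {
        // Only the replica that wins the lock executes; the other 4 skip this run,
        // so no duplicate emails and no duplicate charges
    }
}
```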
One thing I’ve learned the hard way: “If an API works fast locally, it means nothing.”

I worked on an API that looked perfect in testing:
• <100ms response time
• Clean implementation
• No visible issues

But under real traffic, latency started spiking:
• 100ms → 800ms → 2s+
• Occasional timeouts
• Downstream impact

No errors. No crashes. Just slow degradation. That’s where most people get stuck.

Breaking it down:
• Logs looked clean
• JVM and CPU were stable
• DB started showing increased load

Digging deeper:
• Found repeated DB calls for the same data (N+1 pattern)
• No effective caching for high-frequency requests

The fix wasn’t scaling infra. It was fixing the design:
• Eliminated redundant DB calls
• Added indexes on frequently queried columns
• Introduced Redis caching with a controlled TTL (sketch below)
• Avoided caching user-specific data to prevent stale responses

Result:
• Latency dropped from ~2s to <200ms under load
• DB load reduced significantly
• The system handled higher traffic without scaling aggressively

Reality: Performance problems don’t show up in code reviews. They show up when your system is under pressure. If you’re not testing for that, you’re not building production-ready systems.

#Java #SpringBoot #Performance #Microservices #BackendEngineering #SystemDesign
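A minimal sketch of the caching pattern described above, assuming Spring's cache abstraction backed by Redis. CatalogService, Product, and the repository are hypothetical stand-ins, not from the original system.

```java
import java.time.Duration;
import java.util.Optional;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.stereotype.Service;

record Product(long id, String name) {}

interface ProductRepository {
    Optional<Product> findById(long id);
}

@Configuration
@EnableCaching
class CacheConfig {
    @Bean
    RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        // Controlled TTL: entries expire after 5 minutes, bounding staleness
        return RedisCacheManager.builder(factory)
                .cacheDefaults(RedisCacheConfiguration.defaultCacheConfig()
                        .entryTtl(Duration.ofMinutes(5)))
                .build();
    }
}

@Service
class CatalogService {
    private final ProductRepository repository;

    CatalogService(ProductRepository repository) { this.repository = repository; }

    // Shared, high-frequency data only; user-specific data stays uncached
    @Cacheable(cacheNames = "products", key = "#productId")
    public Product findProduct(long productId) {
        // Runs only on a cache miss; repeated reads are served from Redis
        return repository.findById(productId).orElseThrow();
    }
}
```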
We doubled the pod memory limit “just to be safe.” RSS went up anyway—and the service looked “healthier.” Was that a win, or a mirage?

There’s a quiet myth in performance engineering:
👉 “Raise Kubernetes memory limits, apps will only use what they need.”
Reality? Not always. 😊

What actually happens:
• Heap scales with the limit: without a pinned -Xmx / MaxRAMPercentage, higher limits → larger heap → more retention, less GC → higher steady-state memory (probe below).
• Wrong success metric: low RSS isn’t “efficient.” It can mean GC churn. Higher RSS may improve p99, not necessarily cost.
• OOM doesn’t vanish: bigger limits delay failure and shift it elsewhere (GC pressure, native memory, fragmentation).
• Limits ≠ JVM truth: Kubernetes limits can become implicit JVM tuning via cgroup memory awareness.

Takeaway 😉: Treat memory like capacity planning, not vibes. If you change limits, re-baseline heap vs native, GC time/pauses, allocation rate, and p99 latency—not just “GB used.”

#Kubernetes #CloudNative #DevOps #SRE #performance #JVM #Java #GarbageCollection #PerformanceEngineering #Observability #Performance #Scalability #Latency #Throughput #Memory #Heap
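A tiny probe (not from the post) that makes the first bullet observable: on a container-aware JVM, the default max heap is derived from the cgroup memory limit (roughly 25% of it unless -Xmx or -XX:MaxRAMPercentage pins it), so running this before and after a limit change shows the heap scaling.

```java
// Run inside the pod, e.g. `java HeapProbe`, across different memory limits.
public class HeapProbe {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Without an explicit -Xmx / -XX:MaxRAMPercentage, this number moves
        // with the container limit: the "heap scales with limit" effect
        System.out.printf("Max heap: %.1f MiB%n", maxHeap / 1048576.0);
    }
}
```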
Lately, I’ve been revisiting some HOT backend/design topics. The focus has been on real-world, practical areas like:
• Implementing an LRU cache (sketch below)
• Designing resilient microservices (circuit breakers, fallbacks)
• Understanding JWT architecture and logout in stateless systems
• Implementing idempotency to prevent duplicate transactions
• Core Spring Boot concepts like @Cacheable, @Async, and exception handling
• Practical scenarios like service-to-service communication, caching pitfalls, and Spring’s proxy-based behavior
• Optimization (indexing, query tuning, the N+1 problem)
• Concurrency & multithreading
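Since implementing an LRU cache tops the list, here is one classic Java sketch: LinkedHashMap in access order plus an eviction hook. It is an interview-friendly baseline, not the only approach (a hand-rolled hash map plus doubly linked list is the usual follow-up).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // accessOrder=true moves entries to the tail on get(), so the head
        // is always the least recently used entry
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once we exceed capacity
        return size() > capacity;
    }
}
```

Usage: `new LruCache<String, String>(2)` holding "a" and "b" will evict "a" after a `get("b")` followed by a `put("c", ...)`, because "a" became the least recently used entry.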
Why Your JVM Heap Still Crashes Under Load 📉

Symptoms ⚠️
* Latency jumps from milliseconds to seconds
* Frequent Full GC pauses
* System becomes unstable during peak

How to resolve 🛠️
* Check GC logs.
* Check the allocation rate. In a Spring Boot app, massive short-lived object creation can push the allocation rate above the garbage collection rate. G1 struggles, the old gen fills up quickly, and this eventually leads to frequent Full GC (stop-the-world). Optimize JSON parsing (stream instead of full load; sketch below) 📄.
* Avoid an undersized heap: set -Xms and -Xmx to the same value. A fixed heap avoids resizing overhead.
* Tune JVM parameters like MaxGCPauseMillis, InitiatingHeapOccupancyPercent, G1HeapRegionSize, UseStringDeduplication, ParallelGCThreads, and ConcGCThreads per your application's non-functional requirements.

Sometimes JVM tuning is not enough. In those cases, you can work on vertical/horizontal scaling and system-design-level changes like adding queues (or AWS SQS) for buffering and caching with Memcached (or AWS ElastiCache).

Found this helpful? Follow me for more insights on Software Design, building scalable E-commerce applications, and mastering AWS. Let’s build better systems together! 🚀

#Java #SystemDesign #JVM #SpringBoot #Microservices #Performance
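On the "stream instead of full load" point: a minimal sketch using Jackson's streaming API, so a large payload never materializes as one object graph on the heap. The field name "id" and the handler method are illustrative.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.io.InputStream;

class StreamingReader {
    void process(InputStream in) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(in)) {
            // Walk tokens one at a time instead of binding the whole payload;
            // only a small, short-lived window of data is live at once
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.FIELD_NAME
                        && "id".equals(parser.getCurrentName())) {
                    parser.nextToken();
                    handleId(parser.getLongValue());
                }
            }
        }
    }

    void handleId(long id) {
        // Short-lived work per record; no giant object graph is retained,
        // which keeps the allocation rate survivable for G1
    }
}
```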
⚠️ Your system is “highly available”… until one tiny dependency isn’t. And suddenly — everything is down.

🔍 The high availability illusion

Teams design for:
✔️ Multi-zone deployment
✔️ Load balancing
✔️ Auto-scaling
✔️ Redundant services
And proudly say: “We are highly available.”

But they forget:
❌ Single database cluster
❌ One cache layer
❌ One message broker
❌ One third-party API
❌ One DNS dependency

Your system is only as available as its weakest dependency.

💥 Real production scenario

A core service deployed across regions. Looked resilient. But it depended on a single Redis cluster and one payment API. Redis slowed down. Result: cache misses increased, DB load spiked, latency exploded, requests failed. A multi-region system. A single point of failure.

🧠 How senior engineers design availability

They map dependencies explicitly.
✔️ Identify all critical components
✔️ Remove single points of failure
✔️ Add fallback strategies (sketch below)
✔️ Use graceful degradation
✔️ Design for partial availability

They don’t ask: “Is my service highly available?” They ask: “What can take my system down?”

🔑 Core lesson

High availability is not a feature. It’s an end-to-end property. If one dependency fails and your system collapses — you were never highly available.

Subscribe to Satyverse for practical backend engineering 🚀
👉 https://lnkd.in/dizF7mmh
If you want to learn backend development through real-world project implementations, follow me or DM me — I’ll personally guide you. 🚀
📘 https://satyamparmar.blog
🎯 https://lnkd.in/dgza_NMQ

#BackendEngineering #HighAvailability #SystemDesign #DistributedSystems #Microservices #Java #Scalability #ReliabilityEngineering #Satyverse
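One way to implement the "fallback strategies" and "graceful degradation" items above: a minimal sketch assuming Resilience4j, with hypothetical pricing methods standing in for the payment API call.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.math.BigDecimal;
import java.util.function.Supplier;

class PaymentPricing {
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("payment-api");

    BigDecimal quote() {
        // Wrap the fragile dependency; the breaker opens after repeated failures
        Supplier<BigDecimal> decorated =
                CircuitBreaker.decorateSupplier(breaker, this::callPaymentApi);
        try {
            return decorated.get();
        } catch (CallNotPermittedException e) {
            // Breaker is open: degrade gracefully instead of cascading the outage
            return lastKnownGoodPrice();
        }
    }

    BigDecimal callPaymentApi() {
        // Remote call to the single payment API would go here
        return BigDecimal.TEN;
    }

    BigDecimal lastKnownGoodPrice() {
        // Partial availability: a stale-but-sane answer beats a hard failure
        return BigDecimal.ONE;
    }
}
```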
Had one of those “everything looks fine… but it’s not” production moments recently. An API that usually responds in ~120ms suddenly started taking 2–3 seconds. No errors. No crashes. Just… slow.

At first glance, nothing obvious: CPU was okay, memory wasn’t maxed out, the service was up. But digging deeper turned into a good reminder of how real-world slowness actually happens 👇

Started with threads. The Tomcat thread pool was almost full. Not completely exhausted, but close enough that new requests were waiting. So the service wasn’t doing more work — it was just taking longer to start doing the work.

Then the DB. One query that used to take ~20ms was now taking ~150ms. Why? Data had grown. The index wasn’t helping anymore the way we expected. And of course… there was a hidden N+1 query in one flow. Didn’t matter in testing. Hurt in production.

Then downstream calls. This API was calling 2 other services. Individually fast (~50–80ms), but together they added up. And when one of them slowed slightly, everything stacked. No timeout issues. Just latency compounding quietly.

The interesting part? None of these were “major bugs”. It was:
– a slightly slower DB
– slightly busy threads
– a slightly delayed downstream service
All happening together.

And that’s when it hits you: we don’t usually design systems to fail — we design them assuming things will stay fast. But in reality, systems degrade, not break.

What helped: stopped guessing. Looked at:
– thread metrics
– DB query timings
– per-service latency
Fixed the biggest contributor first (the DB query + fetch strategy; sketch below), and suddenly everything else started looking normal again.

Big takeaway for me: performance issues in microservices are rarely dramatic. They’re gradual, layered, and easy to miss until users feel them. And debugging them is less about “what’s broken?” and more about “where is time actually going?”

#Java #SpringBoot #Microservices #ProductionIssues #BackendEngineering #SystemDesign
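A sketch of the kind of fetch-strategy fix described above (not the author's actual code): the JPA entities are hypothetical, and the join fetch collapses 1+N queries into one.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.OneToMany;

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

@Entity
class PurchaseOrder {
    @Id Long id;
    String status;
    @OneToMany List<OrderItem> items; // lazy by default: the classic N+1 trigger
}

@Entity
class OrderItem {
    @Id Long id;
    String sku;
}

interface PurchaseOrderRepository extends JpaRepository<PurchaseOrder, Long> {
    // One query with a join instead of 1 (orders) + N (items per order);
    // distinct de-duplicates parents inflated by the join
    @Query("select distinct o from PurchaseOrder o join fetch o.items "
            + "where o.status = :status")
    List<PurchaseOrder> findWithItemsByStatus(@Param("status") String status);
}
```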
I cut read latency by 38% by fixing a concurrency bug I introduced myself.

Last week I pre-allocated a ByteBuffer to avoid creating one per read. Clean, right? Wrong. ReentrantReadWriteLock allows concurrent readers. Two virtual threads entering read() simultaneously would race on the same buffer — one thread's clear() stomping on another's flip(). A BufferUnderflowException waiting to happen.

The real fix turned out to be better than the original plan. Instead of reading the 4-byte length header separately, I read the entire log record — header + payload — in one FileChannel.read() call into the caller's buffer. Then extract the length with getInt(), skip the 8-byte offset field, and the remaining bytes are exactly the payload. 2 syscalls → 1. Zero shared state. Thread-safe by design.

Results (JMH SampleTime, 100k messages, page cache warm):
p50: 751ns → 465ns (−38%)
p99: 1,114ns → 734ns (−34%)
p100: 1.2ms → 718µs (−42%)
Max GC pause: 2ms flat

Also eliminated a String allocation on every read by pre-computing topic-partition routing keys at topic creation time. Pay the cost once at startup, zero on the hot path. The storage engine now allocates nothing per read. The remaining GC is at the application boundary — unavoidable until the protocol speaks ByteBuffer end to end.

link: https://lnkd.in/gfANhFti
Full methodology, JFR evidence, and step-by-step analysis in the blog post: https://lnkd.in/g-nBQBHd

#java #performance #distributedsystems #jvm #lowlatency
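A hedged reconstruction of the single-read pattern this post describes. The record layout (4-byte length, 8-byte offset, then payload) comes from the post; the method shape, the assumption that the length field counts payload bytes, and the requirement that the caller pass a cleared buffer large enough for the whole record are mine.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class RecordReader {
    private final FileChannel channel;

    RecordReader(FileChannel channel) { this.channel = channel; }

    /**
     * One positional read pulls header + payload into the caller's buffer.
     * No shared buffer, so concurrent readers cannot stomp on each other.
     */
    int read(long filePosition, ByteBuffer dest) throws IOException {
        int n = channel.read(dest, filePosition); // 1 syscall instead of 2
        if (n < 12) {
            throw new IOException("truncated record at " + filePosition);
        }
        dest.flip();
        int length = dest.getInt();           // 4-byte length header
        dest.position(dest.position() + 8);   // skip the 8-byte offset field
        dest.limit(dest.position() + length); // remaining bytes: exactly the payload
        return length;
    }
}
```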
You hit “Enter” on a URL… and within milliseconds, you get a response. But here’s the truth most engineers miss 👇

👉 Your API doesn’t start in your controller…
👉 It starts in the OS kernel

Before your Spring Boot app even sees the request:
• DNS resolves the domain
• The OS creates a socket (file descriptor)
• A TCP handshake establishes a connection
• TLS secures the channel
• Data is split into TCP packets
• The kernel buffers and reassembles everything
And only then… your application gets a chance to run.

💡 The uncomfortable reality: most developers spend 90% of their time optimizing:
✔ Controllers
✔ Queries
✔ Business logic
But ignore the layers that actually control:
❌ Latency
❌ Throughput
❌ Scalability

⚙️ Real performance lives in:
• Kernel queues (SYN queue, accept queue; sketch below)
• Socket buffers
• Syscalls (accept, read, write)
• Threading vs event-loop models
• TCP/IP behavior

🚨 That’s why in production you see:
• High latency with “fast” code
• Thread exhaustion under load
• Random connection drops
• Systems that don’t scale

🧠 The shift that changed how I design systems: I stopped thinking in terms of “APIs” and started thinking in terms of:
👉 Data moving through layers
Browser → OS → Kernel → Network → Server → App → Back

If you understand this flow, you don’t just write code…
👉 You build systems that scale.

👇 I’ve broken this entire flow down (end-to-end) in the carousel. Comment “DEEP DIVE” if you want the next post on:
⚡ epoll vs thread-per-request (what actually scales to millions of requests)

#SystemDesign #BackendEngineering #DistributedSystems #Java #SpringBoot #Networking #Scalability #SoftwareEngineering #TechDeepDive
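To make "kernel queues" concrete, here is a minimal JDK-only sketch. The backlog argument below is exactly the accept-queue depth the post refers to: connections the kernel has finished handshaking but your code has not yet accepted.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

class AcceptQueueDemo {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        // backlog: how many completed handshakes the kernel queues before your
        // code ever calls accept(); capped by net.core.somaxconn on Linux
        server.bind(new InetSocketAddress(8080), 128);
        while (true) {
            // If this loop stalls (slow handler, exhausted threads), the accept
            // queue overflows and clients see drops, however "fast" the code is
            SocketChannel client = server.accept();
            client.close();
        }
    }
}
```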
Most Lambda cold start benchmarks are measured in a vacuum. That's the problem. ⚡ They're incomplete.

I've seen posts showing cold starts of 200-400ms for Node.js runtimes and calling it "solved." But that's not what real production looks like - not even close.

Here's what the benchmarks miss: single-function invocations, not the concurrent bursts that actually matter. They ignore VPC attachment latency, which adds 300-900ms in real setups. Dependency initialization inside the handler? Skipped. They use minimal memory configs when most production workloads actually run on 512MB-1024MB. And here's the kicker - they completely ignore the ENI provisioning penalty that hammers the FIRST call after a quiet period.

Real numbers are different. In my experience, actual cold starts in VPC-attached functions with a 512MB config and a mid-size dependency tree run 900ms-1.4s. Not 300ms.

What actually helps in production: Provisioned Concurrency - use it on predictable traffic patterns based on your own requirements, not everywhere. Keep your handler lean, and move heavy init OUTSIDE the function handler where it belongs (sketch below). Lambda SnapStart works for Java if you're stuck with JVM workloads. Right-size your memory too - more RAM equals faster CPU allocation, and cold starts drop measurably. And don't overlook your database layer - a DynamoDB incident I dealt with at 3:47AM had functions waiting 847ms on queries alone, which dropped to 23ms after restructuring with a sparse GSI. Split fat functions. One 80MB deployment package? That's the real enemy.

The 2025 benchmarks are cleaner than before. But they're still not production-first. Your actual numbers depend on VPC config, runtime, and dependency weight - factors the clean benchmarks conveniently abstract away.

So here's my question: what's the worst cold start you've hit in a live environment - and what actually fixed it? 💡
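A minimal sketch of "move heavy init OUTSIDE the handler", assuming the AWS Lambda Java core library and SDK v2; the DynamoDB client here is just an example of a heavy dependency.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class Handler implements RequestHandler<String, String> {
    // Initialized once per execution environment (and captured by a SnapStart
    // snapshot on Java), then reused across warm invocations
    private static final DynamoDbClient DDB = DynamoDbClient.create();

    @Override
    public String handleRequest(String input, Context context) {
        // Hot path stays lean: no client construction, no dependency loading here
        return "ok";
    }
}
```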
https://ferkakta.dev/zombie-java-app-serving-scrapers/