In one of my recent projects, an app kept failing in production. The timing looked random, but the way it died was always the same. Users could not launch their clients - "No active Server found". Existing clients hung.

The silver bullet ops used - a server restart. Everything smooth. Then after a few weeks, the same issue. The story repeated.

Logs showed no errors. Heap dumps, thread dumps, memory, CPU - everything was looked at. No exceptions, no spikes, no BLOCKED threads. One expert suggested moving from 8GB to 20GB heap. But there was no OutOfMemoryError either.

Nothing was moving. So I kept looking for signals. The thread dumps finally showed something. All worker threads were in the RUNNABLE state - which was an illusion. They were actually stuck waiting on an external Perl script.

The script was hitting database errors. But the real problem was the Java code handling it.

When Java calls an external script, it reads the script's output through two pipes - STDOUT and STDERR. The code was reading them one at a time. STDOUT first, then STDERR.

Meanwhile the script was writing errors to STDERR. Nobody was reading it. The pipe buffer filled up. The script blocked on its next write. Because it was blocked, STDOUT never closed. Because STDOUT never closed, the Java thread waited forever.

One worker swallowed. Then another. Then another. Until there were no workers left.

"No active server found."

---

The code did have a timeout mechanism - a timer thread that would interrupt the stuck worker after 5 minutes. It looked like a safety net. It was an illusion.

Thread.interrupt() only wakes threads that are WAITING or TIMED_WAITING (sleep, wait, join) or blocked on an interruptible channel. A thread stuck in a blocking read on a process stream shows up as RUNNABLE - the interrupt only sets a flag that nothing checks.

The timer fired. The worker ignored it. The safety net had a hole nobody could see.

---

The fix - drain STDOUT and STDERR concurrently using dedicated "gobbler" threads so neither pipe fills up. Add a process-level timeout that terminates the script itself.

---

This had been repeating for months.
A safety mechanism that gave everyone confidence - but didn't actually work.

Small assumptions. Big consequences.

Sometimes the most important thing you can do is refuse to accept "restart and move on" as an answer.

---

Have you ever come across a problem that was hiding in plain sight?

#SoftwareEngineering #Java #ProductionEngineering #LessonsLearned #TechLeadership #ProblemSolving
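
For readers who want to see the shape of the fix described in the post, here is a minimal sketch. It is not the project's actual code: the class name `ScriptRunner`, the `Result` holder and the `run` method are hypothetical names chosen for illustration. It assumes Java 9+ (for `InputStream.readAllBytes`) and that the script needs no input on STDIN. Two gobbler tasks drain STDOUT and STDERR at the same time, and `Process.waitFor(timeout, unit)` followed by `destroyForcibly()` puts the timeout on the script itself instead of relying on a thread interrupt.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ScriptRunner {

    /** Hypothetical holder for the outcome of one script run. */
    public static final class Result {
        public final int exitCode;
        public final String stdout;
        public final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    /** Runs the command, draining both pipes concurrently, with a hard timeout. */
    public static Result run(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException, TimeoutException {
        Process process = new ProcessBuilder(command).start();
        process.getOutputStream().close(); // assumption: the script reads nothing from STDIN

        ExecutorService gobblers = Executors.newFixedThreadPool(2);
        try {
            // One task per pipe, started before waiting, so neither pipe can fill up.
            Future<String> out = gobblers.submit(() -> drain(process.getInputStream()));
            Future<String> err = gobblers.submit(() -> drain(process.getErrorStream()));

            // Process-level timeout: kill the script rather than interrupt a thread.
            if (!process.waitFor(timeout, unit)) {
                process.destroyForcibly();
                throw new TimeoutException("script exceeded " + timeout + " " + unit);
            }
            return new Result(process.exitValue(), out.get(), err.get());
        } catch (ExecutionException e) {
            throw new IOException("failed to read script output", e.getCause());
        } finally {
            gobblers.shutdownNow();
        }
    }

    /** Reads a stream to end-of-file and returns its contents as UTF-8 text. */
    private static String drain(InputStream in) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call might look like `ScriptRunner.run(List.of("perl", "report.pl"), 5, TimeUnit.MINUTES)`, where `report.pl` is a placeholder script name. One caveat with this sketch: if the script starts child processes that inherit its pipes, the gobblers only see end-of-file once those children exit too, so a production version may also need to clean up descendants.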
Excellent work, Kunal K Motiani! I'm happy I was also part of the brainstorming discussions we had, and that we eventually hit the goal 👏
This is a fantastic share, Kunal! 🙌 Really appreciate the depth of your observations around real-world engineering and production scenarios. Posts like these help the community learn beyond textbooks. Keep sharing such valuable perspectives!
This is a great share, Kunal 👏. Really appreciate the informative insight on OS-level stdout and stderr streams.