Fixing Python Service Timeout with curl_cffi

We had a Python service where every HTTP request took exactly 15 seconds. Never 12, never 18. When every operation lands on the same number, you're not measuring the operation, you're measuring a timeout.

py-spy dump on a worker showed the main Python thread idle, the executor thread idle, and a hidden tokio-rt-worker thread parked on a futex. That third thread was the clue. The HTTP client (primp, Rust + tokio under the hood) holds a tokio runtime that does not survive fork(). Anything using a prefork model, including Celery, gunicorn --preload, and multiprocessing with the fork start method, produces children with a broken runtime that hangs on every request until an outer timeout fires.

The fix was a one-line swap to curl_cffi (libcurl-based, fork-safe); end-to-end latency went from 15 s to 0.5 s with zero behavior change. The takeaway I keep relearning: mixing native runtimes with fork() is a footgun, and a too-clean number in your metrics is a clue, not a feature. #Python #Debugging #SoftwareEngineering
