🚀 Python GIL vs No-GIL — Real FastAPI Benchmarks (Python 3.13)

Free-threaded Python is no longer just an experiment — it's starting to show real impact. I came across a benchmark comparing Python 3.12 (with GIL) vs Python 3.13t (No-GIL) using FastAPI, and the results are pretty interesting 👇

💡 Key Takeaways:

🔹 Massive CPU boost (~8x)
CPU-bound endpoints jumped from ~4 RPS to ~32 RPS — with ZERO code changes. This is what true parallelism across cores looks like.

🔹 Threading inside requests ≠ better performance
Even without the GIL, spawning threads inside a single request didn't help. Why? Under load, request-level parallelism already saturates the CPU; extra threads just add overhead.

🔹 I/O performance unchanged
No surprise here — the GIL was never the bottleneck for I/O-bound workloads. Async + I/O still behaves the same.

📊 What this means in practice:

✅ Use No-GIL Python when:
- You have CPU-heavy APIs (ML inference, image processing, data pipelines)
- High concurrency + CPU contention exists
- You previously relied on multiprocessing to bypass the GIL

❌ Don't expect gains if:
- Your app is mostly I/O (DB calls, HTTP requests)
- You're already using async effectively

⚠️ Things to keep in mind:
- Free-threading is still evolving
- Thread safety is now YOUR responsibility
- Some C extensions may not be ready yet

🔥 The most exciting part? Same code. Same FastAPI app. Just a different Python runtime → 8x improvement.

This could seriously change how we design backend systems in Python.

Curious — would you switch to No-GIL Python for your APIs?

#Python #FastAPI #BackendEngineering #Performance #Concurrency #AI #SoftwareEngineering
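A minimal, stdlib-only sketch of the effect the post is measuring (no FastAPI involved; `cpu_bound` is a hypothetical stand-in for a CPU-heavy endpoint body). On a GIL build the threaded run takes roughly as long as the serial one; on a free-threaded 3.13t build it can approach a 4x speedup on four cores:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    """Pure-Python busy work, standing in for a CPU-heavy endpoint body."""
    return sum(i * i for i in range(n))

N, WORKERS = 1_000_000, 4

t0 = time.perf_counter()
serial = [cpu_bound(N) for _ in range(WORKERS)]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    threaded = list(pool.map(cpu_bound, [N] * WORKERS))
t_threads = time.perf_counter() - t0

# On a GIL build, t_threads ≈ t_serial (only one thread runs bytecode at a
# time); on a free-threaded build the threads can use separate cores.
print(f"serial {t_serial:.2f}s  threads {t_threads:.2f}s")
```

Running the same script under both interpreters is the quickest way to see the difference for yourself.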
Python GIL vs No-GIL Benchmarks with FastAPI
More Relevant Posts
🚀 Python GIL vs No-GIL: Real FastAPI Benchmarks (Python 3.13)

Python is going through one of its biggest shifts in decades — the Global Interpreter Lock (GIL) is now optional in Python 3.13. But the real question is:

👉 Does removing the GIL actually improve real-world performance?

A recent benchmark using FastAPI gives some interesting insights 👇

💡 Key Takeaways:

🔹 True parallelism is finally possible
With the GIL removed, Python threads can run across multiple CPU cores — something that wasn't possible before.

🔹 Massive gains for CPU-bound workloads
In multi-threaded scenarios, performance can scale significantly (even 3–4x in some cases) when tasks are parallelizable.

🔹 FastAPI doesn't magically get faster
FastAPI is primarily async-based (single-threaded concurrency), so it doesn't automatically benefit from no-GIL unless you switch to thread-based execution.

🔹 Trade-offs are real
- Single-thread performance can drop due to added locking overhead
- Many libraries (NumPy, Pandas, etc.) aren't fully ready yet
- Thread safety becomes your responsibility

🔹 Still experimental
Free-threaded Python in 3.13 is not production-ready yet — but it's a huge step forward.

🔥 What this means for developers:
- If you're building CPU-heavy APIs, no-GIL could be a game changer
- If you rely heavily on async + I/O, the impact will be limited
- The ecosystem still needs time to adapt

👉 Curious to hear your thoughts: would you adopt no-GIL Python today, or wait for ecosystem stability?

#Python #FastAPI #BackendDevelopment #Concurrency #Performance #SoftwareEngineering
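Before benchmarking, it's worth confirming which runtime you're actually on. A small stdlib sketch (note: `sys._is_gil_enabled()` only exists on 3.13+, and on a "t" build the GIL can still be re-enabled at runtime, e.g. by an extension that requires it):

```python
import sys
import sysconfig

# Build-time flag: truthy on free-threaded (3.13t) builds, 0/None otherwise.
gil_free_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print("Free-threaded build:", gil_free_build)

# Runtime check, guarded because the attribute is new in 3.13.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled at runtime:", sys._is_gil_enabled())
else:
    print("Pre-3.13 interpreter: the GIL is always on")
```

This matters because a "no-GIL" benchmark run on a regular build silently measures nothing new.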
FastAPI + No-GIL Python might be one of the most important backend shifts this year.

Most of the conversation in AI has been about models getting faster. But something equally important is happening at the runtime level. With FastAPI 0.136.0, support for Python's free-threaded (No-GIL) builds is now becoming practical to experiment with.

I wanted to understand what this actually means in a real API scenario, so I ran a simple benchmark:
- Python 3.12 (with GIL)
- Python 3.13 free-threaded build (No-GIL)
- Same FastAPI app, same endpoints, no code changes

For CPU-bound workloads, I saw up to ~8x improvement.

This isn't surprising when you think about it. For years, the Global Interpreter Lock has limited Python's ability to fully use multiple cores in a single process. Threads never really meant parallel execution for CPU-heavy tasks. Most of us worked around it using multiprocessing, task queues, or by adding more infrastructure.

No-GIL changes that model. Now threads can actually run in parallel across cores, which means CPU-heavy APIs can scale more naturally without increasing system complexity.

Where this becomes especially relevant:
- ML inference APIs
- Data processing pipelines
- Feature engineering workloads
- Real-time analytics backends

That said, there are some important caveats:
- Python 3.13's free-threaded mode is still experimental
- Not all libraries are thread-safe yet
- The ecosystem will take time to stabilize

So this is not a "move everything to No-GIL today" moment, but it is a strong signal of where Python is heading. For a long time, the trade-off was clear: Python was easy to use, but not ideal for CPU-bound parallelism. That trade-off may not hold for much longer.

Curious to hear how others are thinking about this. Are you planning to experiment with No-GIL Python, or waiting for the ecosystem to mature?
🚀 Mastering Recursion with Gray Code Generation

I recently worked on an interesting problem — generating Gray codes using recursion in Python. This problem is a great example of how powerful and elegant recursive thinking can be.

🔹 What is Gray code?
Gray code is a binary sequence where two consecutive values differ in only one bit. It has applications in digital systems, error correction, and algorithms.

🔹 Approach used:
Instead of generating all binary numbers and converting them, I used a recursive pattern:
- Base case: for n = 1 → ["0", "1"]
- Recursively get the Gray codes for n - 1
- Prefix "0" to the original list
- Prefix "1" to the reversed list
- Combine both

🔹 Python implementation:

class Solution:
    def graycode(self, n):
        if n == 1:
            return ["0", "1"]
        prev_gray = self.graycode(n - 1)
        result = []
        for code in prev_gray:
            result.append("0" + code)
        for code in reversed(prev_gray):
            result.append("1" + code)
        return result

💡 Key learning: sometimes the best solutions don't require complex logic — just recognizing patterns and applying recursion smartly.

📈 This problem strengthened my understanding of:
- Recursion
- Pattern building
- Problem decomposition

Would love to hear how others approached this problem or optimized it further! 😊

#Python #Recursion #Algorithms #CodingJourney #DataStructures #ProblemSolving
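Since the post invites optimizations: there is a well-known closed form, `i ^ (i >> 1)`, that maps an index directly to its Gray code with no recursion and no list reversal. A short sketch:

```python
def gray_code(n: int) -> list[str]:
    # i ^ (i >> 1) is the i-th n-bit Gray code: shifting right by one and
    # XOR-ing flips exactly one bit between consecutive indices.
    return [format(i ^ (i >> 1), f"0{n}b") for i in range(2 ** n)]

print(gray_code(2))  # → ['00', '01', '11', '10']
```

It produces the same "reflected" sequence as the recursive version, in O(2^n) time with O(1) work per element.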
Level Up Your Python API Design: Mastering / and *

Have you ever looked at a Python function signature and wondered what those / and * symbols actually do? While many developers stick to standard arguments, modern Python (3.8+) provides surgical precision over how functions receive data. Understanding this is key to building robust, self-documenting APIs.

Check out this "Ultimate Signature" example:

def foo(pos1, pos2, /,
        pos_or_kwd1, pos_or_kwd2='default',
        *args,
        kwd_only1, kwd_only2='default',
        **kwargs):
    print(
        f"pos1={pos1}",
        f"pos2={pos2}",
        f"kwd_only1={kwd_only1}",
        # ... and so on
    )

The breakdown:
- Positional-only (/): everything to the left of the slash must be passed by position. You cannot call foo(pos1=1). This is perfect for performance and for keeping your API flexible for future parameter renaming.
- Positional-or-keyword: the "classic" Python parameters that can be passed either way.
- The collector (*args): grabs any extra positional arguments and packs them into a tuple.
- Keyword-only: everything after *args (or a standalone *) must be named explicitly. This prevents "magic number" bugs and makes the intent of the caller crystal clear.
- The dictionary (**kwargs): catches any remaining keyword arguments.

Why should you care? Good code isn't just about making it work; it's about making it hard to use incorrectly. By using these boundaries, you create a strict contract: you force clarity where it's needed (keyword-only) and allow flexibility where it's not (positional-only).

Are you using these constraints in your daily development, or do you prefer keeping signatures simple? Let's discuss below! 👇

#Python #SoftwareEngineering #CleanCode #Backend #ProgrammingTips #Python3 #CodingLife
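To see the contract in action, here's a runnable sketch using the same signature (returning a tuple instead of printing, so the results are checkable):

```python
def foo(pos1, pos2, /, pos_or_kwd1, pos_or_kwd2='default',
        *args, kwd_only1, kwd_only2='default', **kwargs):
    return pos1, pos2, pos_or_kwd1, pos_or_kwd2, args, kwd_only1, kwd_only2, kwargs

# All five parameter kinds filled in one call:
print(foo(1, 2, 3, 4, 5, 6, kwd_only1="a", extra="b"))
# → (1, 2, 3, 4, (5, 6), 'a', 'default', {'extra': 'b'})

# Positional-only names can't be supplied by keyword. Because this foo has
# **kwargs, pos1=/pos2= land in kwargs and the positional slots stay empty:
try:
    foo(pos1=1, pos2=2, pos_or_kwd1=3, kwd_only1="a")
except TypeError as e:
    print("TypeError:", e)

# Keyword-only arguments can't be passed by position: "a" falls into *args
# and kwd_only1 is left missing.
try:
    foo(1, 2, 3, 4, "a")
except TypeError as e:
    print("TypeError:", e)
```

Both failing calls raise TypeError at the call site, which is exactly the point: the signature rejects misuse instead of silently accepting it.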
Our Python service had a memory leak… but gc.collect() said everything was fine.

Our Python document parsing service (PDF → OCR → Gemini APIs) started crashing with OOMs. Memory kept increasing after every document 📈 Eventually → OOM crashes.

Look at the image 👇
Top = before (slow memory growth)
Bottom = after (stable)

The tricky part? No obvious leak. gc.collect() was already there. Profilers showed nothing.

What was actually happening:
• Creating a new genai.Client() per request → sockets & connection pools never released
• C libraries (PyMuPDF, PIL, OpenCV) using malloc() → glibc holds freed memory and doesn't return it to the OS
• Cleanup missing in exception paths → leaked temp files & buffers
• Large objects staying in memory too long

Fixes:
✔ Reused a single client
✔ Added: ctypes.CDLL("libc.so.6").malloc_trim(0)
✔ Moved cleanup to finally
✔ Explicitly closed & deleted large objects

💡 Takeaway
In Python systems using C extensions:
➡️ gc.collect() is NOT enough
➡️ Memory leaks can live outside Python
➡️ Understanding the OS allocator matters

Same system. Same workload. Completely different memory behavior.

#backend #python #debugging #engineering
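A minimal sketch of the malloc_trim + finally pattern the post describes. This is not the service's actual code: `process_document` is a hypothetical stand-in, the bytearray stands in for a large decoded page, and `malloc_trim` only exists on Linux/glibc (hence the guard):

```python
import ctypes
import gc

def release_heap_to_os() -> None:
    """Free Python garbage, then ask glibc to return freed arena pages
    to the OS. No-op on platforms without glibc."""
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)  # returns 1 if any memory was released
    except OSError:
        pass  # macOS/Windows/musl: no libc.so.6, nothing to trim

def process_document(path: str) -> int:
    """Hypothetical per-document handler: cleanup runs on every exit path."""
    buf = None
    try:
        buf = bytearray(5_000_000)  # stand-in for a large decoded buffer
        return len(buf)
    finally:
        del buf                # drop the reference even if an exception flew
        release_heap_to_os()   # hand freed pages back between documents
```

The key point is that the trim call lives in `finally`, so the allocator gets a chance to shrink the heap after every document, including failed ones.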
spent the last few days reading the original HNSW paper and rebuilding it from scratch in Python.

HNSW is the algorithm running under every vector database you've heard of: Pinecone, Qdrant, Weaviate. most people just pip install and move on. I wanted to know what's actually in there.

the idea is a layered graph. the bottom layer has every node, densely connected. higher layers are sparser: fewer nodes, longer jumps. when you search you enter at the top, take big jumps to get in the right neighborhood, then drop down layer by layer to zero in. total hops: maybe 15. brute force would check everything.

the thing that genuinely got me was that layer assignment is random, not learned. turns out that's on purpose. any deterministic scheme creates hotspot nodes that everybody routes through, and they become bottlenecks. randomness spreads the load. simple idea, non-obvious consequence.

numbers from my implementation at N=5000, 128-dim vectors:
- ef=10 gives 32.4% recall at 335 QPS, 13.6x faster than brute force
- ef=50 gives 76.3% recall at 114 QPS
- ef=100 gives 92% recall at 69 QPS
- ef=200 gives 99.1% recall at 40 QPS

one number controls the whole speed/accuracy tradeoff. no reindexing, no rebuild, just change ef at query time.

also looked at M, the max-neighbours parameter. M=8 gets you 52% recall; M=32 gets you 94.4% but builds a shallower graph, because each node is already so well connected you need less hierarchy. that tradeoff took me a while to actually feel.

this is a teaching implementation: pure Python, every line commented. production systems rewrite the inner loop in Rust with SIMD, which is why they are orders of magnitude faster. the whole point here was just to understand what the algorithm is actually doing.

code is up at https://lnkd.in/gjYr-C8P if you want to follow along with the paper side by side. storage engine next.
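the random layer assignment fits in a few lines. a sketch of the paper's level draw (the normalization mL = 1/ln(M) is the paper's suggestion; M=16 and the seed are assumed values here, not from the post):

```python
import math
import random

M = 16                      # assumed max-neighbours parameter
m_l = 1.0 / math.log(M)     # paper's suggested level normalization

def random_level(rng: random.Random) -> int:
    # Geometric draw: P(level >= l) = e^(-l / m_l) = M^(-l).
    # Each layer up is M times sparser than the one below.
    return int(-math.log(rng.random()) * m_l)

rng = random.Random(42)
levels = [random_level(rng) for _ in range(100_000)]
print("share on layer 0:", levels.count(0) / len(levels))
print("highest layer drawn:", max(levels))
```

with M=16 roughly 15/16 of nodes stay on layer 0, and the hierarchy height grows only logarithmically with N, which is where the "maybe 15 hops" comes from.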
Python is the world's number one language for AI. It's also how most teams accidentally build their worst technical debt.

We've reviewed 50+ Python codebases. The same 4 mistakes appear every time. Swipe to see what to fix before your codebase becomes a liability.

→ Mistake 1: No type hints
→ Mistake 2: Notebooks in production
→ Mistake 3: Unpinned dependencies
→ Mistake 4: Sync where you need async

The best Python codebases we've worked on share one thing: they were written as if the team expected them to still be running in 5 years. Type hints. Tested modules. Pinned deps. Async where it matters.

That discipline is the difference between a Python product and a Python project.

Bacancy builds Python systems that scale. DM us if you're inheriting one that doesn't.

#Python #PythonDevelopment #CleanCode #TechnicalDebt #SoftwareEngineering #BackendDevelopment #EngineeringLeadership #HirePythonDevelopers
I just shipped a new feature to pydepgate: partial decode.

When the scanner finds a high-entropy encoded blob, it now attempts to decode it and show you what's inside — without executing any of it.

Here's what that looks like against the litellm 1.82.8 wheel. The 34,460-character string in proxy/proxy_server[.]py that I flagged in my last post? pydepgate now decodes it automatically. One layer of base64: 34,460 characters down to 25,844 bytes. Final form: Python source code.

What's in that Python source? The first thing the decoder sees is import subprocess. Then import tempfile. Then import os. Then a PEM public key block. That's a complete second-stage payload, encoded and sitting inside a production Python package used by thousands of developers. The outer package does its advertised job. The encoded blob waits.

pydepgate didn't execute it. It didn't import it. It decoded the bytes statically, identified the content type, extracted the indicators, and showed you the hex. The tool's job is to tell you what's there before anything runs — and now it can tell you more precisely what "there" is.

--peek to enable decoding. --peek-chain to follow multi-layer encoding if the first decode produces another encoded blob.

Still zero dependencies. Still stdlib only. pydepgate 0.2.0 is on PyPI now. By the way, there are over 700 unit tests; this project is covered extensively.
I was working on modelling and asked an LLM to parallelize some Python code. While reviewing, I saw it had given me ThreadPoolExecutor. Looked right. Wasn't.

I knew ThreadPoolExecutor doesn't give you true parallelism for CPU-bound work because of the GIL: multiple threads, but only one runs Python bytecode at a time.

Then I remembered discussing this with Ananth Shankar Narayanan around 7 months back: Python 3.13 shipped a free-threaded (no-GIL) build as an experimental feature. So I went down the rabbit hole: why not just use that?

Turns out most C-extension libraries like NumPy and pandas were written assuming the GIL exists as an implicit lock. Remove it and you expose race conditions in their internal state. The ecosystem is still catching up on making those extensions thread-safe.

So for now, the pragmatic fix is ProcessPoolExecutor. Each subprocess gets its own GIL and its own memory space, so you get actual parallelism without breaking half your dependencies.

LLMs are good at writing code. They're less good at understanding whether that code really helps you. Pay attention to these nitty-gritties — it will set you apart, and I can't say this enough!
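The pragmatic fix looks like this, as a minimal sketch (`cpu_heavy` is a hypothetical stand-in for the modelling code). Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` is nearly a one-line change, but each worker is now a separate process with its own interpreter and GIL:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n: int) -> int:
    """Stand-in for CPU-bound model code: pure-Python arithmetic."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each worker process has its own GIL and its own memory, so these
    # tasks genuinely run in parallel across cores. The __main__ guard is
    # required: workers re-import this module to find cpu_heavy.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_heavy, [100_000] * 4))
    print(results)
```

The trade-off versus threads is that arguments and results are pickled across process boundaries, so this pays off for chunky CPU-bound tasks, not for many tiny ones.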
🚨 If performance is everything, then how come Python continues to dominate in AI/ML/DL?

This is one of the most common questions that arises when talking about languages like C, C++, and Java, whose raw performance is far higher than Python's. Yet developers consistently choose Python. Here's why:

1️⃣ Ecosystem over raw speed: Python's rich libraries (TensorFlow, PyTorch, scikit-learn) remove the need to build complex algorithms from scratch, accelerating innovation.

2️⃣ Simplicity that scales productivity: Python code is easy to understand and maintain, which allows faster experiments, iterations, and implementations — something very important when developing intelligent systems.

3️⃣ Strong community and continuous evolution: a global community actively contributes frameworks, tutorials, and tools, making problem-solving faster and more collaborative.

4️⃣ Integration with optimized backend engines: in practice the gap is small, since Python often serves as a frontend layer over an optimized C/C++ codebase.

5️⃣ Focus on results, not optimization: in AI, time-to-solution and experimentation matter more than micro-level performance gains.

The reality is: Python isn't replacing faster languages — it's orchestrating them.

#Python #MachineLearning #ArtificialIntelligence #DeepLearning #SoftwareEngineering #TechInsights #Innovation