High-Performance Python: Profiling and Speeding Up Your Code

Writing fast Python code isn’t black magic - it’s about measuring first and then applying targeted optimisations. It’s easy to feel frustrated by slow code, but a systematic approach can empower you to improve performance dramatically. This guide will help you profile code to find real bottlenecks and then speed up critical sections using Cython and Numba. Throughout, we’ll focus on practical techniques, real-world examples, and an emphasis on clarity and empathy.

Profiling Python Code

Before rushing to optimize, remember the famous advice: “premature optimization is the root of all evil”. In other words, don’t guess where the slowness is—profile your code to find out. Often, a few hot spots cause the majority of the slowdown (the 80/20 rule), so improving just those parts yields most of the benefits. By measuring first, you avoid wasting time on code that isn’t actually a bottleneck.

Key profiling tools in Python include:

  • timeit – a quick way to benchmark small code snippets or functions. It runs code multiple times and reports average runtime. For example, in a Jupyter notebook you can use the %timeit magic to time a snippet (it will handle repetitions and noise for you). Use timeit when comparing two approaches for a small operation (e.g., list comprehension vs. loop) to see which is faster.
  • cProfile – Python’s built-in deterministic profiler for function-level analysis. It tracks every function call and how much time each takes. You can run it from the command line (python -m cProfile) or within your code – for instance, by using cProfile.Profile as a context manager around a specific block and then sorting the results with the pstats module by time or number of calls (see the sketch after this list). The output shows each function, how many times it was called, the total time spent in it, and the cumulative time including sub-calls. Focus on functions with high total or cumulative time, as they are your bottlenecks.
  • line_profiler – a tool for line-by-line timing within functions. It’s especially useful when cProfile points to a heavy function and you need to see which lines inside are slow. You mark the function with a @profile decorator (after importing line_profiler), run your script with kernprof -lv, and get a report of time spent on each line. This often reveals surprises, like a simple line that takes 90% of the time. For example, line_profiler on a simple loop might show that a particular operation (like parsing a string or making an I/O call) is the real culprit slowing things down. In one example, printing to the console in a loop took about 6% of the program’s runtime just by itself – a small percentage, but in tighter loops or large outputs, I/O can dominate runtime. Use line_profiler in development to pinpoint such issues (you wouldn’t run it in production due to overhead).

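As a minimal sketch of the cProfile workflow mentioned above (the profiled function is hypothetical, and the sort key is one of several that pstats accepts):

import cProfile
import pstats

def slow_function():
    # Hypothetical hot spot: a pure-Python loop doing repeated arithmetic
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

# cProfile.Profile works as a context manager on Python 3.8+
with cProfile.Profile() as profiler:
    slow_function()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)  # show the ten heaviest entries

For line-level detail, you would instead decorate slow_function with @profile and run the script via kernprof -lv, as described in the line_profiler bullet.
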
Interpreting profiler output: When you profile, look for lines or functions that consume the largest fraction of time. These “hot spots” are where optimisation will pay off most. For cProfile reports, pay attention to the cumulative time if a function calls others (it shows total time spent including subcalls) and the tottime (time spent in the function’s own code). A function with high cumulative time might be calling a slower sub-function repeatedly. In contrast, line_profiler output directly tells you which line is the slowest. Profiling data takes the guesswork out of optimisation – it’s often eye-opening to see, for instance, that string parsing or formatting was taking more time than your numerical algorithm, or that a seemingly small nested loop was actually being called millions of times.

Best practices for profiling:

  • Profile with realistic workloads: Make sure you run the profiler on input data or scenarios that reflect real usage. Optimising for a toy dataset might mislead you if production data is larger or differently structured.
  • Avoid premature fixes: Don’t start changing code until you’ve seen the profiler output. Often, what you think is slow isn’t the real issue. Profiling prevents false assumptions. For example, you might suspect a database query is slow, but a profile could reveal it’s actually your JSON serialisation or a regex that's eating most of the time.
  • Use the right tool for the job: Use timeit for micro-benchmarks and quick experiments, cProfile for a broad overview of program performance, and line_profiler to zoom in on specific functions. Each has its place. If memory usage is a concern, Python also has memory profilers (like memory_profiler) – not our focus here, but worth noting.
  • Profiling in development vs. production: In development, you can profile freely – the slowdown doesn’t matter. In production, however, running a full profiler like cProfile continuously is impractical because it adds significant overhead (often 2-10x slowdown). Instead, profile in a staging environment or on recordings of production workloads. If you must profile in production, use sampling profilers (which capture stack traces at intervals) to minimise impact. For instance, Dropbox’s engineering team built a sampling profiler to run with under 2% overhead in production, because standard cProfile was too intrusive. Tools like Pyinstrument (a sampling profiler) can also be used to get a statistical picture of performance with much less overhead than tracing every function call (see the sketch after this list). The bottom line: use heavy instrumentation in dev, and lighter sampling in prod if needed.
  • Iterate and verify: Treat optimisation as an iterative process. Profile, identify a hot spot, fix or improve it, then profile again to verify the effect. This ensures you actually moved the needle. It’s very satisfying to see a function that took 5 seconds now take 0.5 seconds after optimisation – and the profiler will prove it. Then check if a new bottleneck emerges (often once you speed up one part, another part becomes the slowest).
  • Watch out for hidden slowdowns: Profiling often reveals non-obvious issues. Common performance surprises include I/O inside loops (like the console printing mentioned earlier), string parsing and formatting, regexes, JSON serialisation, and innocuous-looking nested loops that turn out to run millions of times.

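As a minimal sketch of the low-overhead sampling approach mentioned above, assuming Pyinstrument is installed (the workload function is hypothetical):

from pyinstrument import Profiler

def handle_request():
    # Hypothetical stand-in for a production-like workload
    return sum(i * i for i in range(500_000))

profiler = Profiler()
profiler.start()
handle_request()
profiler.stop()

# The report is built from periodic stack samples rather than
# per-call tracing, which is why the overhead stays low
print(profiler.output_text())
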
In summary, measurement comes first. Profiling guides your efforts so you spend time where it matters most. Once you’ve identified a true bottleneck in pure Python code, it’s time to speed it up—often by moving work out of pure Python. This is where tools like Cython and Numba come in.

Speeding Up Python with Cython and Numba

So your profiler identified a slow function or loop that’s eating up most of your runtime. Why is it slow? Likely because Python is doing a lot of work per operation: dynamic typing, bytecode interpretation, and the Global Interpreter Lock all add overhead, especially in tight loops or heavy numerical computations. In pure Python, each iteration or function call carries overhead that a lower-level language wouldn’t incur. While vectorised libraries like NumPy leverage fast C code under the hood to sidestep Python’s slowness, not every problem can be neatly vectorised. That’s where Cython and Numba come into play, offering ways to execute Python code at (nearly) C speed by bypassing the Python interpreter for critical sections.

Why Python can be slow in numeric code: Python’s strength is flexibility, but that comes at the cost of speed. Every time you add two numbers in Python, for example, the interpreter has to check types, allocate objects, manage reference counts, etc. If you’re looping over millions of elements, that overhead multiplies. Tools like Cython and Numba mitigate this by either compiling the code to a more efficient form or just-in-time (JIT) translating it to machine code. In essence, they eliminate a lot of Python’s runtime overhead in the parts of your code that you’ve identified as performance-critical.
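To make that overhead concrete, here is a quick micro-benchmark sketch comparing a pure-Python loop with the equivalent NumPy call (exact timings are machine-dependent and purely illustrative):

import timeit

setup = "import numpy as np; data = list(range(1_000_000)); arr = np.arange(1_000_000)"

# Pure Python: every multiplication and addition goes through the interpreter
py_time = timeit.timeit("sum(x * x for x in data)", setup=setup, number=10)

# NumPy: the same arithmetic runs in a single compiled C loop
np_time = timeit.timeit("np.dot(arr, arr)", setup=setup, number=10)

print(f"pure Python: {py_time:.3f}s  NumPy: {np_time:.3f}s")
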

Using Cython for Big Speedups

Cython is a superset of Python that allows you to add static type declarations and compile Python code into C. It’s often described as “Python with C types”. With Cython, you write a .pyx file (or use Cython’s pure-Python mode, which relies on type annotations in a normal .py file) where you can optionally declare C types for variables, arguments, and so on. Cython translates this to C, which a C compiler then builds into a Python extension module. The result is that your heavy Python code runs as compiled, low-level C code. Critical sections can easily become 10x to 100x faster than pure Python if you leverage static typing – at that point you’re often approaching C speed. For example, a naïve Fibonacci loop implemented in Cython with integer types can run an order of magnitude faster than the same loop in pure Python, because all the math happens in C with no interpreter overhead.

To use Cython, you typically identify a hotspot (say, a tight loop doing math or processing), and move it into a Cython module. As a simple illustration, imagine you have a Python function to sum the squares of a list of numbers:

def sum_of_squares(lst):
    total = 0
    for x in lst:
        total += x * x
    return total

If profiling shows this function is a bottleneck, you could write a Cython equivalent:

def sum_of_squares_cython(double[:] arr):
    cdef Py_ssize_t i, n = len(arr)
    cdef double total = 0.0
    for i in range(n):
        total += arr[i] * arr[i]
    return total

By declaring types (double for numbers, using a typed memoryview for the array) and compiling with Cython, you eliminate Python’s dynamic dispatch in the loop. This Cython version will run dozens of times faster than the pure Python version for large arrays (in fact, it will be on par with C performance). Cython gives you fine-grained control: you can add types only where needed. Often, just a few strategic type annotations yield huge gains – you don’t have to convert your entire codebase. In practice, major libraries like scikit-learn and pandas use Cython (or C/C++ directly) under the hood for heavy lifting. Cython is ideal when you need that level of control and maximum performance, or when you want to interface with existing C/C++ code. It does require a bit more effort – you need a C compiler and a build step (see the build sketch below) – but the payoff can be large. As one tutorial put it: “Cython blends Python’s usability with C’s performance”.
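A minimal build sketch, assuming the Cython function above lives in a file named sum_squares.pyx (the filename is illustrative):

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("sum_squares.pyx"))

Running python setup.py build_ext --inplace compiles the extension so you can import sum_of_squares_cython like any other function. Note that the double[:] signature expects a typed buffer, so you would pass a NumPy array (e.g. np.asarray(lst, dtype=np.float64)) rather than a plain list.
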

When using Cython, you are essentially writing a hybrid of Python and C, so it’s most effective for algorithmic code where Python is the bottleneck (CPU-bound work). It’s less useful for I/O or network operations (which are usually waiting on external resources rather than the interpreter), but great for math, loops, and data processing. One thing to keep in mind: because you’re adding types, your code loses some of Python’s dynamic flexibility in those sections. That’s a trade-off you make for speed. In a large project, a good strategy is to keep most code in normal Python (for readability and flexibility) and only Cythonize the bottleneck functions. Profile first to find those bottlenecks, then apply Cython – this is exactly how projects like scikit-learn approach it (write in Python, profile the slow parts, rewrite those in Cython).

Using Numba for Instant Boosts

Numba takes a different approach: it’s a JIT compiler that works at runtime. You don’t need to write a new file or explicitly declare types (although you can, to hint Numba). Instead, you decorate a Python function with @numba.njit (or @numba.jit) and call it as usual. The first time it runs, Numba compiles an optimized machine-code version of that function for the data types it sees (using the LLVM compiler infrastructure under the hood). Subsequent calls use the cached compiled version. Numba focuses on numerical Python: it excels at loops, array arithmetic, and functions that mostly use NumPy and basic Python types (ints, floats). It won’t speed up your Python string manipulations or make your database queries faster, but if you have a Python loop doing math, it can often remove the Python overhead entirely.

The biggest win of Numba is how easily you can get a speed boost. Often it’s literally a one-line change: add a decorator. For example, suppose you have a function computing a running sum in a loop. Pure Python might take a couple of seconds for large arrays. Simply adding @numba.njit (no Python code changes at all) could turn that into milliseconds. A real-world case: a loop that took 2.5 seconds in pure Python ran in 0.19 seconds after applying Numba – a 13× speedup with essentially one line of code change. That’s a quick win! Numba achieves this by compiling the loop to efficient native code and eliminating the Python interpreter overhead, similar to what Cython does, but on the fly. It’s like having a Just-In-Time optimising compiler for your Python function.
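As a minimal sketch of that one-line change (the running-sum function is illustrative, and Numba must be installed):

import numpy as np
from numba import njit

@njit  # compiled to machine code on first call, cached afterwards
def running_sum(arr):
    out = np.empty_like(arr)
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i]
        out[i] = total
    return out

data = np.random.rand(10_000_000)
result = running_sum(data)  # first call pays the compilation cost
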

Numba works best when you follow a few guidelines: use NumPy arrays and functions (it can recognise and optimise many NumPy operations), stick to Python primitives (avoid dynamic Python tricks or unsupported libraries inside the jitted function), and if possible, use its “nopython” mode (@njit ensures it doesn’t fall back to Python). When you use @njit, Numba will complain if it can’t compile something, rather than silently running it in slow Python mode. Numba can also automatically parallelize loops and even offload to the GPU with minimal changes, but even on a single CPU core it often provides massive speedups for numerical code. It’s part of the scientific Python ecosystem (sponsored by Anaconda) and is a powerful tool for those who need speed but want to keep writing in Python.
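For instance, here is a sketch of Numba’s automatic parallelisation (assuming a multi-core CPU; the actual speedup depends on the workload):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(arr):
    total = 0.0
    # prange marks the iterations as independent, so Numba can
    # split them across CPU cores (scalar reductions like this are supported)
    for i in prange(arr.shape[0]):
        total += arr[i] * arr[i]
    return total
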

Cython vs Numba: Which to Choose?

Both Cython and Numba can achieve similar end results: your code runs much faster by leveraging compiled execution. They do, however, have different strengths:

  • Ease of use: Numba is usually easier to get started with – you don’t need to learn new syntax or set up a build, just add a decorator. It’s great for a quick fix or in research/analysis code where you want speed in a hurry. Cython requires writing (or generating) a Cython file and compiling it. This involves a bit more setup (including a C compiler). That said, tools like setuptools or even Jupyter Cython magics can streamline this. If your team is comfortable with a C/C++ toolchain, Cython isn’t too hard, but for many developers, Numba feels more accessible at first.
  • Control and flexibility: Cython offers fine-grained control over types, memory, and calling into C/C++ code. You can hand-tune performance critical code almost like you would in C or C++. This makes it ideal for writing high-performance library code (indeed, SciPy, scikit-learn, and others use Cython for many algorithms). You can also release the Global Interpreter Lock (GIL) in Cython and multi-thread manually, or use OpenMP, etc. Numba, conversely, is more automated – it decides how to compile your code. It’s less flexible if you need to do something unusual, and not all Python features are supported. For example, Numba has limited support for Python OOP or certain libraries. If your code doesn’t fit Numba’s constraints, you might “hit a wall” where Numba just can’t be applied without major refactoring. Cython, being more manual, can always be made to work (you can even drop to pure C in Cython if needed). So if you have very custom needs or need to integrate with low-level libraries, Cython is the better choice.
  • Performance: In many cases, both tools can reach similar top-end performance. When used correctly, a Numba-compiled function and a Cython-compiled function can often be within a few percent of each other in speed. Neither is categorically “faster” across the board; it depends on the use case and how much you optimise. Cython might edge out Numba if you heavily hand-tune types and utilise C-level optimizations (since you can micro-optimise), whereas Numba might beat a non-fully-optimised Cython code by aggressively inlining and vectorising under the hood. But generally, expect comparable performance from both. One difference: Numba has a JIT warm-up cost (the first call includes compilation time), whereas Cython is compiled upfront. For long-running tasks this doesn’t matter, but for short one-off scripts, Cython code runs at full speed immediately, while Numba functions might take a moment on first use.
  • Development and ecosystem: Cython has been around for over a decade and is very stable. Code written in Cython becomes a normal Python extension module, which you can distribute with your package (users don’t even need Cython installed to use it). Numba is a newer project and while it’s robust, it does require that your end-users have Numba and a compatible CPU, etc., since it compiles on the fly. Numba is fantastic in a controlled environment (like a data pipeline or a research environment where you know you can use it). Cython is often preferred for widely distributed libraries because it doesn’t add runtime dependencies (just a one-time build). That said, if you control the deployment, using Numba is just fine. Both projects are actively developed, and Numba’s capabilities have grown (e.g., support for more Python features and NumPy functions over time).

In summary, choose the tool based on your needs: If you want a quick speed boost with minimal code changes, try Numba first. If you need maximum performance, or are developing a library or a performance-critical component that you’ll maintain long-term, Cython is a great choice. There’s no rule saying you can’t use both in one project either – for example, you might use Numba during rapid prototyping and later port the bottleneck to Cython for a production release. Remember, neither tool is a magic wand for all problems. They shine for CPU-bound tasks, especially numerical computations. They won’t help much with I/O-bound code or remove network latency, etc. Always consider whether using optimised pure Python libraries (NumPy, pandas, etc.) can solve the issue first, as that keeps your code simple. But when you have to write that custom loop or algorithm, Cython and Numba are lifesavers.

Finally, don’t be intimidated by these tools. Start small: maybe use %timeit to spot a slow loop, slap @njit on it and see what happens. Or generate a Cython module for a critical function and measure the speedup. It can be an eye-opening experience to see a Python routine go from minutes to seconds. As you gain confidence, you’ll develop an intuition for what kind of code is worth optimising and which tool to use. Always keep correctness and clarity in mind: make it work, make it clean, then make it fast. By following a profile-driven approach and leveraging tools like Cython and Numba when appropriate, you’ll be well-equipped to write Python code that is not only clear and maintainable, but also blazing fast where it counts.
