My former colleague Hossein Ghorbanfekr and I recently wrote a book on GPU computing in Python. While many Python programmers, data scientists, and researchers rely on GPU acceleration through high-level frameworks like PyTorch, we noticed that few grasp what’s happening under the hood. Historically, low-level GPU programming was the domain of C/C++ developers, leaving Python users dependent on high-level libraries that wrap low-level code written by someone else. These days, tools like the Numba JIT compiler and the Numba-CUDA backend enable Python developers to write high-performance, low-level GPU code without switching languages.

Our book, GPU-Accelerated Computing with Python 3 and CUDA, aims to make CUDA accessible to Python programmers who want to dig one level deeper or who need more control over their GPU-accelerated code. We start with the fundamentals: writing and executing CUDA kernels, managing streams, profiling performance, and understanding memory hierarchies. Everything is taught through Python, using Numba-CUDA. We then connect these concepts to high-level libraries like CuPy and RAPIDS, which integrate seamlessly with the scientific Python ecosystem. We also include JAX as a flexible framework for differentiable programming and machine learning on GPUs and other accelerators. In the last third of the book, everything comes together in practical applications: solving the heat equation, detecting objects in images, simulating atomic interactions, and building and training a small transformer-based language model.

This project took a lot of evenings, weekends, and holidays over 1.5 years, but we hope we made something that will benefit other researchers, data scientists, and engineers. We’re grateful to Packt for the opportunity to bring this book to life. The e-book is available now on Amazon (https://a.co/d/03VXXelq), and the print version will be out in a few weeks. This is not an April fool's joke.
#gpu #hpc #python #CUDA #numba #scientificcomputing #machinelearning #RAPIDS #cupy #JAX
Niels Cautaerts’ Post
-
A Python loop: 662 nanoseconds per iteration. Add two characters: same loop, same algorithm, 50–200× faster. That's @jit, and understanding why it works is a systems-level education. I break it down here; it covers:

▸ Why Python is structurally slow: not just "interpreted", but the boxing, type dispatch, and GC pressure on every single loop iteration
▸ What Numba actually is under the hood: a 5-stage compilation pipeline from Python bytecode → type inference → Numba IR → LLVM IR → machine code or CUDA PTX, using the same LLVM backend Clang uses
▸ A real benchmark breakdown: pure Python (662 ns) vs Numba (193 ns) vs built-in C (128 ns), why Numba doesn't always win, and when it wins massively
▸ The HPC memory hierarchy explained: registers, L1/L2 cache, DRAM, PCIe, GPU HBM, and why the most common GPU bottleneck isn't compute, it's data transfer
▸ CUDA C++ vs PyCUDA vs Numba: a side-by-side comparison of when to use which, with no fluff
▸ The Monte Carlo Pi exercise: how adding @jit to a 1M-iteration loop gives a 50–200× speedup, and why this is the sweet spot Numba was built for
▸ The core architectural insight: Python is a control plane, not a compute plane, the same pattern behind PyTorch, TensorFlow, and JAX

#Python #Numba #GPU #CUDA #HPC #DataScience #MachineLearning #ScientificComputing #PerformanceEngineering #NumPy #SoftwareEngineering
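The Monte Carlo Pi exercise mentioned above can be sketched as follows. This is a minimal illustration, not the author's actual benchmark code: the function name `monte_carlo_pi` and its parameters are my own, and the fallback decorator is only there so the sketch runs even without Numba installed (with Numba present, the loop is compiled to machine code on first call).

```python
import math
import random

try:
    from numba import jit  # pip install numba
except ImportError:
    # Fallback: a no-op decorator so the sketch still runs at interpreter speed.
    def jit(**kwargs):
        return lambda func: func

@jit(nopython=True)
def monte_carlo_pi(n_samples):
    """Estimate pi by sampling points in the unit square and
    counting how many land inside the quarter circle."""
    hits = 0
    for _ in range(n_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return 4.0 * hits / n_samples

estimate = monte_carlo_pi(1_000_000)
```

The loop body touches only unboxed floats and an integer counter, which is exactly the shape of code Numba's nopython mode compiles well.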
-
🔊 New paper ("Fast-ER: GPU-Accelerated Record Linkage and Deduplication in Python") published in The Journal of Open Research Software (JORS): 🔶Jacob Morrier, Analysis Group 🔶Sulekha Kishore, Massachusetts Institute of Technology 🔶Michael Alvarez, Caltech Paper available here: https://lnkd.in/g7uPvHvE
-
I’m excited to share that our software metapaper on Fast-ER—our Python package for GPU-accelerated record linkage and deduplication—has been published in the Journal of Open Research Software. Huge thanks to Sulekha Kishore and Michael Alvarez for their outstanding work in making this package a reality. If you’re working on record linkage and deduplication problems, I encourage you to check out the paper and give Fast-ER a try! 🔗 Link to the paper: https://lnkd.in/eVRyeMW2 🔗 Link to the package documentation: https://lnkd.in/eJbYdAuY
-
Give it a listen to see the work we are doing to help make Python packaging work with accelerated computing!
Podcast: WheelNext and Packaging PEPs, with NVIDIA, Astral, and Quansight on 'Talk Python' https://lnkd.in/gm2JTDsg
-
The strength of Python lies in its community, and at Quansight, we are excited to contribute to a cross-industry initiative that makes the Python packaging ecosystem more capable and flexible. Our co-CEO, Ralf Gommers, recently joined host Michael Kennedy on the Talk Python podcast, together with Jonathan Dekhtiar from NVIDIA and Charlie Marsh from Astral, to discuss the collaborative work on WheelNext and wheel variants. Working alongside many other companies and open source projects, we are developing a new standard for hardware-aware packaging. This effort ensures that whether you are using a specialized GPU or a modern CPU, the tools you rely on, like NumPy and PyTorch, will automatically perform at their best. Listen to the full conversation here: https://lnkd.in/gnxvNdM6
-
A few months ago I wrote a GPU-native Python library that makes radiomic feature extraction 25× faster than PyRadiomics, is fully IBSI-compliant, and is numerically identical to within 10⁻¹¹. The preprint is now out here: https://lnkd.in/gFiBw_ep GitHub: https://lnkd.in/gAN2XGCT #Radiomics #MedicalImaging #MachineLearning #OpenSource #ImageProcessing
-
Recommender Systems using implicit #machinelearning #datascience #recommendersystems #implicit

Fast Python collaborative filtering for implicit feedback datasets. This project provides fast Python implementations of several popular recommendation algorithms for implicit feedback datasets:

- Alternating Least Squares, as described in the papers Collaborative Filtering for Implicit Feedback Datasets and Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering
- Bayesian Personalized Ranking
- Logistic Matrix Factorization
- Item-item nearest-neighbour models using Cosine, TF-IDF, or BM25 as a distance metric

All models have multi-threaded training routines, using Cython and OpenMP to fit the models in parallel across all available CPU cores. In addition, the ALS and BPR models have custom CUDA kernels, enabling fitting on compatible GPUs. Approximate nearest neighbours libraries such as Annoy, NMSLIB, and Faiss can also be used by Implicit to speed up making recommendations. https://lnkd.in/gwJJr2PS
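For readers unfamiliar with ALS on implicit feedback, here is a minimal NumPy sketch of the idea behind the Hu–Koren–Volinsky algorithm the library implements: turn raw interaction counts into confidences, binarize them into preferences, and alternately solve a confidence-weighted least-squares problem for user and item factors. This is an illustrative toy under my own naming and hyperparameters, not the library's optimized Cython/CUDA implementation.

```python
import numpy as np

def als_implicit(R, factors=8, alpha=40.0, reg=0.1, iterations=10, seed=0):
    """Toy ALS for implicit feedback. R is a (users x items) matrix of
    interaction counts; returns user and item factor matrices."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors
    C = 1.0 + alpha * R            # confidence in each observation
    P = (R > 0).astype(float)      # binary preference
    I = reg * np.eye(factors)      # ridge regularizer
    for _ in range(iterations):
        # Alternate: fix item factors to solve for users, then vice versa.
        for U, V, Cm, Pm in ((X, Y, C, P), (Y, X, C.T, P.T)):
            for u in range(U.shape[0]):
                Cu = Cm[u]
                A = V.T @ (Cu[:, None] * V) + I   # weighted normal equations
                b = V.T @ (Cu * Pm[u])
                U[u] = np.linalg.solve(A, b)
    return X, Y

# Two obvious "taste blocks": users 0-1 like items 0-1, users 2-3 like items 2-3.
R = np.array([[1., 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
user_factors, item_factors = als_implicit(R)
scores = user_factors @ item_factors.T
```

The real library replaces the inner dense solve with conjugate-gradient updates and parallelizes across cores or CUDA threads, but the alternating structure is the same.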
-
Recently, I started exploring Python more deeply, and honestly, it’s one of the easiest languages to get comfortable with. What I like about Python is how simple and readable it is. You don’t have to struggle with complex syntax, so you can focus more on solving problems. That’s probably why it’s widely used in areas like data science, machine learning, and automation. While learning Python, I also came across some interesting tools and languages built around it: Hy – It lets you write Python using a Lisp-style syntax. Felt a bit different at first, but it shows how flexible Python really is. Coconut – This one adds functional programming features to Python. Things like pattern matching make the code cleaner in some cases. MyHDL – This was something new for me. It uses Python for hardware design and can convert code into Verilog or VHDL. Pretty interesting to see Python used beyond software. What I understood from all this is that Python is not just a single language—it’s a whole ecosystem that keeps evolving. Still learning, still exploring 🙂 If you’re also learning Python or working in data science, would love to connect and share ideas! #Python #LearningJourney #DataScience #Programming #Tech
-
Unlock the power of integrals with Python! 🚀 Dive into three effective methods: analytical solutions, Sympy symbolic integration, and Monte Carlo sampling. Perfect for tackling real-world problems with precision. Enhance your data science toolkit today! Read more: https://lnkd.in/gXCYrhu6 #DataScience #Python #NumericalMethods #MonteCarlo
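As a taste of the Monte Carlo method from the linked article, here is a stdlib-only sketch: average the integrand at uniform random points and scale by the interval length. The function name and parameters are my own, not the article's; its symbolic route would use `sympy.integrate` instead.

```python
import random

def mc_integrate(f, a, b, n=100_000, seed=42):
    """Monte Carlo estimate of the integral of f over [a, b]:
    the mean of f at uniform samples times the interval length."""
    rng = random.Random(seed)
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

# Integral of x^2 on [0, 1]; the exact (analytical) answer is 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
```

The error shrinks like O(1/sqrt(n)) regardless of dimension, which is why Monte Carlo wins for high-dimensional integrals where quadrature rules blow up.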
-
Big update to the open-source packages for computing algebraic immunity of Boolean functions!!!! Python: algebraic-immunity -> https://lnkd.in/eGQ__BAc Rust: algebraic_immunity -> https://lnkd.in/e5axJ3gC The new release adds support for restricted algebraic immunity on general subsets, following the algorithms from the paper "Computing the Restricted Algebraic Immunity and Applications to Weightwise Perfectly Balanced Functions": https://lnkd.in/eTUfwjtw. This is the first public tooling that makes this computation directly available. The packages now support:

- algebraic immunity
- restricted algebraic immunity
- Python bindings backed by a Rust implementation
- pre-built wheels for major platforms

This remains research-oriented tooling, and there is plenty of room to improve the Rust back-end, especially around performance, documentation, tests, and API ergonomics. Feedback from people working on Boolean functions, cryptography, and computational algebra would be very welcome. Issues, benchmarks and contributions are appreciated across the Python package, the underlying Rust crate, and the main dependency lin_algebra: https://lnkd.in/egrBkKNq.
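For context, the algebraic immunity of a Boolean function f is the minimum degree of a nonzero annihilator of f or of f⊕1 (the restricted variant in the new release additionally constrains inputs to a subset). A brute-force reference sketch, nothing like the optimized Rust back-end and with my own naming, reduces it to rank computations over GF(2): an annihilator of degree ≤ d exists iff the matrix of degree-≤d monomials evaluated on the support of f is rank-deficient.

```python
from itertools import combinations

def gf2_rank(rows):
    """Rank over GF(2) of a matrix given as a list of integer bitmasks."""
    rank = 0
    rows = list(rows)
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        lsb = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & lsb else r for r in rows]
    return rank

def has_annihilator(truth, n, d):
    """Is there a nonzero g of degree <= d with g(x) = 0 wherever f(x) = 1?"""
    mons = [m for k in range(d + 1) for m in combinations(range(n), k)]
    rows = []
    for x in range(2 ** n):
        if not truth[x]:
            continue
        row = 0
        for j, mon in enumerate(mons):
            if all((x >> i) & 1 for i in mon):  # monomial evaluates to 1 at x
                row |= 1 << j
        rows.append(row)
    # A nontrivial annihilator exists iff the evaluation matrix is rank-deficient.
    return gf2_rank(rows) < len(mons)

def algebraic_immunity(truth, n):
    """Minimum degree of a nonzero annihilator of f or of f XOR 1."""
    comp = [1 - v for v in truth]
    for d in range(n + 1):
        if has_annihilator(truth, n, d) or has_annihilator(comp, n, d):
            return d

# Majority on 3 variables (truth table indexed by input bit pattern) has AI 2.
maj3 = [1 if bin(x).count("1") >= 2 else 0 for x in range(8)]
```

This is exponential in n on purpose; the point of the optimized packages is to make such computations feasible at realistic sizes.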