How pg_strom boosts PostgreSQL with NVIDIA GPUs

🚀 Accelerating PostgreSQL with GPUs: A 10x Leap in Performance

In the world of databases, speed is key to handling large volumes of data. Recently, we explored how an innovative extension transforms PostgreSQL's performance by leveraging the power of GPUs. It not only speeds up complex queries but redefines efficiency in high-traffic environments.

🔧 Understanding pg_strom: The Revolutionary Extension
pg_strom is an open-source PostgreSQL extension that offloads database operations to NVIDIA GPUs. Built to handle compute-intensive tasks such as aggregations, joins, and filters, it integrates CUDA code generation directly into the PostgreSQL query engine.

• 📊 Impressive benchmarks: In tests with datasets of up to 1 TB, operations such as GROUP BY and window functions ran up to 10x faster than CPU-only execution.
• ⚙️ Straightforward installation: Requires PostgreSQL 11+, CUDA drivers, and compilation from source; once configured, it is enabled with a simple CREATE EXTENSION pg_strom.
• 🛡️ Key considerations: Works best on hardware with powerful GPUs; it supports data types such as float and array, and is still evolving to cover more features.

💡 Practical Use Cases in Production
Companies like MCloud have deployed pg_strom in production clusters, cutting query times from hours to minutes in big-data analysis. It is ideal for data warehousing, ML preprocessing, and financial applications where latency matters.

• 🌐 Scalability: Combines with tools like Citus for horizontal distribution, further boosting throughput.
• 🔒 Security and maintenance: Monitor GPU utilization (for example with nvidia-smi) and verify compatibility with common extensions such as PostGIS.

This advancement shows how modern hardware can revitalize legacy databases, opening doors to innovations in AI and analytics.

For more information, visit: https://enigmasecurity.cl
#PostgreSQL #GPUComputing #DatabaseOptimization #BigData #TechInnovation #CUDA
If this content inspired you, consider donating to the Enigma Security community to support more technical news: https://lnkd.in/evtXjJTA
Connect with me on LinkedIn to discuss more about DB optimizations: https://lnkd.in/eZ6TKWs9
📅 Tue, 14 Oct 2025 06:33:27 GMT
🔗 Subscribe to the Membership: https://lnkd.in/eh_rNRyt
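To make the installation step above concrete, here is a minimal sketch in Python. It assumes the psycopg2 package, a server built with pg_strom and loaded via shared_preload_libraries, and a hypothetical sensor_data table; connection details are placeholders, not part of the original post.

```python
# A minimal sketch (assuming psycopg2 and a PostgreSQL server with pg_strom
# already built and listed in shared_preload_libraries). Connection details
# and the sample table are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="analytics", user="postgres")
conn.autocommit = True
cur = conn.cursor()

# Enable the extension once per database (requires superuser privileges).
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_strom;")

# Inspect the plan of an aggregation; with pg_strom active, GPU plan nodes
# (e.g. GpuScan / GpuPreAgg; exact names vary by version) should appear.
cur.execute("""
    EXPLAIN (COSTS OFF)
    SELECT device_id, avg(reading)
    FROM sensor_data
    GROUP BY device_id;
""")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```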
More Relevant Posts
𝐖𝐡𝐚𝐭 𝐈 𝐋𝐞𝐚𝐫𝐧𝐞𝐝 𝐟𝐫𝐨𝐦 𝐃𝐢𝐬𝐜𝐨𝐫𝐝’𝐬 𝐉𝐨𝐮𝐫𝐧𝐞𝐲 𝐭𝐨 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐞𝐝 𝐌𝐋: 𝐒𝐢𝐦𝐩𝐥𝐢𝐜𝐢𝐭𝐲 𝐖𝐢𝐧𝐬 𝐎𝐯𝐞𝐫 𝐒𝐜𝐚𝐥𝐞! 🤩

Discord, the app we use for chatting, gaming, and communities, runs on ML systems: sophisticated models serving hundreds of millions of users across ads, recommendations, and moderation. As those models grew, Discord ran into scaling challenges that demanded more compute, multiple GPUs, and distributed training.

Discord built a platform on Ray, an open-source distributed computing framework, which made GPUs and clusters as easy to use as running a command. For ML engineers, that turned distributed ML from something hard to use into something they are excited to work with.

During early adoption, the need for a standardized, internal “Ray platform” became clear: different ML engineers were spinning up clusters manually, each solving the same infrastructure challenges with their own solutions. To tame Ray's cluster complexity, Discord built a single parameterized template that generates the full cluster specification at runtime. An engineer only specifies the GPU type, worker count, or memory; the CLI handles the underlying Kubernetes configuration and security settings. Once set up, engineers submit jobs to their own personalized clusters (a minimal job-submission sketch follows after this post).

This made configurations consistent across teams, prevented YAML headaches, made starting and deleting clusters painless, and enabled faster experimentation. The CLI handled the full lifecycle, from creation to deletion, making distributed ML pleasant to use.

The next step was orchestration: moving from one-off, manual training runs to training jobs that run automatically on schedules. Discord built a system from three tools working together:
-> Dagster – defines workflows and schedules (e.g., “train this model daily”)
-> KubeRay – launches Ray clusters on Kubernetes automatically
-> Ray – actually runs the distributed ML job

Here's how it works:
1) The engineer presses a button or sets a schedule in Dagster.
2) Dagster sends the job to Ray.
3) KubeRay spins up a GPU cluster.
4) Ray runs the training job across GPUs.
5) Logs and metrics flow back to Dagster so engineers can see results.

As adoption grew, better observability was needed to see which clusters are running, who owns them, and what their status is across the Ray infrastructure, so a web UI was also built showing active clusters, resources, and metrics.

The infrastructure was proven on their ads ranking model. The results:
-> The model trained faster and scaled better
-> Player participation in Quests doubled
-> Ad coverage increased from 40% → 100%

Discord's Ray platform is now the core of its ML infrastructure, and it is still improving continuously, becoming faster, simpler, and more developer-friendly!

Source: Discord Engineering Blog https://lnkd.in/gqrwMs9z
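As referenced above, here is a minimal sketch using Ray's job submission API. The head-node URL, training script, and dependency list are hypothetical placeholders; Discord's internal CLI presumably wraps this kind of call together with the Kubernetes setup described in the post.

```python
# A minimal sketch of programmatic job submission against a running Ray
# cluster (assuming the `ray` package); the head-node address, script name,
# and dependencies are placeholders, not Discord's actual CLI.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-head.internal:8265")

job_id = client.submit_job(
    entrypoint="python train_ads_ranker.py --epochs 3",
    runtime_env={
        "working_dir": "./",           # ship local code to the cluster
        "pip": ["torch", "pandas"],    # per-job dependencies
    },
)
print(f"Submitted job: {job_id}")
print(client.get_job_status(job_id))
```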
🚀 Rise of File Formats: From Parquet to the Next Generation

There’s no denying the impact #Parquet has had on the analytics ecosystem:
✅ Columnar layout
✅ Strong compression & encoding
✅ The de facto standard for data lakes & warehouses for over a decade
✅ Foundation for open table formats like Apache Iceberg, Apache Hudi, and Delta Lake

But our workloads are changing — and so must our file formats.
🔹 Modern workloads are no longer just batch analytics — they now include AI pipelines, real-time streaming, and low-latency serving.
🔹 Hardware isn’t just CPUs anymore — we’re seeing GPUs, ARM, RISC-V, and wide SIMD architectures.
🔹 Performance bottlenecks emerge in decompression, memory pressure, and non-vectorized execution paths.

💡 To address this new reality, we’re witnessing the rise of next-gen file formats designed for modern compute:
BTRBlocks – ultra-efficient columnar encoding for modern SIMD architectures
Vortex – vectorized compression and compute
Nimble – metadata-light and stream-optimized
Lance – optimized for machine learning & AI workloads

The future of analytics storage is being rewritten — faster, leaner, and GPU-native. It’s an exciting time for anyone in data engineering, analytics, or AI infrastructure.

#DataEngineering #AI #FileFormats #DataLakes #Parquet #Lance #BTRBlocks #Vortex #Nimble #DataInnovation #C2C #C2H #OPENTONEWOPPORTUNITIES
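As a small illustration of the last bullet, here is a hedged sketch assuming the pyarrow and pylance (import name lance) packages are installed: the same Arrow table is written once as Parquet and once as a Lance dataset, a format designed around fast random access and versioned datasets for ML-style workloads. File paths and the toy table are placeholders.

```python
# A minimal sketch, assuming `pyarrow` and `pylance` (import name: lance)
# are installed; file paths and the toy table are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq
import lance

table = pa.table({
    "id": [1, 2, 3],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Classic columnar file: strong scan and compression characteristics.
pq.write_table(table, "vectors.parquet")

# Lance dataset: also columnar, but aimed at fast random access and
# versioning over large embedding columns for ML/AI workloads.
lance.write_dataset(table, "vectors.lance")
ds = lance.dataset("vectors.lance")
print(ds.count_rows())
```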
Understanding File Systems in High-Performance Computing (HPC)

When we think of HPC, we often focus on CPUs, GPUs, or interconnects, but there's another hero behind the scenes: the file system. In HPC, how data is stored and accessed can make or break performance. Here are some key types of file systems used in HPC environments.

1️⃣ Parallel File Systems
These are the backbone of large HPC clusters. They allow multiple nodes to read/write data simultaneously across multiple disks, offering both high throughput and scalability.
Examples: Lustre, IBM Spectrum Scale (GPFS), BeeGFS, OrangeFS
Used for: Scientific simulations, weather models, CFD, and AI training

2️⃣ Distributed File Systems
These systems spread data across nodes for redundancy and high availability, which is great for large data-analytics workloads.
Examples: CephFS, GlusterFS, HDFS
Used in: Data-intensive HPC + Big Data hybrid setups

3️⃣ Burst Buffers and Node-Local Storage
Acting as a high-speed data cache between compute nodes and the parallel file system, burst buffers absorb temporary I/O bursts efficiently.
Technologies: NVMe SSDs, Cray DataWarp, BeeOND
Used for: Checkpointing, AI workloads, temporary scratch data

4️⃣ Shared File Systems for Management
Used for user home directories, logs, and job scripts, not for high-I/O workloads.
Examples: NFS, AFS
Used for: Cluster administration and lightweight access

In short: choosing the right file system is as crucial as choosing the right processor; it determines how fast your simulations, training, or analysis can read and write data. Next time you run an HPC job, remember: your I/O path might be the real bottleneck.

#HPC #Supercomputing #Storage #Lustre #BeeGFS #GPFS #CephFS #ParallelComputing #DataScience #ClusterComputing #HighPerformanceComputing
🚀 Multicore vs Multinode Computing — Which One’s Better?

As computing demands rise, the debate between multicore and multinode architectures becomes increasingly relevant — especially in HPC, AI, and large-scale data processing. Let’s break it down 👇

🧠 Multicore Computing
➡️ A single system with multiple cores sharing the same memory. Each core executes tasks in parallel — ideal for shared-memory systems.
✅ Pros:
• Low communication overhead
• Easier synchronization (shared memory)
• Great for OpenMP, Pthreads, threading, or GPU-based workloads
❌ Cons:
• Limited scalability (core count per CPU is finite)
• Shared memory can become a bottleneck
💡 Example: Running OpenMP or CUDA programs on an 8-core or 64-core processor.

🌐 Multinode Computing
➡️ Multiple independent systems (nodes) connected via a network — each with its own memory and CPU. Used for distributed-memory parallelism with message passing.
✅ Pros:
• Excellent scalability — just add more nodes
• Ideal for massive datasets and simulations
• Suited for MPI (e.g., Open MPI)
❌ Cons:
• Network latency and communication cost
• Complex setup and management
💡 Example: Running MPI-based CFD simulations or large-scale ML training on a cluster.

🔗 Hybrid Approach (Best of Both Worlds)
➡️ Combines multicore + multinode (e.g., MPI + OpenMP or MPI + CUDA). Each node runs multiple cores/GPUs locally while communicating globally.
✅ Benefits:
• Maximizes resource utilization
• Reduces communication overhead between nodes
• Enables scalable yet efficient performance
💡 Example: A hybrid MPI+OpenMP job on a supercomputer like PARAM Siddhi or Frontier (see the sketch after this post).

💰 Cost & Scalability
🧩 Multicore: Cheaper, easier to maintain, but limited by hardware.
🌐 Multinode: Costly (needs interconnects, management), but scales almost infinitely.
🔗 Hybrid: Optimal trade-off — performance scales with investment.

🧩 Takeaway
Choose wisely based on your workload:
🧠 Multicore — Shared-memory, small to medium workloads.
🌐 Multinode — Distributed, large-scale HPC or ML training.
🔗 Hybrid — When you need both scalability and efficiency.

💬 What’s your preferred setup — multicore, multinode, or hybrid? Share your thoughts below 👇

#HPC #ParallelComputing #OpenMP #MPI #CUDA #HybridComputing #Scalability #ClusterComputing #AI #Research #Supercomputing #PARAM #ParallelProgramming #CDAC #MachineLearning #Multinode #Multicore
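As referenced in the hybrid example above, here is a conceptual Python sketch of the two layers, assuming the mpi4py package and an MPI launcher (e.g. mpirun -np 4 python hybrid.py). Real hybrid HPC codes usually pair MPI with OpenMP or CUDA in C/C++/Fortran, so treat this purely as an illustration; the workload and chunk sizes are toy placeholders.

```python
# A minimal hybrid sketch: MPI across ranks (multinode layer) plus a process
# pool within each rank (multicore layer). Assumes `mpi4py` and numpy.
from multiprocessing import Pool

import numpy as np
from mpi4py import MPI


def local_work(chunk: np.ndarray) -> float:
    # Shared-memory-style parallelism inside one node.
    return float(np.sum(chunk ** 2))


if __name__ == "__main__":
    comm = MPI.COMM_WORLD              # distributed-memory layer
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each MPI rank owns one slice of the global problem.
    data = np.arange(1_000_000, dtype=np.float64)
    my_slice = np.array_split(data, size)[rank]

    # Fan the rank-local slice out across CPU cores.
    with Pool(processes=4) as pool:
        partials = pool.map(local_work, np.array_split(my_slice, 4))

    # Global reduction across all ranks/nodes.
    total = comm.allreduce(sum(partials), op=MPI.SUM)
    if rank == 0:
        print(f"Global sum of squares: {total:.3e}")
```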
I was reading an interesting paper today (KVDirect) on a key LLM scaling challenge: disaggregated LLM inference. https://lnkd.in/gnA462iE

What is disaggregated LLM inference? It splits inference into two separate, scalable services:
Prefill: the compute-intensive job of processing the prompt.
Decode: the memory-intensive job of generating the response token by token.
This matters because the two phases have different resource needs. You can scale prefill workers for high request bursts without touching decode workers.

The bottleneck: the KV cache. This data must move from the prefill worker to the decode worker, and across different nodes that transfer is a massive bottleneck. Current solutions often force both workers onto one node (using NVLink), killing flexibility and limiting capacity to a single node's memory.

Why standard networking fails:
Synchronization overhead: too much CPU involvement. For a tiny 4 KB block transfer, 87% of the time is spent waiting on CPU synchronization.
Fragmentation: the cache lives in many small 4 KB blocks, but network libraries are built for large transfers. This "chatty" pattern uses only 1.8% of the available network bandwidth.
Memory idling: the decode worker allocates memory and then waits for the cache, which can account for 65% of total latency.

The solution is a "pull" model with RDMA. It is called a pull model because the decode worker requests the blocks instead of the prefill worker pushing them. It works like this: the transfer is initiated by the decode worker, which uses the metadata to compute the exact location of the blocks it needs and starts the transfer via GPU RDMA. The GPU talks directly to the remote GPU, bypassing the CPU and eliminating the idle time. This turns the process from a two-sided, chatty conversation into a one-sided memory pull.

The fragmentation problem still remains, which is where block coalescing comes in: the KV cache blocks of concurrent requests are combined into a single contiguous region, so many small 4 KB blocks become one large block that can be sent in a single transfer (see the sketch after this post).

After searching a bit, I found the llm-d project. llm-d, a Kubernetes project used by Red Hat, Google, CoreWeave, and IBM, is built on this exact principle: it splits prefill/decode into distinct workloads with smart, cache-aware routing.

Disaggregated inference and smart KV cache management are clearly the next frontier for scalable, cost-effective LLM serving. What are your thoughts? Have you run into the KV cache bottleneck?

#LLM #AI #Inference #DistributedSystems #MLOps #Scalability #KVDirect #vLLM #llmd #Kubernetes #DeepLearning #Google #IBM #RedHat
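As referenced above, here is a purely conceptual sketch of block coalescing: byte buffers stand in for GPU memory and a single bulk copy stands in for one one-sided RDMA read. It illustrates the idea, not the paper's CUDA/RDMA implementation, and the block sizes and counts are illustrative only.

```python
# Conceptual sketch of block coalescing: many scattered 4 KB KV-cache blocks
# versus one coalesced bulk transfer. numpy/bytes stand in for GPU memory.
import numpy as np

BLOCK_BYTES = 4 * 1024                 # one fragmented KV-cache block
NUM_BLOCKS = 256                       # blocks the decode worker needs

# Remote (prefill-side) memory: many scattered small blocks.
remote_blocks = [np.random.bytes(BLOCK_BYTES) for _ in range(NUM_BLOCKS)]

# Naive "chatty" pull: one transfer per block (256 round trips).
pulled = [bytes(b) for b in remote_blocks]

# Coalesced pull: pack the blocks into one contiguous buffer and move them
# with a single large transfer (1 round trip instead of 256).
coalesced = b"".join(remote_blocks)
received = bytes(coalesced)

# The decode worker can still slice out individual blocks afterwards.
block_7 = received[7 * BLOCK_BYTES:(7 + 1) * BLOCK_BYTES]
assert block_7 == pulled[7]
print(f"1 transfer of {len(received)} bytes vs {NUM_BLOCKS} transfers of {BLOCK_BYTES} bytes")
```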
Datalog on the GPU: a game-changer for SMBs?

• Semi-naïve evaluation reduces redundant work in Datalog rule evaluation (illustrated in the CPU sketch below)
• A hash-indexed sorted array data structure enables efficient joins on the GPU
• Results show significant performance gains over state-of-the-art CPU implementations

Takeaway: Optimizing Datalog for the GPU can unlock new possibilities for SMBs working with large datasets.

#datalog #gpuacceleration #datascience
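For readers unfamiliar with the first bullet, here is a minimal CPU-side sketch of semi-naïve evaluation for the classic transitive-closure program (path(x,y) :- edge(x,y). path(x,z) :- path(x,y), edge(y,z).): each round joins only the newly derived tuples against the base relation, which is exactly the redundant work a naive fixpoint repeats and which the paper then maps onto GPU-friendly data structures. The edge list is a toy placeholder.

```python
# A minimal sketch of semi-naïve Datalog evaluation (transitive closure);
# the GPU-side hash-indexed sorted arrays are not modeled here.
def transitive_closure(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    path = set(edges)      # all facts derived so far
    delta = set(edges)     # facts derived in the previous iteration only
    while delta:
        # Join only the *new* facts against edge/2, instead of re-joining
        # the whole path relation every round (the "semi-naïve" trick).
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path   # keep only genuinely new tuples
        path |= delta
    return path


edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```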
When compiled code met GPUs, data pipelines stopped being a team sport.

[Have given up on virtual pooled HBM + NVMe over PCIe + NVMe over RDMA to address OOM problems for now; best to wait for pooled HBM on the next-generation Blackwell, I guess]

If you're an ML/AI person with a deep interest in AI Data Factory concepts, I'd be delighted to connect at my Tuesday PM session on behalf of Capgemini ("Buy More. Spend Less. See Now") at the NVIDIA booth at AWS re:Invent 2025!

I can now develop and run enterprise-grade ingest, ETL, warehousing, and BI pipelines for about 10% of the total cost of the traditional stack. Not a new-platform story — a physics story.

🔹 How the math changed
A single H100 GPU on AWS p5e/p5en replaces dozens of CPU servers. Polars + RAPIDS/cuDF execute compiled Rust + CUDA kernels — no JVM, no shuffle, no orchestration drag. Ray spreads the workload across GPUs with a few lines of Python. And Lepton spins those GPU nodes up in seconds and stops them the moment the job finishes. You pay only for the minutes that matter. That's how runtime cost drops to a fraction of what Snowflake or Databricks consume.

🔹 But the real savings are human
In most enterprises, developer and ops time outweigh compute by an order of magnitude. This is where Rust (Polars) quietly changes the economics. Its strong typing, ownership model, and compile-time checks eliminate entire categories of bugs — nulls, type mismatches, race conditions, data leaks. The compiler enforces correctness before a single byte moves, so runtime firefighting simply… disappears.

Combined with Polars' lazy execution, you debug on CPU (cheap) and run once on GPU (fast); see the small sketch after this post. You don't need a platform team to babysit pipelines. No YAML forests. No schema roulette. No post-mortems about "missing commas in JSON." Just deterministic, type-safe, compiled logic and hardware doing what they're good at. That's how total cost lands near 10% of baseline and delivery cycles shrink from weeks to hours.

🔹 The modern stack in one line
Polars + RAPIDS + Ray + RMM on AWS p5e/p5en via Lepton. Ten minutes of runtime. Ten lines of code. Ten percent of the cost.

Data engineering doesn't need bigger teams or bigger clusters. It needs better compilers, stronger typing, safer languages, faster GPUs, and infrastructure that starts and stops instantly.

NVIDIA Amazon Web Services (AWS) Capgemini Sangeeta Ron MBA, MTech, CAMS Chirag Thakral Jack-Ryan Ashby Rebecca (Smith) Gentile Tony Santiago Dr.Manikantan Poonkundran Vamsee Krishna Athuluru Tanushree Datta Anupam Singh Gaurav Verma Vishal Dongre Vishal Jayanty Kumar Chinnakali Deepak Juneja Vishal Desai

#tPower #ml #ai #GenAI4FS #inc81starch
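As referenced above, here is a minimal sketch of the "debug on CPU, run on GPU" pattern with Polars' lazy API. It assumes a recent Polars release with the optional GPU engine (cudf-polars) installed and an NVIDIA GPU available; the Parquet path and column names are placeholders.

```python
# A minimal sketch: build one lazy plan, preview it cheaply on CPU, then run
# the identical plan on the GPU engine. Paths and columns are placeholders.
import polars as pl

lazy = (
    pl.scan_parquet("s3://bucket/events/*.parquet")   # nothing is read yet
      .filter(pl.col("amount") > 0)
      .group_by("customer_id")
      .agg(
          pl.col("amount").sum().alias("total_spend"),
          pl.len().alias("n_events"),
      )
)

# Develop and debug cheaply: run the same plan on a small sample on CPU.
preview = lazy.limit(1_000).collect()
print(preview)

# Production run: hand the identical plan to the GPU engine.
result = lazy.collect(engine="gpu")
```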
Anthropic recently signed for up to 1,000,000 TPUs. The AI bottleneck moved from GPUs to power + portability.

On Oct 23, 2025, Google and Anthropic announced a multibillion-dollar deal granting access to up to 1M TPUs, with >1 gigawatt of compute slated to come online in 2026 to train the next Claude models. (Reuters/Bloomberg/AP)

Translation: the TPU/XLA stack (JAX, PyTorch-XLA) becomes table stakes, and capacity planning now looks like power procurement + compiler tuning, not "can we find H100s?"

So what:
• Infra/ML: Audit TPU-readiness (JAX/PyTorch-XLA, vLLM-TPU), benchmark on representative runs, and plan interconnect + storage for multi-slice training.
• Finance/Strategy: Model TCO with power + egress; assume staged delivery from 2026.
• Security/Compliance: Revisit data-residency and cross-cloud audit when mixing AWS + GCP.
• Product: Expect faster model refresh cycles; design rollouts around compiler upgrades, not just chip SKUs.

Caveat: Porting effort, egress costs, and grid/interconnect constraints can erase the theoretical gains if not planned early. (DOE/utility interconnects)

Question: If you've moved large training runs to TPUs, what broke first — compiler flags, input pipeline, or interconnect bandwidth?

#TPUs #ClaudeAI #XLA #MLOps #AIInfrastructure
🚀 Big shift ahead in data-engine architecture!

While most orgs run JVM-based engines like Apache Spark or Presto, a new player is emerging: Velox — a C++ execution engine born at Meta, now backed by IBM, Microsoft, NVIDIA, and others, that can power the execution layer of Apache Spark, Presto, and more.

Key advantages:
- Native C++ means less overhead (no garbage collection, no JIT warm-up)
- Built for columnar / vectorised processing (better throughput for analytics)
- Low-level control for engine builders

💡 Are we looking at a future where, rather than choosing JVM vs C++ as a monolithic decision, we adopt a hybrid architecture — JVM-based engines handle one class of workloads and C++-based native execution (via Velox or similar) handles another?

💭 What are your thoughts on the shift, the challenges, and the opportunities?

#DataEngineering #watsonx.data #velox #IBM #ExecutionEngine #nvidia #microsoft