Cut a PostgreSQL job from 10–12 hours down to ~2 hours 🚀

This was a data loading + cleaning workflow, so the focus was on reducing unnecessary overhead and optimizing execution paths.

Key optimizations that worked:

🔹 Used UNLOGGED tables for staging
Instead of creating a logged table with type casting on existing columns, I switched to an UNLOGGED table and created new columns with proper casting (e.g., dates).
→ Reduced WAL overhead significantly for non-critical data.

🔹 Tuned session-level settings
Increased work_mem
Set synchronous_commit = off (only for this session)
→ Improved intermediate operations and write performance.

🔹 Optimized indexing strategy
Created indexes in the order of joins
Focused only on columns used in joins and filters
→ Avoided index bloat and improved query planning.

🔹 Avoided unnecessary indexing
Indexing everything is tempting—but selective indexing made a big difference in execution time.

Takeaway: Small, context-aware changes in Postgres can lead to massive performance gains—especially in ETL or staging workloads.

#PostgreSQL #SQL #DatabaseOptimization #QueryOptimization #DataEngineering #PerformanceTuning #DatabasePerformance #ETL #DataProcessing #SoftwareEngineering #TechTips #Programming #Engineering #Indexing #SQLPerformance #DataWorkflows #Scalability
Optimize PostgreSQL Job from 10-12 Hours to 2 Hours
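A minimal sketch of what that staging pattern can look like in practice (table, column, and setting values below are illustrative assumptions, not the actual job):

    -- Session-level tuning for a one-off load (values are examples; tune for your hardware)
    SET work_mem = '256MB';
    SET synchronous_commit = off;   -- a crash may lose the last few commits, but causes no corruption

    -- UNLOGGED staging table: skips WAL, loads much faster, but is truncated after a crash
    CREATE UNLOGGED TABLE staging_orders (
        order_id   bigint,
        raw_date   text,
        raw_amount text
    );

    -- Load the raw text first, then cast into properly typed new columns
    ALTER TABLE staging_orders
        ADD COLUMN order_date date,
        ADD COLUMN amount numeric(12,2);

    UPDATE staging_orders
       SET order_date = raw_date::date,
           amount     = raw_amount::numeric;

    -- Index only what the downstream joins and filters actually use, after loading
    CREATE INDEX ON staging_orders (order_id);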
Hot take: Most partitioned Postgres tables shouldn't be partitioned.

We know. Controversial. Let us explain.

Partitioning has become the default recommendation for any table over a certain size. "It's 100GB? Partition it." "Queries are slow? Partition it." "Scaling up? Partition it."

But partitioning isn't free. It adds planning overhead. It complicates migrations. It makes some queries faster and others slower. And if your partition key doesn't match your query patterns, you've just turned one table into dozens of tables that Postgres has to scan one by one. That's not optimization. That's self-inflicted complexity.

So here's the Data Drop #9 framework by Bhupathi Shameer — partition when:
✅ Your queries consistently filter by a predictable key (time range, tenant ID)
✅ You need to archive or drop old data without expensive DELETE operations
✅ Maintenance on the full table (VACUUM, REINDEX) is no longer manageable
✅ You can clearly articulate which partitions most queries will hit

Don't partition when:
❌ You're hoping it'll magically speed up queries you haven't profiled yet
❌ Your queries don't filter by the partition key
❌ Your table is large but your actual problem is missing indexes or bad query plans
❌ You're adding it because a blog post said "partition everything over 10GB"

The line between "this will save us" vs "this will haunt us" is thinner than most teams think.

#AprilDataDrops #PostgreSQL #DataDrop9 #Partitioning #Database #Performance #DataModeling #OpenSourceDB
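For reference, a rough sketch of the "good fit" case, where queries filter on a predictable time key and old data is dropped by detaching partitions (table names and ranges are illustrative assumptions):

    -- Range-partition on the column most queries filter by
    CREATE TABLE events (
        event_id   bigint,
        tenant_id  int,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_q1 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
    CREATE TABLE events_2024_q2 PARTITION OF events
        FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

    -- Queries that filter on created_at prune down to a single partition
    SELECT count(*) FROM events
    WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01';

    -- Retention becomes a cheap metadata operation instead of a massive DELETE
    ALTER TABLE events DETACH PARTITION events_2024_q1;
    DROP TABLE events_2024_q1;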
🔥 Ever spent hours debugging a pipeline because your total user count mysteriously dropped by 20%, only to find out a single line of SQL was to blame?

If you write SQL heavily in your ETL jobs, you’ve probably fallen into this trap: The Accidental Inner Join. It happens when you write a LEFT JOIN intending to keep all records from your primary table, but then you filter the joined table in the WHERE clause.

Here is what this looks like in a real production scenario: Imagine you are using Spark SQL to join a massive users table with a transactions table. You want a list of all users, plus the dates of any completed transactions.

* The Mistake (Filtering in WHERE):

%SQL
SELECT u.user_id, t.transaction_date
FROM users u
LEFT JOIN transactions t
  ON u.user_id = t.user_id
WHERE t.status = 'COMPLETED'

What happens? The LEFT JOIN correctly includes users with no transactions (leaving the status column as NULL). But the WHERE clause runs after the join, and NULL = 'COMPLETED' evaluates to NULL rather than true, so the filter discards every user who hasn't made a purchase. Your LEFT JOIN just silently became an INNER JOIN.

* The Fix (Filtering in ON):

%SQL
SELECT u.user_id, t.transaction_date
FROM users u
LEFT JOIN transactions t
  ON u.user_id = t.user_id
 AND t.status = 'COMPLETED'

What happens? The filter is applied during the join. You get all your users, and only the 'COMPLETED' transactions are matched to them.

Why this matters in real data engineering: Syntax errors are easy; your PySpark job will just fail and alert you. Logical errors are dangerous. The mistaken query is perfectly valid SQL, meaning your AWS Glue job will happily succeed, write the data to S3, and trigger downstream dependencies. You won't know there's a problem until a frantic product manager pings you saying the active user metrics are broken.

Data completeness is just as important as pipeline speed. Always double-check where your filters live.

#SQL #MYSQL #DataEngineering #PySpark #AWS #DeltaLake #BigData #ETL #DataPipelines #DataLake #ApacheSpark #CloudComputing #DataEngineeringLife #AnalyticsEngineering #DataPlatform #OpenData #TechCareers
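One cheap guardrail for this class of bug (a sketch, reusing the post's hypothetical users table and assuming the join result is written to a joined_output table): a LEFT JOIN should never shrink the driving side, so compare the counts before publishing.

    -- Alert if the join output contains fewer distinct users than the source
    SELECT
        (SELECT count(*) FROM users)                        AS users_in_source,
        (SELECT count(DISTINCT user_id) FROM joined_output) AS users_in_output;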
📊 Leveling Up My Database Skills with PostgreSQL!

Today, I worked on structuring and managing user data in PostgreSQL. Creating clean, well-organized tables is a foundational step toward building reliable applications and data-driven systems.

🔍 What this table represents:
* User profiles with name, email, age, and city
* Consistent formatting and data types
* A scalable structure ready for queries, filters, and analytics

Every dataset—no matter how small—is an opportunity to practice data modeling, enhance query performance, and strengthen backend skills.

💡 Small steps in SQL lead to big wins in development.

#PostgreSQL #SQL #DatabaseDesign #BackendDevelopment #DataEngineering #LearningJourney #TechSkills #Productivity
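A minimal sketch of what such a table could look like (column names and types are assumptions based on the fields mentioned in the post):

    CREATE TABLE users (
        user_id bigserial PRIMARY KEY,
        name    text NOT NULL,
        email   text NOT NULL UNIQUE,
        age     int  CHECK (age >= 0),
        city    text
    );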
Database migrations don't have to be a war of attrition.

We just published a detailed case study on migrating Microsoft's WideWorldImporters OLTP database from SQL Server to PostgreSQL 15 using AI agents. The schema wasn't trivial: 26 sequences, 21 tables, 47 stored procedures, and all the edge cases that come with real enterprise workloads, including temporal tables, computed columns, and complex JSON update patterns.

Instead of manual conversion, we built a toolchain of Claude Code slash commands, each handling a discrete stage of the migration pipeline. The result was repeatability at a level manual migrations simply can't match. Every transformation decision was captured in companion audit files. Smoke tests validated each function within minutes against live PostgreSQL containers. And the pipeline scales to hundreds of objects without accumulating technical debt.

🔹 Dependency analysis and DDL conversion handled by purpose-built slash commands
⚡ 45 stored procedures translated from MSSQL OPENJSON patterns to PostgreSQL jsonb_to_record
🧠 Temporal tables, computed columns, and PostGIS types converted with documented semantic decisions
🚀 Container-based smoke tests validated every function before any code was committed
✅ Full audit trail: every decision captured in reviewable markdown files

#AgenticAI #EnterpriseArchitecture #PostgreSQL #DatabaseMigration

Tagging some brilliant minds in this space: Rajagopal Nair Arvind Mehrotra Dr. Anil Kumar P Pradeep Chandran Rashid Siddiqui Ancy Paul
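For readers unfamiliar with the OPENJSON-to-jsonb_to_record translation, here is a simplified, hypothetical illustration of the shape of that change (not taken from the actual WideWorldImporters procedures):

    -- SQL Server: shred a JSON document into rows/columns with OPENJSON
    DECLARE @json nvarchar(max) = N'{"OrderID": 1, "Quantity": 5}';
    SELECT j.OrderID, j.Quantity
    FROM OPENJSON(@json)
         WITH (OrderID int '$.OrderID', Quantity int '$.Quantity') AS j;

    -- PostgreSQL: the equivalent shape with jsonb_to_record
    SELECT r.order_id, r.quantity
    FROM jsonb_to_record('{"order_id": 1, "quantity": 5}'::jsonb)
         AS r(order_id int, quantity int);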
A lot of beginners underestimate databases. But every developer should understand:
- SQL vs NoSQL
- Indexes
- Foreign keys
- JOINs
- Query performance

Database knowledge is repeatedly highlighted as a core skill alongside deployment, caching, and CI/CD because modern systems depend on good data modeling and efficient queries.

#Database #SQL #NoSQL #BackendDevelopment #DataEngineering
Is your query actually slow, or just long-running? This distinction is often misunderstood, but it is critical for effective tuning. In my recent blog post I explain how to identify and analyze query behavior correctly - https://lnkd.in/d4y_JJP4

Would love your thoughts and experiences!

#PostgreSQL #PerformanceTuning #DBA #SQL #TechBlog #Opensource
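A quick way to start separating the two (a generic sketch, not taken from the linked blog): first check how long statements have actually been running, then look at where the time goes inside the plan.

    -- How long have the currently active queries been running?
    SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;

    -- Is the query doing slow work, or simply a lot of work? (orders is a hypothetical table)
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE created_at >= now() - interval '1 day';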
PostgreSQL is often treated as “just” a relational database. But the more interesting question is usually not SQL vs NoSQL. It is this: what consistency model and scaling model does the system actually need?

By understanding the tradeoffs:
🎯 what ACID really gives you
🎯 what BASE really means in practice
🎯 why read replicas are often the first compromise
🎯 why sharding is not replication
🎯 how internet-scale systems changed database architecture
🎯 why many teams eventually still wanted SQL, transactions, and mature tooling
🎯 and why PostgreSQL became such an interesting hybrid answer

If you work with PostgreSQL beyond basic CRUD, this presentation should give you a cleaner way to think about consistency, scaling, and architecture decisions.

#PostgreSQL #DatabaseArchitecture #SQL #NoSQL #ACID #BASE #DistributedSystems #Backend
Every performance boost comes with a durability bill.

In PostgreSQL, unlogged tables are a clear example of this: an unlogged table skips WAL logging. Writes are faster, but if Postgres crashes, the data is gone. Perfect for caches, ETL staging, or derived data. A terrible idea for anything you’d regret losing.

Good architecture isn’t about avoiding trade-offs. It’s about making them intentionally.

#PostgreSQL #SystemDesign #DatabaseInternals #PerformanceEngineering #SoftwareArchitecture #BackendEngineering #Scalability
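A small sketch of making that trade-off explicit (table name and columns are hypothetical):

    -- Fast, crash-unsafe storage for data that can be rebuilt at any time
    CREATE UNLOGGED TABLE session_cache (
        session_id text PRIMARY KEY,
        payload    jsonb,
        expires_at timestamptz
    );

    -- If the data later needs to survive a crash, flip the switch;
    -- this writes the table contents to WAL, so it is not free
    ALTER TABLE session_cache SET LOGGED;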
📢 Day 31 — ROLLUP: Hierarchical Aggregation

ROLLUP creates multiple levels of totals. It is useful for hierarchical summaries.

📌 Syntax
SELECT column1, column2, SUM(value)
FROM table
GROUP BY ROLLUP(column1, column2);

📌 Example
SELECT department, job_title, SUM(salary)
FROM employees
GROUP BY ROLLUP(department, job_title);

🛠 Practical Uses
✔️ Department totals
✔️ Subtotals in reports

#SQL #DataAnalytics #DataEngineering #Database #Programming #Tech #Developers #Learning #DataScience #DataAnalyst #MachineLearning #BigData #BusinessIntelligence #ETL #DataVisualization #DataWarehouse #CareerGrowth #SQLDeveloper #DatabaseDeveloper #DatabaseAdministrator #DataEngineer #BIDeveloper #SQLServer #PostgreSQL #MySQL #Oracle #Snowflake #BigQuery #SparkSQL #TechCommunity #ITProfessionals #ProfessionalGrowth #Networking #LinkedInLearningData
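The subtotal and grand-total rows come back with NULL in the rolled-up columns; if you want to label them explicitly, GROUPING() is one option (a small sketch building on the example above):

    SELECT
        CASE WHEN GROUPING(department) = 1 THEN 'ALL DEPARTMENTS' ELSE department END AS department,
        CASE WHEN GROUPING(job_title)  = 1 THEN 'ALL TITLES'      ELSE job_title  END AS job_title,
        SUM(salary) AS total_salary
    FROM employees
    GROUP BY ROLLUP(department, job_title);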
🚀 Advanced Database Design in PostgreSQL

📌 1. JSON / JSONB (Flexible Data Modeling)
PostgreSQL allows semi-structured data:
✔ JSON → text-based
✔ JSONB → binary, faster & indexable
Powerful features:
✔ ->, ->> → access fields
✔ @> → search inside JSON
✔ jsonb_set → update values
✔ json_agg, json_build_object → API-ready responses

📌 2. Transactions (ACID 🔐)
Ensure safe and reliable operations:
✔ BEGIN → start
✔ COMMIT → save
✔ ROLLBACK → undo

📌 3. Savepoints (Partial Rollback)
Control transactions like a pro:
✔ Create checkpoints
✔ Rollback specific steps only

📌 4. Partitioning (Handle Big Data ⚡)
Split large tables for better performance:
✔ LIST → specific values (e.g., class)
✔ RANGE → ranges (e.g., age, date)
✔ HASH → even distribution
✔ Composite → multi-level partition

📌 5. Scheduling with pg_cron
Automate database tasks:
✔ Cleanup old data
✔ Run periodic jobs
✔ Reduce manual work

📌 6. Migrations (Schema Versioning)
Treat DB like code:
✔ Track changes
✔ Safe deployments
✔ Team collaboration

📌 7. Schema Evolution (DDL Operations)
Modify structure safely:
✔ Rename table
✔ Rename column
✔ Add/remove fields

💬 Final Insight: Advanced DB design is about:
⚡ Scalability (partitioning)
⚡ Flexibility (JSONB)
⚡ Reliability (transactions)
⚡ Maintainability (migrations)

#PostgreSQL #DatabaseDesign #BackendDevelopment #SystemDesign #SQL #SoftwareEngineering
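A compact sketch touching a few of these features (table and key names are illustrative assumptions):

    -- JSONB: store, query, and update semi-structured data
    CREATE TABLE profiles (id bigserial PRIMARY KEY, data jsonb);
    INSERT INTO profiles (data) VALUES ('{"name": "Asha", "tags": ["pro"]}');

    SELECT data->>'name' AS name
    FROM profiles
    WHERE data @> '{"tags": ["pro"]}';          -- containment search, indexable with GIN

    UPDATE profiles
       SET data = jsonb_set(data, '{name}', '"Asha K"');

    -- Transaction with a savepoint for partial rollback
    BEGIN;
    UPDATE profiles SET data = jsonb_set(data, '{plan}', '"gold"');
    SAVEPOINT before_risky_step;
    UPDATE profiles SET data = jsonb_set(data, '{plan}', '"platinum"');
    ROLLBACK TO SAVEPOINT before_risky_step;    -- undo only the last step
    COMMIT;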