SQL Optimization from "Theoretical" to "Performant" | Tuning the SQL Table for our application (MRRS)

We've all heard that indices are the answer to slow SQL performance. But as anyone working on enterprise-scale data knows: it's never that simple. In my recent work on the MRRS project for Network Rail, I ran into a classic bottleneck where standard indexing wasn't just "not helping"; it was actually adding overhead to our ETL cycles. When you're dealing with the scale and precision required for rail infrastructure data, "good enough" isn't an option. I took a deep dive into the execution plans to move beyond the basics. Here's what worked for us.

The Practical Approach:

1. SARGability over everything: We found several legacy filters wrapping columns in functions (like YEAR() or TRIM()), which completely blinded our indices. Refactoring these predicates to be SARGable dropped our execution time significantly.

2. Index pruning: We realized we had "index bloat": too many overlapping indices that were slowing down our INSERT and UPDATE operations during the daily data refresh. By consolidating into a few high-impact composite indices, we improved write throughput.

3. Covering indices: Instead of just indexing the WHERE-clause columns, we included the SELECT columns in the index itself. This allowed the engine to fulfill the query entirely from the index without ever touching the heavy base tables.

The Result? By moving away from "blanket indexing" and focusing on precision tuning, we achieved a much leaner, more resilient data pipeline that can handle the complexities of the MRRS project without breaking a sweat.

The Lesson: Don't just add an index because it's there. Read the execution plan, understand the write penalty, and optimize for the specific data shape you're working with.

#DataEngineering #SQLOptimization #AzureDataFactory #MRRS #DatabaseTuning #SeniorDataEngineer
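To make the SARGability point concrete, here is a minimal sketch in Python using SQLite's EXPLAIN QUERY PLAN (the table, index, and column names are hypothetical, not the actual MRRS schema; SQLite's strftime stands in for SQL Server's YEAR()). Wrapping the indexed column in a function forces a full scan, while an equivalent bare range predicate lets the planner use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, taken_at TEXT, value REAL)")
conn.execute("CREATE INDEX ix_readings_taken_at ON readings (taken_at)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 [(1, "2024-03-01", 0.5), (2, "2024-07-12", 1.2), (3, "2023-11-30", 0.9)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will scan or use an index
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-SARGable: the function call hides taken_at from the index
bad = plan("SELECT * FROM readings WHERE strftime('%Y', taken_at) = '2024'")

# SARGable: an equivalent bare range predicate on the indexed column
good = plan("SELECT * FROM readings "
            "WHERE taken_at >= '2024-01-01' AND taken_at < '2025-01-01'")

print(bad)   # a full table scan
print(good)  # an index search on ix_readings_taken_at
```

The same rewrite applies to YEAR(OrderDate) = 2024 in T-SQL: turn it into a half-open date range and the index becomes usable again.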
More Relevant Posts
Building data pipelines is one thing. Building pipelines that survive "Schema Drift" is another. 🏗️

You've built the perfect automated pipeline in MS SQL Server, optimized every JOIN, and it's running beautifully. Then... the marketing team adds a 'referral_source' column. Or finance renames 'total_rev' to 'final_revenue'. Suddenly, your pipeline crashes. Your overnight jobs fail.

This is Schema Drift, and it's one of the most critical challenges in Data Engineering. As I focus on building robust SQL Server architecture, here are 3 essential T-SQL best practices I'm learning to implement to prevent fragile code:

1️⃣ Never use SELECT * in Production: It's a dangerous anti-pattern. Specifying exact column names ensures that if a table gets a new, unexpected column upstream, your stored procedures won't pull the wrong data or break downstream integrations.

2️⃣ Leveraging sys.columns: You can give your T-SQL "self-awareness." By querying system catalog views like sys.columns and sys.tables, you can dynamically check if a column actually exists before your script tries to use it.

3️⃣ Safe Dynamic SQL: When schemas must be flexible, Dynamic SQL is the answer. But doing it safely by using sys.sp_executesql (instead of just EXEC()) is crucial. It allows you to parameterize your inputs, protecting the database from SQL injection and improving execution plan caching.

I'm focused on learning how to build data systems that last, not just scripts that run once. I'd love to hear from experienced SQL Server professionals: how does your team handle schema drift in production? Let's discuss! 👇

#DataEngineering #SQL #MSSQL #SQLServer #DataAnalyst #DatabaseDesign #CodingTips #SanthoshS
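A tiny sketch of idea 2️⃣, using SQLite as a stand-in for illustration: SQLite's PRAGMA table_info plays the role that sys.columns plays in SQL Server, and the table and column names here are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (id INTEGER PRIMARY KEY, email TEXT)")

def column_exists(conn, table, column):
    # PRAGMA table_info lists (cid, name, type, ...) per column;
    # it is SQLite's equivalent of querying sys.columns in SQL Server
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return column in cols

print(column_exists(conn, "signups", "email"))            # exists
print(column_exists(conn, "signups", "referral_source"))  # drifted away (or not yet added)

# A drift-tolerant read: only select the columns that actually exist
wanted = [c for c in ("id", "email", "referral_source")
          if column_exists(conn, "signups", c)]
rows = conn.execute(f"SELECT {', '.join(wanted)} FROM signups").fetchall()
```

In real T-SQL you would do the same check with `IF COL_LENGTH('dbo.signups', 'referral_source') IS NOT NULL` or a query against sys.columns, then build the statement for sys.sp_executesql.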
From conceptual modelling to SQL implementation - this is the focus of our latest article, now available online in Information Systems. In “Automating the generation of database artifacts: from ER+ to SQL”, we explore how the automated generation of database artifacts can support more rigorous, efficient, and accessible information systems development. I had the pleasure of contributing to this work with Gonçalo Carvalho, Nour Dorgham, Bruno Cabral, Jorge Bernardino, and Vasco Pereira. https://lnkd.in/effFu3bM #InformationSystems #Databases #SQL #ConceptualModelling #ERModel #Research #OpenAccess #AcademicResearch
💡 What *really* happens when you run an SQL query?

Let's break it down with a simple example:

`SELECT name, age FROM users WHERE city = 'New York';`

Most developers stop at writing queries. But the real growth starts when you understand what happens *under the hood* 👇

⚙️ **Step 1: Transport Subsystem**
The moment you hit "Run", your query doesn't jump straight into the database. It first lands in the Transport Subsystem, the gatekeeper.
✅ Manages client connections
✅ Authenticates & authorizes requests
✅ Decides whether your query is allowed to proceed

🧠 **Step 2: Query Processor**
This is where your SQL gets *understood*. It has two key components:
🔹 **Query Parser**: breaks your query into parts (SELECT, FROM, WHERE), checks syntax, and builds a parse tree
🔹 **Query Optimizer**: validates tables and columns (semantic checks) and figures out the *most efficient way* to run your query
🎯 Output: an optimized execution plan

🚀 **Step 3: Execution Engine**
Now the plan turns into action. The Execution Engine:
✅ Follows the execution plan step by step
✅ Coordinates with lower layers
✅ Collects and merges results

💾 **Step 4: Storage Engine**
This is where the actual data work happens. Think of it as a team working behind the scenes:
👨‍💼 Transaction Manager → ensures consistency
🔒 Lock Manager → prevents conflicts
⚡ Buffer Manager → fetches data from memory/disk
🧾 Recovery Manager → logs for rollback & recovery

🔍 The key insight? Your SQL query is not just a command. It's a *journey through multiple layers of abstraction, optimization, and coordination.* And understanding this is what separates query writers from system thinkers.

💬 Curious: what else would you add to this journey?

#SQL #Databases #BackendEngineering #SystemDesign #SoftwareEngineering
Most developers blame slow queries on missing indexes. The real culprit is usually hidden inside the execution plan.

After years of tuning SQL Server workloads, I have learned that reading execution plans is the single highest-leverage skill a data engineer can develop. It tells you exactly where SQL Server is spending its time; no guessing required.

Here is what I look for first when a query is underperforming:

1. Thick arrows between operators: wide data flows signal excessive row estimates and memory pressure
2. Key Lookups: these often mean a nonclustered index is missing one or two covering columns
3. Hash Matches on large tables: usually a sign of outdated statistics or a missing join index
4. Parallelism warnings: CXPACKET waits visible in the plan indicate skewed data distribution or MAXDOP misconfiguration
5. Estimated vs. actual row counts: a significant gap almost always points to stale statistics or parameter sniffing

Once you identify the bottleneck operator, the fix is usually surgical: update statistics, add a covering index, rewrite the predicate, or force a plan hint where justified. You rarely need to rewrite the entire query.

Execution plan analysis is not reserved for DBAs. Every engineer who writes T-SQL should be comfortable opening an actual execution plan before escalating a performance issue. Build that habit early and you will resolve most slow-query tickets in under thirty minutes.

#SQLServer #QueryOptimization #DataEngineering #PerformanceTuning #DatabaseAdministration
🚀 SQL Optimization Case Study: Fixing Concurrency & Performance in Series Generation

Worked on optimizing a stored procedure responsible for generating unique reference numbers in a high-concurrency system.

Before (Problem)
• Separate SELECT + UPDATE → race condition risk
• Multiple IF blocks → duplicate code
• NOLOCK → dirty reads / inconsistent data
👉 Result: duplicate IDs, slow performance, and unreliable behaviour under load.

After (Solution)

🔹 Atomic Update (Key Fix)

UPDATE Series WITH (UPDLOCK, ROWLOCK)
SET CurrentSeries = CurrentSeries + 1
OUTPUT inserted.CurrentSeries
WHERE SeriesType = @SeriesType
  AND IsActive = 1;

✔ Single operation → no race condition
✔ Ensures thread-safe sequence generation

🔹 Removed Redundant Queries
• Eliminated repeated SELECT blocks
• Used OUTPUT to fetch updated values directly
✔ Reduced query count
✔ Improved execution speed

🔹 Improved Locking Strategy
• Used UPDLOCK → prevents concurrent updates
• Removed NOLOCK → avoids dirty reads
✔ Better data consistency + reliability

🔹 Index Optimization

CREATE NONCLUSTERED INDEX IX_Series_Type_Active
ON Series (SeriesType, IsActive)
INCLUDE (CurrentSeries, SeriesUpto, Prefix);

✔ Faster lookup
✔ Reduced table scans

📊 Impact
🚫 Eliminated duplicate reference numbers
⚡ Improved performance under concurrency
🔒 Stronger data integrity
🧩 Cleaner & maintainable code

💡 Takeaway: for high-volume systems, always ensure:
• Atomic operations > separate SELECT + UPDATE
• Proper locking > NOLOCK shortcuts
• Efficient functions > convenience functions

👉 Small SQL changes can create big performance gains.

#SQLServer #DatabaseOptimization #Concurrency #PerformanceTuning #BackendEngineering #SystemDesign
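For illustration, here is roughly the same pattern sketched in Python against SQLite. This is a stand-in, not the production code: SQLite has no UPDLOCK or OUTPUT clause, so the increment and the read-back run inside one transaction instead, and the Series schema here is a hypothetical simplification of the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Series "
             "(SeriesType TEXT, IsActive INTEGER, CurrentSeries INTEGER, Prefix TEXT)")
conn.execute("INSERT INTO Series VALUES ('INVOICE', 1, 0, 'INV-')")

def next_reference(conn, series_type):
    # Increment and read back inside ONE transaction, so no two callers
    # can observe the same value (SQLite serializes writing transactions;
    # in T-SQL, UPDATE ... OUTPUT collapses this into a single statement)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE Series SET CurrentSeries = CurrentSeries + 1 "
            "WHERE SeriesType = ? AND IsActive = 1", (series_type,))
        prefix, n = conn.execute(
            "SELECT Prefix, CurrentSeries FROM Series "
            "WHERE SeriesType = ? AND IsActive = 1", (series_type,)).fetchone()
    return f"{prefix}{n:06d}"

refs = [next_reference(conn, "INVOICE") for _ in range(3)]
print(refs)  # ['INV-000001', 'INV-000002', 'INV-000003']
```

The key property in both versions is the same: the counter is never read outside the protection of the write, so two concurrent callers can never mint the same number.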
🚀 SQL Indexing: From "it makes queries faster" to actually understanding why

For a long time, I used to hear: "Just add an index and your query will be faster." But I never really understood what actually changes under the hood. Recently, I explored this using EXPLAIN ANALYZE, and the difference was eye-opening.

🧠 Before indexing

SELECT * FROM marks WHERE name = 'Chinu';

Execution plan: ➡️ Parallel Sequential Scan
- The database scans the entire table
- Checks every row
- Cost grows linearly with data size
⏱️ Higher execution time as data increases

⚡ After adding an index

CREATE INDEX idx_name ON marks(name);

Execution plan: ➡️ Index Scan
- Uses a B-Tree structure internally
- Navigates like a search tree (O(log n))
- Directly jumps to matching rows
⏱️ Significant performance improvement

🔍 Going one step further: a covering index

CREATE INDEX idx_name_marks ON marks(name) INCLUDE (marks);
-- a distinct name, since idx_name already exists

Now for this query:

SELECT name, marks FROM marks WHERE name = 'Chinu';

Execution plan: ➡️ Index Only Scan
- Required data is already present inside the index
- No need to access the main table (heap)
- Eliminates extra lookups

💡 What actually changed?
- The data didn't change.
- The query didn't change.
👉 The data access strategy changed.

❌ Sequential Scan → "Check everything"
✅ Index Scan → "Navigate intelligently"
🚀 Index Only Scan → "Don't even touch the table"

⚠️ Trade-offs
Indexes are powerful, but not free:
- Additional storage overhead
- Slower INSERT / UPDATE operations
- Must be designed based on query patterns

📌 Final thought
"Indexes don't just make queries faster — they change how databases think about data access."

Exploring more around execution plans, query optimization, and database internals.

#SQL #BackendDevelopment #Database #Performance #LearningInPublic #Developers
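The same progression can be reproduced in a few lines with Python's sqlite3 module. One assumption up front: this uses SQLite rather than PostgreSQL, so there is no INCLUDE clause; a composite index on (name, marks) plays the covering role, and the plan text says COVERING INDEX where PostgreSQL would say Index Only Scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO marks VALUES (?, ?)",
                 [("Chinu", 91), ("Asha", 84), ("Ravi", 77)])

def plan(sql):
    # Flatten the EXPLAIN QUERY PLAN rows into one readable string
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT name, marks FROM marks WHERE name = 'Chinu'"

p_before = plan(query)  # no index yet: a full table scan

# No INCLUDE in SQLite: a composite index on (name, marks) covers the query
conn.execute("CREATE INDEX idx_name_marks ON marks(name, marks)")

p_after = plan(query)   # covering index: the base table is never touched

print(p_before)
print(p_after)
```

Running this, the before-plan reports a scan and the after-plan reports a COVERING INDEX search, which is exactly the "don't even touch the table" behaviour described above.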
Your database is not slow. Your queries are.

One of the most common things I hear is: "The database is slow." But most of the time… it isn't.

A while ago, I had to analyze a query that was taking several minutes to run. At first glance, nothing "too wrong". But digging deeper, the pattern was clear:
* Looping through records
* Nested subqueries executed per row
* Repeated reads over the same tables

Classic row-by-row processing. So instead of trying to "tune" the query… I rewrote the approach.

From this mindset:

FOREACH record
    RUN subquery

To this:

WITH AggregatedData AS (
    SELECT EventId, SUM(Value) AS Total
    FROM Items
    GROUP BY EventId
)
SELECT e.Id, a.Total
FROM Events e
LEFT JOIN AggregatedData a ON a.EventId = e.Id;

The result:
* Query time dropped from minutes to milliseconds
* Massive reduction in IO
* Stable performance even with data growth

That's when it becomes very clear: SQL is not about how to iterate; it's about how to describe the result.

Another common issue I still see: developers relying on ORM-generated queries without ever checking what is actually executed. ORMs are great. But the database only understands SQL.

The real shift happens when you start looking at:
* Execution plans
* Index usage
* Logical reads

Because that's where performance actually lives. The database is rarely the problem. Data access patterns are.

Curious to hear: have you ever rewritten a query and seen a massive performance gain?

#SQLServer #DatabasePerformance #BackendDevelopment #SoftwareEngineering #DotNet #SystemDesign #DataEngineering
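A runnable sketch of the before/after, using Python and SQLite with invented sample data, to show that the row-by-row loop and the set-based CTE produce the same answer while issuing very different numbers of queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Events (Id INTEGER PRIMARY KEY);
CREATE TABLE Items  (EventId INTEGER, Value REAL);
INSERT INTO Events VALUES (1), (2), (3);
INSERT INTO Items VALUES (1, 10), (1, 5), (2, 7);
""")

# Row-by-row mindset: one subquery per event (N+1 round trips)
slow = [(eid,
         conn.execute("SELECT SUM(Value) FROM Items WHERE EventId = ?",
                      (eid,)).fetchone()[0])
        for (eid,) in conn.execute("SELECT Id FROM Events ORDER BY Id").fetchall()]

# Set-based mindset: describe the whole result in ONE statement
fast = conn.execute("""
    WITH AggregatedData AS (
        SELECT EventId, SUM(Value) AS Total
        FROM Items
        GROUP BY EventId
    )
    SELECT e.Id, a.Total
    FROM Events e
    LEFT JOIN AggregatedData a ON a.EventId = e.Id
    ORDER BY e.Id
""").fetchall()

print(slow == fast)  # identical results, 1 query instead of N+1
```

With three events the loop already costs four round trips; at a million events the difference is the "minutes to milliseconds" gap described above.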
⚡ The Window Function Trap That's Costing You Accuracy

A subtle mistake most SQL writers make without realizing it.

When you use ROW_NUMBER() to deduplicate records, the order of ties matters, but most people never specify a deterministic tiebreaker. If two rows share the same PARTITION BY key and the same ORDER BY value, the database engine is free to assign row numbers arbitrarily. Your results become non-reproducible across runs, and in production that's a silent data quality bug.

❌ WRONG: non-deterministic ordering

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY created_at DESC  -- ⚠️ ties unresolved
    ) AS rn
    FROM events
) WHERE rn = 1;

✅ RIGHT: deterministic tiebreaker

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY created_at DESC, event_id ASC  -- ✓ unique tiebreaker
    ) AS rn
    FROM events
) WHERE rn = 1;

"Non-deterministic ordering is the most common source of flaky deduplication logic in production pipelines."

If two rows have the same user_id and created_at, which one survives? Without a tiebreaker, it's arbitrary. Run the same query twice and you might get different rows.

🔑 Rule of thumb: every ROW_NUMBER() OVER clause should end with a column that is globally unique — typically a surrogate key, UUID, or row-level hash. Even if it feels redundant, add it anyway. Reproducibility is a feature, not an accident.

Always test your deduplication logic with intentional duplicate timestamps. If the result changes between runs, you're missing a tiebreaker.

#SQL #DataEngineering #WindowFunctions #Analytics #DataQuality
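The tie scenario is easy to reproduce. The sketch below (Python + SQLite, which supports window functions from version 3.25; the sample data is invented) plants two rows with identical timestamps and shows the tiebreaker pinning down which one survives:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
-- two rows for user 1 with IDENTICAL timestamps: a deliberate tie
INSERT INTO events VALUES
  (10, 1, '2024-05-01 12:00:00'),
  (11, 1, '2024-05-01 12:00:00'),
  (20, 2, '2024-05-02 09:30:00');
""")

dedup = """
SELECT event_id, user_id FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY created_at DESC, event_id ASC  -- unique tiebreaker
    ) AS rn
    FROM events
) WHERE rn = 1
ORDER BY user_id
"""

# With the tiebreaker, the survivor for user 1 is ALWAYS event_id 10;
# drop ", event_id ASC" and the engine may pick either 10 or 11
survivors = conn.execute(dedup).fetchall()
print(survivors)  # [(10, 1), (20, 2)]
```

Deleting the `event_id ASC` term leaves the query syntactically valid and usually still "working", which is exactly why this bug survives code review.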
🚀 SQL Workflow Explained (Step-by-Step)

Ever wondered how data actually flows inside a system before you query it? Here's a simple breakdown of the complete SQL process 👇

🔹 1. Data Source
Data comes from APIs, CSV files, applications, or logs.
Example: a website collects user signup data.

🔹 2. Data Loading (Ingestion)
Data is inserted into the database using SQL queries or ETL tools.

🔹 3. Storage (Tables)
Data is stored in structured tables (rows & columns).

🔹 4. Query Processing
When you run a query, SQL:
✔ Parses the query
✔ Optimizes it
✔ Executes it

🔹 5. Execution Order (Important 🔥)
SQL doesn't run top-to-bottom. It follows this order:
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY

🔹 6. Data Transformation
Using SQL operations like:
• WHERE (filter)
• JOIN (combine tables)
• GROUP BY (aggregation)
• CASE (logic)

🔹 7. Data Output
The final result is returned as a table or aggregated insights.

🔹 8. Real-World Flow 🌍
Data Source → ETL Pipeline → Database → SQL Queries → Dashboard

#SQL #DataEngineering #CloudData #ETL #Analytics #BigData #Learning #Tech
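The steps above can be sketched end-to-end in a few lines (Python + SQLite, with an invented CSV source standing in for the real ingestion pipeline):

```python
import csv
import io
import sqlite3

# 1. Data source: signup data arriving as CSV (hypothetical example)
raw = io.StringIO("user,city\nana,New York\nbob,Boston\ncara,New York\n")

conn = sqlite3.connect(":memory:")
# 3. Storage: a structured table (rows & columns)
conn.execute("CREATE TABLE signups (user TEXT, city TEXT)")

# 2. Loading / ingestion
conn.executemany("INSERT INTO signups VALUES (?, ?)",
                 [(r["user"], r["city"]) for r in csv.DictReader(raw)])

# 4-6. Query processing & transformation: evaluated logically as
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY,
# regardless of the order the clauses are written in
result = conn.execute("""
    SELECT city, COUNT(*) AS n
    FROM signups
    GROUP BY city
    HAVING COUNT(*) >= 1
    ORDER BY n DESC, city
""").fetchall()

# 7. Output: an aggregated insight ready for a dashboard
print(result)  # [('New York', 2), ('Boston', 1)]
```

Step 8 is the same picture at system scale: the StringIO becomes an ETL pipeline, the in-memory database becomes a warehouse, and the final print becomes a dashboard query.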
🚀 Is your database slowing down as data grows? Large tables can silently kill query performance — but there’s a smarter way to handle it. In this blog, we break down table partitioning — a powerful technique that splits massive tables into smaller chunks without changing how you query them. Learn how Range, List, and Hash partitioning can drastically improve query speed, enable parallel processing, and make your database scale effortlessly. If you're working with growing datasets, this is something you can’t afford to ignore. 👉 Read more to optimize your database like a pro! Read the full article here: https://lnkd.in/gFghUMGZ Article by: Karnati Sivaram
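As a rough illustration of the routing idea behind the hash and range strategies mentioned above (the function and partition names below are hypothetical, not taken from the linked article):

```python
import hashlib

N_PARTITIONS = 4

def hash_partition(key: str) -> int:
    # Hash partitioning: a STABLE hash (unlike Python's salted built-in
    # hash()) deterministically routes each key to one of N partitions,
    # spreading rows evenly so queries and writes can run in parallel
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_PARTITIONS

def range_partition(order_date: str) -> str:
    # Range partitioning: route by which interval the key falls into,
    # here one partition per year (e.g. the hypothetical orders_2024),
    # so date-bounded queries can skip every other partition entirely
    return "orders_" + order_date[:4]

print(hash_partition("customer-42"))        # same bucket every run
print(range_partition("2024-03-17"))        # orders_2024
```

List partitioning is the same idea with an explicit lookup table (e.g. region → partition) instead of a hash or an interval; in all three cases the query text stays unchanged and the engine does the routing.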