Day 03/30 🚀 𝗦𝗤𝗟 𝗞𝗲𝘆𝘀 & 𝗖𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀 — The Backbone of Reliable Data

Most people learn SQL like this:
👉 SELECT, JOIN, WHERE…
But in real systems?
👉 Keys and constraints decide whether your data can be trusted.

🔑 𝗦𝗤𝗟 𝗞𝗘𝗬𝗦 (Define Relationships & Uniqueness)

🔹 𝗣𝗥𝗜𝗠𝗔𝗥𝗬 𝗞𝗘𝗬 (PK)
👉 Uniquely identifies each record
👉 Cannot be NULL or duplicated

🔹 𝗙𝗢𝗥𝗘𝗜𝗚𝗡 𝗞𝗘𝗬 (FK)
👉 Links one table to another
👉 Maintains referential integrity
💡 Think: orders must belong to a valid customer

🔹 𝗨𝗡𝗜𝗤𝗨𝗘 𝗞𝗘𝗬
👉 Ensures all values in the column are unique
👉 May allow NULLs (behavior varies by database)
💡 Think: an email address should not repeat

🔹 𝗖𝗢𝗠𝗣𝗢𝗦𝗜𝗧𝗘 𝗞𝗘𝗬
👉 A combination of columns that together provide uniqueness
💡 Think: (order_id + product_id)

🔹 𝗖𝗔𝗡𝗗𝗜𝗗𝗔𝗧𝗘 𝗞𝗘𝗬
👉 Any minimal set of columns that could serve as the PK

🔹 𝗦𝗨𝗣𝗘𝗥 𝗞𝗘𝗬
👉 Any set of columns that uniquely identifies a row, extra columns included

🛑 𝗦𝗤𝗟 𝗖𝗢𝗡𝗦𝗧𝗥𝗔𝗜𝗡𝗧𝗦 (Enforce Rules on Data)

🔹 𝗡𝗢𝗧 𝗡𝗨𝗟𝗟 👉 Column cannot store NULL values
🔹 𝗨𝗡𝗜𝗤𝗨𝗘 👉 Prevents duplicate values
🔹 𝗖𝗛𝗘𝗖𝗞 👉 Ensures values meet a condition 💡 Example: age > 0
🔹 𝗗𝗘𝗙𝗔𝗨𝗟𝗧 👉 Assigns a default value if none is provided

✅ 𝗘𝘅𝗮𝗺𝗽𝗹𝗲:

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    email       VARCHAR(100) UNIQUE,
    amount      DECIMAL CHECK (amount > 0),
    status      VARCHAR(20) DEFAULT 'PENDING',
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

💡 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀?
Without keys & constraints:
❌ Duplicate data
❌ Broken relationships
❌ Invalid values
❌ Unreliable dashboards

With them:
✔ Clean data
✔ Trustworthy systems
✔ Strong data models

👉 Keys define structure. Constraints enforce discipline.

#SQL #DataEngineering #DatabaseDesign #ETL #DataQuality #LearningInPublic
💻 𝐘𝐨𝐮𝐫 𝐒𝐐𝐋 𝐐𝐮𝐞𝐫𝐲 𝐖𝐨𝐫𝐤𝐬… 𝐁𝐮𝐭 𝐘𝐨𝐮𝐫 𝐋𝐨𝐠𝐢𝐜 𝐌𝐢𝐠𝐡𝐭 𝐁𝐞 𝐖𝐫𝐨𝐧𝐠 👀

Most SQL queries don't fail. They run. They return results. And that's exactly where the problem begins.

💡 𝐎𝐧𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐦𝐨𝐬𝐭 𝐢𝐠𝐧𝐨𝐫𝐞𝐝 𝐦𝐢𝐬𝐭𝐚𝐤𝐞𝐬 𝐢𝐧 𝐒𝐐𝐋:
👉 𝗙𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 𝗮𝘁 𝘁𝗵𝗲 𝘄𝗿𝗼𝗻𝗴 𝘀𝘁𝗮𝗴𝗲

🔍 You want to find customers who spent more than 10,000 in total.

𝙎𝙤 𝙮𝙤𝙪 𝙬𝙧𝙞𝙩𝙚:

SELECT customer_id, SUM(amount)
FROM orders
WHERE amount > 10000
GROUP BY customer_id;

Looks correct… but it's not ❌

🧠 What's going wrong?
You filtered individual rows before aggregation, so only single orders above 10,000 survive.

𝗪𝗵𝗶𝗰𝗵 𝗺𝗲𝗮𝗻𝘀:
• Smaller transactions are ignored
• Total spend becomes inaccurate
• A customer whose many small orders add up to more than 10,000 is missed entirely

✅ 𝙏𝙝𝙚 𝙘𝙤𝙧𝙧𝙚𝙘𝙩 𝙬𝙖𝙮:

SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000;

📉 The scary part?
• Your query runs perfectly
• No errors at all
• But your insights are completely wrong

🚀 𝗥𝗲𝗮𝗹 𝗹𝗲𝘀𝘀𝗼𝗻:
- WHERE filters rows before aggregation
- HAVING filters aggregated results

Understanding when to use them matters more than just knowing how.

💬 Most SQL mistakes don't break your query… They break your logic.

#SQL #DataAnalytics #DataAnalyst #LearningSQL #TechCareers #DataThinking 🚀
🚀 𝗦𝗖𝗗 𝗧𝘆𝗽𝗲-1 𝘃𝘀 𝗧𝘆𝗽𝗲-2 𝗶𝗻 𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗶𝗻𝗴

If you're working in Data Engineering or Analytics, understanding Slowly Changing Dimensions (SCD) is a must! Let's break down the two most commonly used types 👇

🔹 𝗦𝗖𝗗 𝗧𝘆𝗽𝗲-1 (𝗢𝘃𝗲𝗿𝘄𝗿𝗶𝘁𝗲)
• Updates existing records directly
• No history is maintained
• Old data is lost permanently
👉 𝗘𝘅𝗮𝗺𝗽𝗹𝗲: Customer changes email → the old email is replaced with the new one
💡 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲:
• When historical data is not important
• When you only need the latest snapshot

🔹 𝗦𝗖𝗗 𝗧𝘆𝗽𝗲-2 (𝗛𝗶𝘀𝘁𝗼𝗿𝗶𝗰𝗮𝗹 𝗧𝗿𝗮𝗰𝗸𝗶𝗻𝗴)
• Maintains the full history of changes
• Inserts a new record for every change
• Uses columns like start_date, end_date, current_flag
👉 𝗘𝘅𝗮𝗺𝗽𝗹𝗲: Customer changes address → the old record is marked inactive and a new record is inserted
💡 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲:
• When tracking changes over time is critical
• Auditing, reporting, and trend analysis

🎯 𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻
Choose Type-1 for simplicity and performance.
Choose Type-2 when history matters.

💬 What do you use more in your projects — Type-1 or Type-2?

#DataEngineering #DataWarehouse #SCD #ETL #Azure #Databricks #SQL #BigData
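A rough sketch of how a single Type-2 change could be applied. The dim_customer table and its columns are hypothetical (following the start_date / end_date / current_flag convention named above), and exact syntax varies by engine:

-- Hypothetical SCD Type-2 change: customer 42 moves to a new address.
-- Step 1: expire the currently active version of the record.
UPDATE dim_customer
SET end_date = CURRENT_DATE,
    current_flag = 'N'
WHERE customer_id = 42
  AND current_flag = 'Y';

-- Step 2: insert the new version as the active record.
INSERT INTO dim_customer (customer_id, address, start_date, end_date, current_flag)
VALUES (42, '221B Baker Street', CURRENT_DATE, DATE '9999-12-31', 'Y');

In a Type-1 dimension, only an UPDATE would run, overwriting the address in place and losing the old value.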
CASE statements. I have always had a love/hate relationship with them. Until now.

I started just like every other person in data learning to write SQL. I found CASE statements, and I felt powerful. I could categorize anything. A group of random statuses? Got it. Translating some weird source data, tagging it, and making it understandable? I can do that too.

Like everything else in data, quick solutions work until they don't. Stay around long enough and things will change, and your CASE statement will not be as pretty as it once was. I have lived this in real life. Multiple times.

I had just built (what I thought was, anyway) a proper data model to serve some reporting infrastructure. I was really excited to finish. I had taken care to define every relationship, de-duplicate, and tag everything in plain English. The whole nine yards. I did it all with… you guessed it… CASE statements.

Then I got another ask. I had to add more data. D'oh! That beautiful CASE statement? Broken.

There's got to be a better way, I thought. What if I turn my CASE statement into a dimension table? It clicked! The dimension can replace a CASE statement (in the application where I was using it). So I got to work, refactoring the data model armed with a proper dimension table instead of a bloated CASE statement. I took it several steps further: I added natural keys and surrogate keys, and adapted a pattern to semantically tag items that were not in the dimension.

Today, this dimension table serves as the semantic key registry for the whole semantic model/layer.

Do I still use CASE statements? Yes, but they are usually super small and limited in application. For the big stuff? That gets a proper dimension.
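A rough illustration of the refactor described above. The table and column names (dim_status, status_code, status_label) are invented for the example, not taken from the post:

-- Before: the mapping logic is hard-coded in a CASE expression,
-- so every new status code is a code change.
SELECT
    order_id,
    CASE status_code
        WHEN 'A1' THEN 'Active'
        WHEN 'C9' THEN 'Cancelled'
        ELSE 'Unknown'
    END AS status_label
FROM orders;

-- After: the same mapping lives in a dimension table,
-- so new codes become rows instead of code changes.
SELECT
    o.order_id,
    -- COALESCE tags items that are missing from the dimension.
    COALESCE(d.status_label, 'Unknown') AS status_label
FROM orders o
LEFT JOIN dim_status d
    ON o.status_code = d.status_code;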
I once had a simple-looking problem:
👉 "𝘞𝘦 𝘩𝘢𝘷𝘦 𝘥𝘶𝘱𝘭𝘪𝘤𝘢𝘵𝘦 𝘤𝘶𝘴𝘵𝘰𝘮𝘦𝘳 𝘳𝘦𝘤𝘰𝘳𝘥𝘴. 𝘒𝘦𝘦𝘱 𝘫𝘶𝘴𝘵 𝘰𝘯𝘦 𝘳𝘰𝘸 𝘱𝘦𝘳 𝘤𝘶𝘴𝘵𝘰𝘮𝘦𝘳."

Sounds easy… until someone asks,
💬 "𝘉𝘶𝘵 𝘸𝘩𝘪𝘤𝘩 𝘳𝘰𝘸 𝘴𝘩𝘰𝘶𝘭𝘥 𝘴𝘵𝘢𝘺?"

That's when I realized:
𝗗𝗜𝗦𝗧𝗜𝗡𝗖𝗧 removes duplicates fast, but gives you no control.
𝗚𝗥𝗢𝗨𝗣 𝗕𝗬 does the same job, just more verbosely.
𝗥𝗢𝗪_𝗡𝗨𝗠𝗕𝗘𝗥() lets you decide — latest record, highest priority, or whatever business logic you need (see the sketch below).

Same data. Same result. Radically different control.

In SQL, the question isn't "𝘊𝘢𝘯 𝘐 𝘳𝘦𝘮𝘰𝘷𝘦 𝘥𝘶𝘱𝘭𝘪𝘤𝘢𝘵𝘦𝘴?"
It's "𝘋𝘰 𝘐 𝘤𝘩𝘰𝘰𝘴𝘦 𝘸𝘩𝘢𝘵 𝘴𝘵𝘢𝘺𝘴?"

That's why 𝗥𝗢𝗪_𝗡𝗨𝗠𝗕𝗘𝗥() often wins in production systems.

How do you usually handle duplicates in SQL? 👇

🔔 Don't forget to follow Baburao Budireddy for more tech job updates
𝐋𝐢𝐤𝐞 👍 | 𝗦𝗔𝗩𝗘 📩 | 𝐑𝐞𝐩𝐨𝐬𝐭 ♻️ | 𝐂𝐨𝐦𝐦𝐞𝐧𝐭 💬 "𝐂𝐅𝐁𝐑"

#SQL #DataEngineering #Analytics #Learning #Databases #TechTips #sqlfordataengineers #sqltips
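A minimal sketch of the ROW_NUMBER() approach, assuming a customers table with an updated_at column (both names are illustrative):

-- Keep exactly one row per customer: the most recently updated one.
WITH ranked AS (
    SELECT
        customer_id,
        email,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id   -- one group per customer
            ORDER BY updated_at DESC   -- the business rule: latest wins
        ) AS rn
    FROM customers
)
SELECT customer_id, email, updated_at
FROM ranked
WHERE rn = 1;

Swapping the ORDER BY is how you swap the business rule (latest record, highest priority, most complete row) without touching the rest of the query.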
𝗦𝗤𝗟 𝗝𝗢𝗜𝗡𝗦 - Not Just Syntax, It's Data Storytelling

Most people memorize joins. But in real projects… you need to understand what each join is actually telling you. 👇

🔹 𝗜𝗡𝗡𝗘𝗥 𝗝𝗢𝗜𝗡 → Only Matching Data
👉 Returns rows present in both tables
💡 Think: "Only what connects"
📌 Use case:
• Customers who placed orders
• Valid transactions across systems

🔹 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 → Keep Everything from Left
👉 All records from the left table + matches from the right
💡 Think: "The left table is my priority"
📌 Use case:
• All users + their activity (even if none)
• Master data enrichment

🔹 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 + 𝗡𝗨𝗟𝗟 → Find Missing Data
👉 Filters for unmatched records (see the sketch below)
💡 Think: "What's missing?" 🔍
📌 Use case:
• Customers who never ordered
• Records that failed to map

🔹 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 → Opposite of LEFT
👉 All records from the right table + matches from the left
💡 Rare in the real world (we usually swap the tables instead)

🔹 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 + 𝗡𝗨𝗟𝗟 → Missing from Left
👉 Finds data present in the right table but not in the left
📌 Use case:
• Orphan records
• Data mismatch validation

🔹 𝗙𝗨𝗟𝗟 𝗢𝗨𝗧𝗘𝗥 𝗝𝗢𝗜𝗡 → Everything from Both
👉 Combines all records
💡 Think: "Complete picture" 🧩
📌 Use case:
• Data comparison
• Merging datasets

🔹 𝗙𝗨𝗟𝗟 𝗝𝗢𝗜𝗡 + 𝗡𝗨𝗟𝗟 → Differences Only
👉 Keeps only the unmatched records from either side
💡 Think: "Audit mode ON" ⚡
📌 Use case:
• Data reconciliation
• Debugging pipelines

👉 Joins don't combine tables… they define relationships.

Follow for more real-world SQL & data engineering content 🚀

#SQL #DataEngineering #Analytics #LearnSQL #DataPipeline #TechCareer
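For instance, the LEFT JOIN + NULL pattern for "customers who never ordered" might look like this (a sketch; the customers/orders schema is assumed for illustration):

-- Unmatched rows have all-NULL columns on the right side,
-- so filtering on a right-table key isolates them.
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;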
Salam! Window functions changed how I think about data quality.

I used to spot duplicates by writing messy self-joins. I'd catch outliers with manual queries. It was time-consuming and inefficient.

Then I learned window functions.

Now I use ROW_NUMBER() to deduplicate cleanly:

SELECT order_id, customer_id, amount, created_at
FROM (
    SELECT
        order_id, customer_id, amount, created_at,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY created_at DESC
        ) AS row_num
    FROM orders
) ranked
WHERE row_num = 1;

One pass. No self-joins. No mistakes.

LAG() and LEAD() help me spot outliers: sudden spikes in order volume, unexpected gaps in transaction history, values that don't make sense compared to the row before or after.

RANK() and DENSE_RANK() help me find the top customers by spend without losing track of ties.

But here's what I didn't expect: better queries didn't just make me faster. They made my data more trustworthy. Because when stakeholders ask "how did you get that number?", I can explain it cleanly. No "well, I joined the table to itself and then filtered out the nulls and then…"

Window functions aren't just a SQL trick. They're a trust mechanism.

What's one SQL feature you didn't appreciate until you really needed it?

#DataEngineering #SQL #WindowFunctions #DataQuality #AnalyticsEngineering

Wasalam!
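The LAG() idea above might look roughly like this. A sketch only: the daily_orders table, its columns, and the "more than double the previous day" spike rule are all assumptions for the example:

SELECT order_date, order_count, prev_count
FROM (
    SELECT
        order_date,
        order_count,
        LAG(order_count) OVER (ORDER BY order_date) AS prev_count
    FROM daily_orders
) t
-- Flag sudden spikes: volume more than doubled day-over-day.
WHERE prev_count IS NOT NULL
  AND order_count > 2 * prev_count;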
THIS QUERY LOOKS CORRECT. IT IS NOT.

It runs. It returns numbers. But every metric is misleading.

Business problem: calculate per team:
- Total tickets
- Resolved tickets
- Resolution rate (%)
- On-time resolution rate (%) (within 24 hours)
- Average resolution time (hours)
- Quality score (%) based on feedback

Tables involved:
tickets
- ticket_id
- team_id
- created_time
- resolved_time

feedback
- feedback_id
- ticket_id
- rating
- feedback_time

At first glance, this looks like simple aggregation. But real data breaks this logic:
- One ticket can have multiple feedback entries
- Joins inflate counts silently
- On-time % should be based only on resolved tickets
- Average resolution time should exclude unresolved tickets
- Quality % should use only the latest feedback per ticket

So even if your query runs, your metrics are lying.

Think before answering:
Are you counting tickets… or counting duplicated rows?
Are your percentages using the correct denominator?

Fix the logic, not just the syntax.

Comment your answer. Repost if this made you think.
Follow Harish Chatla for real-world SQL problems.
Subscribe to practice on our platform.

#DataRejected #SQL #DataEngineering #Analytics #DataScience #SLA #DataQuality #InterviewPrep #CodingPractice
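One hedged sketch of what the fixed logic could look like, using the schema listed in the post. It is Postgres-flavored (FILTER, EXTRACT(EPOCH ...)), and the rating >= 4 quality threshold is an assumption, since the post doesn't define one:

WITH latest_feedback AS (
    -- Collapse feedback to one row per ticket: the most recent rating.
    SELECT ticket_id, rating
    FROM (
        SELECT ticket_id, rating,
               ROW_NUMBER() OVER (
                   PARTITION BY ticket_id
                   ORDER BY feedback_time DESC
               ) AS rn
        FROM feedback
    ) f
    WHERE rn = 1
)
SELECT
    t.team_id,
    COUNT(*) AS total_tickets,
    COUNT(t.resolved_time) AS resolved_tickets,  -- NULL resolved_time = unresolved
    100.0 * COUNT(t.resolved_time) / COUNT(*) AS resolution_rate_pct,
    -- On-time %: denominator is resolved tickets only.
    100.0 * COUNT(*) FILTER (WHERE t.resolved_time <= t.created_time + INTERVAL '24 hours')
          / NULLIF(COUNT(t.resolved_time), 0) AS on_time_rate_pct,
    -- Average resolution hours: AVG skips NULLs, so unresolved tickets are excluded.
    AVG(EXTRACT(EPOCH FROM (t.resolved_time - t.created_time)) / 3600.0) AS avg_resolution_hours,
    -- Quality %: latest feedback only, so duplicate feedback rows can't inflate it.
    100.0 * AVG(CASE WHEN lf.rating >= 4 THEN 1.0
                     WHEN lf.rating IS NOT NULL THEN 0.0 END) AS quality_pct
FROM tickets t
LEFT JOIN latest_feedback lf ON lf.ticket_id = t.ticket_id
GROUP BY t.team_id;

Because latest_feedback has at most one row per ticket, the join cannot fan out the ticket counts.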
THIS QUERY LOOKS CORRECT. IT IS NOT.

It runs. It returns clean percentages. But the logic is WRONG.

Business problem: calculate per team:
- Resolution rate (%)
- On-time resolution rate (%) (within the 24h SLA)
- Quality score (%) based on feedback

Tables involved:
tickets
- ticket_id
- team_id
- created_time
- resolved_time

feedback
- feedback_id
- ticket_id
- rating
- feedback_time

At first glance, this looks simple. But real data breaks this logic:
- Unresolved tickets are included in the calculations
- The SLA is applied incorrectly
- Multiple feedback rows inflate the quality score
- The latest feedback per ticket is not considered

So even if your query runs, your numbers are misleading.

Think before answering:
Are you using the right denominator?
Are you counting only resolved tickets?
Are duplicates affecting your results?

Fix the logic, not just the syntax.

Comment your answer. Repost if this made you think.
Follow for real-world SQL problems.
Subscribe to practice on our platform.

#DataRejected #SQL #DataEngineering #Analytics #DataScience #InterviewPrep #CodingPractice #BusinessLogic
THIS QUERY LOOKS CORRECT. IT IS NOT.

Most people think this query is correct. It runs. It returns results. But the logic is completely broken.

Business problem: find the latest product review for each customer based on their most recent completed order.

Tables involved:
orders
- order_id
- customer_id
- order_date

payments
- payment_id
- order_id
- status
- payment_date

reviews
- review_id
- order_id
- review_text
- review_date

At first glance, the logic seems simple: join orders → payments → reviews and pick the latest order per customer.

But real data doesn't behave like that:
- One order can have multiple payments
- One order can have multiple reviews (edits / updates)
- Joins create duplicate rows
- "Latest" becomes ambiguous if not handled carefully

So even if your query runs, you might be picking the wrong review.

Think before answering:
Are you selecting the latest order? Or the latest review? Or a random row created by joins?

Fix the logic, not just the syntax.

Comment your answer. Repost if this made you think.
Follow Harish Chatla for more real-world data problems.
Subscribe to practice on our platform.

#DataRejected #SQL #DataEngineering #DataAnalytics #DataScience #LearnByDoing #TechCareers #Analytics #CodingPractice
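One hedged sketch of a correct shape: pin down the latest completed order per customer first, then take the latest review for that order. "Completed" is assumed to mean at least one payment with status = 'COMPLETED'; that value is illustrative:

WITH latest_completed_order AS (
    -- Latest completed order per customer; EXISTS avoids payment fan-out.
    SELECT customer_id, order_id,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY order_date DESC
           ) AS rn
    FROM orders o
    WHERE EXISTS (
        SELECT 1 FROM payments p
        WHERE p.order_id = o.order_id
          AND p.status = 'COMPLETED'
    )
),
latest_review AS (
    -- Latest review per order, so edits/updates don't duplicate rows.
    SELECT order_id, review_text, review_date,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY review_date DESC
           ) AS rn
    FROM reviews
)
SELECT o.customer_id, o.order_id, r.review_text, r.review_date
FROM latest_completed_order o
JOIN latest_review r
    ON r.order_id = o.order_id AND r.rn = 1
WHERE o.rn = 1;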
📊 𝐂𝐎𝐔𝐍𝐓 𝐢𝐧 𝐒𝐐𝐋: 𝐒𝐦𝐚𝐥𝐥 𝐃𝐞𝐭𝐚𝐢𝐥, 𝐁𝐢𝐠 𝐈𝐦𝐩𝐚𝐜𝐭

Counting seems like the easiest operation in SQL. But this is exactly where many analyses quietly go wrong.

𝐂𝐎𝐔𝐍𝐓(*) 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐥𝐥 𝐫𝐨𝐰𝐬.
𝐂𝐎𝐔𝐍𝐓(𝐜𝐨𝐥𝐮𝐦𝐧) 𝐜𝐨𝐮𝐧𝐭𝐬 𝐨𝐧𝐥𝐲 𝐧𝐨𝐧-𝐍𝐔𝐋𝐋 𝐯𝐚𝐥𝐮𝐞𝐬.

At first, the difference feels small. In real data, it's not.

💡 𝐖𝐡𝐚𝐭 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐡𝐚𝐩𝐩𝐞𝐧𝐬?
In most datasets, missing values (NULLs) are common. When you use COUNT(column), SQL automatically ignores those NULLs.
• You're no longer counting rows.
• You're counting available values.
And that difference matters more than it seems.

⚠️ 𝐖𝐡𝐲 𝐭𝐡𝐢𝐬 𝐜𝐫𝐞𝐚𝐭𝐞𝐬 𝐩𝐫𝐨𝐛𝐥𝐞𝐦𝐬
• KPIs get undercounted
• Conversion rates become inaccurate
• Data completeness is misunderstood

𝐄𝐱𝐚𝐦𝐩𝐥𝐞: If 100 users exist but only 80 have a value in the column, COUNT(column) = 80.
👉 It may look like only 80 records exist — but that's not true.

🚀 𝐖𝐡𝐚𝐭 𝐚 𝐠𝐨𝐨𝐝 𝐚𝐧𝐚𝐥𝐲𝐬𝐭 𝐝𝐨𝐞𝐬
• Understands the data before counting
• Checks for NULL values explicitly
• Chooses COUNT logic based on the problem

#SQL #DataAnalytics #DataAnalyst #LearningSQL #SQLConcepts #DataCleaning
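A tiny illustration, assuming a users table where phone_number is sometimes NULL (both names are made up for the example):

SELECT
    COUNT(*)            AS total_users,      -- counts every row, e.g. 100
    COUNT(phone_number) AS users_with_phone  -- skips NULLs, e.g. 80
FROM users;

Both numbers are "correct"; the analysis only goes wrong when one is read as the other.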