Darya S.’s Post

Why do analysts spend 80% of their time on data preparation? ⏳

I recently faced a classic challenge: a CRM export full of "surprises." The Revenue column was a mess: a mix of numbers, currency symbols, "unknown" strings, and missing values. Running any calculation on the raw data would result in an immediate error.

The Problem: convert the data to a numeric type, strip out noise, and handle missing values without losing data or heavily skewing the statistics.

______________________

My Python Solution:

import pandas as pd

# Example of "dirty" data
data = {'revenue': ['100$', ' 150 ', 'unknown', '1,200.50', None]}
df = pd.DataFrame(data)

# 1. Clean noise: keep only digits and the decimal point
# Note: I'm assuming a dot is the decimal separator here
df['revenue_clean'] = df['revenue'].str.replace(r'[^\d.]', '', regex=True)

# 2. Convert to numeric (non-parseable values become NaN)
df['revenue_clean'] = pd.to_numeric(df['revenue_clean'], errors='coerce')

# 3. Handle missing values using median imputation
# This preserves sample size and is less sensitive to outliers than the mean
median_val = df['revenue_clean'].median()
df['revenue_clean'] = df['revenue_clean'].fillna(median_val)

print(df)

______________________

Why this approach?

Regex Flexibility: it cleans most currency formats in a single line (a comma-decimal variant is sketched below).

Strategic Coercion: errors='coerce' in to_numeric is a lifesaver. It systematically turns "garbage" strings into NaN, which Pandas handles natively.

Median vs Mean: outliers are common in financial data, and median imputation distorts the distribution less than a simple average (a toy comparison is sketched below).

The Result: a reliable dataset ready for a dashboard or a deep dive.

What are your go-to data cleaning methods? Let's discuss in the comments! 👇

#dataanalysis #python #pandas #datacleaning #analytics #datascience #sql #LinkedIn
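______________________

A note on the decimal-separator assumption above: if the export uses European formatting ('.' for thousands, ',' for decimals), a minimal sketch like the one below, with made-up sample values, normalizes the separators before converting:

import pandas as pd

# Hypothetical European-style export: '.' is the thousands separator,
# ',' is the decimal separator (values are illustrative only)
data_eu = {'revenue': ['100$', ' 150 ', 'unknown', '1.200,50', None]}
df_eu = pd.DataFrame(data_eu)

# Keep digits, dots, and commas, then normalize:
# drop the thousands dots and turn the decimal comma into a dot
cleaned = (
    df_eu['revenue']
    .str.replace(r'[^\d.,]', '', regex=True)
    .str.replace('.', '', regex=False)   # remove thousands separators
    .str.replace(',', '.', regex=False)  # decimal comma -> decimal dot
)
df_eu['revenue_clean'] = pd.to_numeric(cleaned, errors='coerce')
print(df_eu)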
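And to make the Median vs Mean point concrete, a toy comparison with made-up numbers: a single large deal drags the mean far above the typical value, while the median barely moves:

import pandas as pd

# Five deals, one extreme outlier (numbers are illustrative only)
revenue = pd.Series([100.0, 120.0, 130.0, 150.0, 50_000.0])
print(revenue.mean())    # 10100.0 -- pulled up by the outlier
print(revenue.median())  # 130.0   -- robust to the outlier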


