🚀 New Webinar: Fabric Data Engineering with Python Notebooks
📅 April 2, 2026 | 12:00–1:30 PM EDT | Online

If you’re building on Microsoft Fabric and looking to do more with less, this session is going to be a game‑changer. Python notebooks are quickly becoming the most cost‑efficient and flexible way to engineer data in Fabric, especially for small teams and organizations watching capacity consumption closely. In this webinar, we’ll explore how to design smarter pipelines using modern libraries like Polars, Delta Lake, DuckDB, and MS SQL, and how to evaluate cost tradeoffs using the Capacity Metrics app.

🎤 Speaker: John Miner
Senior Data Architect at Insight Digital Innovation
10x Microsoft MVP | 30+ years of data engineering expertise

John will walk through practical patterns, real‑world examples, and cost‑optimized design strategies you can apply immediately.

💡 You’ll learn:
- Why Spark notebooks and Dataflows Gen2 can be more expensive than Python notebooks
- How to build efficient ETL pipelines using modern Python data libraries
- How to compare engineering designs using Fabric’s Capacity Metrics
- How small companies can maximize value with minimal capacity

🔗 Register here: https://lnkd.in/dnm6irSM

FutureDataDriven CloudDataDriven #microsoftfabric #dataengineering #python
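To give a flavor of the pattern the session covers, here is a minimal sketch (not taken from the talk materials): Polars does the in-process transform, DuckDB offers a SQL view over the same frame, and delta-rs lands the result as a Delta table with no Spark session. The Lakehouse paths and column names are hypothetical, and the write_delta step assumes the deltalake package is available.

import duckdb
import polars as pl

# Hypothetical Lakehouse file path and schema (order_date, amount, ...)
orders = pl.read_csv("/lakehouse/default/Files/raw/orders.csv")

# Transform with Polars: parse dates and aggregate daily totals
daily = (
    orders
    .with_columns(pl.col("order_date").str.to_date())
    .group_by("order_date")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# The same step expressed in SQL via DuckDB over the in-memory frame
daily_sql = duckdb.sql(
    "SELECT order_date, SUM(amount) AS total_amount FROM orders GROUP BY order_date"
).pl()

# Write the result as a Delta table (delta-rs under the hood, no Spark required)
daily.write_delta("/lakehouse/default/Tables/daily_orders", mode="overwrite")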
More Relevant Posts
Raw data doesn’t become useful because you visualise it – it becomes useful because you model it properly. SQL for shaping logic. Python for cleaning and exploration. dbt for turning transformations into reliable, version-controlled data products. And GitHub is where all of it stops being “analysis” and starts becoming engineering. That’s the shift: from writing queries to building systems.
🔄 Every real Data Science project follows a lifecycle — not just a Jupyter notebook. From defining business goals → acquiring data → EDA → modeling → evaluation → deployment & monitoring. The part most beginners skip? Business Understanding and MLOps — the two ends that actually determine if your model creates value in production. Which stage do you find most challenging? Drop it in the comments 👇 #DataScience #MachineLearning #MLOps #DataEngineering #Python
📅 Day 73 of #100DaysOfCode – and today the data told a story I didn't expect!

Today's focus: data visualization with Matplotlib using real StackOverflow data on programming language popularity from 2008 to 2020.

Here's what I worked through today:
🔧 Renamed DataFrame columns using the names parameter in read_csv() for cleaner, more readable data
📅 Converted messy datetime strings into proper pandas datetime objects, a crucial data cleaning step before any time series analysis
🔍 Used groupby() + sum() + idxmax() to identify the most popular programming language of all time by total posts (spoiler: JavaScript 👑)
📊 Filtered DataFrames using boolean indexing to isolate specific languages for visualization
📈 Plotted time series data with Matplotlib: first a single language, then overlaid two languages on the same chart

The most compelling insight? The chart says it all:
🔵 Java peaked around 2013-2014 and has been declining ever since
🟠 Python has been on a relentless rise, and by 2020 it's not even close

The numbers don't lie. If you're wondering whether to learn Python, the StackOverflow community already voted with their questions.

Onward to Day 74! 💪

#Python #Pandas #Matplotlib #DataVisualization #100DaysOfCode #DataScience #ContinuousLearning #MicrosoftFabric
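For anyone following along, a rough sketch of those steps might look like the snippet below. The file name, column names, and header layout are guesses, since the actual notebook isn't shown in the post.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export: one row per (date, tag) with a post count
df = pd.read_csv("QueryResults.csv", names=["DATE", "TAG", "POSTS"], header=0)
df["DATE"] = pd.to_datetime(df["DATE"])          # messy strings -> proper datetimes

totals = df.groupby("TAG")["POSTS"].sum()        # groupby + sum + idxmax
print("Most popular overall:", totals.idxmax())

java = df[df["TAG"] == "java"]                   # boolean indexing per language
python = df[df["TAG"] == "python"]
plt.plot(java["DATE"], java["POSTS"], label="java")
plt.plot(python["DATE"], python["POSTS"], label="python")
plt.legend()
plt.show()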
🚀 Day 12/20 – Python for Data Engineering
Filtering & Selecting Data (Pandas)

Now that we know what a DataFrame is…
👉 The real work starts here: getting only the data you need

🔹 Selecting Columns
df["name"] 👉 Select a single column
df[["name", "salary"]] 👉 Select multiple columns

🔹 Filtering Rows
df[df["salary"] > 50000] 👉 Get rows based on a condition

🔹 Multiple Conditions
df[(df["salary"] > 50000) & (df["age"] < 30)] 👉 Combine conditions

🔹 Why This Matters
Reduce unnecessary data
Focus on relevant records
Improve performance

🔹 Real-World Use
👉 Raw Data → Filter → Useful Data

💡 Quick Summary
Selecting = columns
Filtering = rows

💡 Something to remember
You don’t need all the data… you need the right data.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
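A tiny runnable version of the same ideas, using made-up rows so the snippets above can be pasted and tried as-is:

import pandas as pd

# Toy data purely for illustration; columns mirror the examples above
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chitra"],
    "age": [28, 34, 25],
    "salary": [62000, 48000, 71000],
})

print(df[["name", "salary"]])                         # select columns
print(df[df["salary"] > 50000])                       # filter rows
print(df[(df["salary"] > 50000) & (df["age"] < 30)])  # combine conditions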
How I bypassed the Pandas "Object Tax" to process 10 million rows 8x faster with 78% less RAM. 🏎️💨

Standard Python data pipelines are bleeding compute cash. When you run pd.read_csv() on a massive file, Python loads the entire thing into memory and wraps every single value in a heavy Python object. This "Object Tax" is what causes your server to spike in cost and eventually crash with an "Out of Memory" (OOM) error.

The Baseline (10 million rows / ~400MB CSV):
❌ Standard Pandas: 10.61 seconds | 1,738 MB RAM

The Solution: I built Axiom-CSV, a custom C extension for Python that uses memory mapping (mmap) and pointer arithmetic. It scans the raw bytes directly from disk and calculates aggregations on the fly, entirely bypassing the Python heap.

The Axiom Benchmark:
✅ Axiom-CSV (C-Bridge): 1.34 seconds | 375 MB RAM

The ROI (why this matters): By dropping the memory footprint by 78%, you can process enterprise-level datasets on a $5/month AWS t2.micro instead of a $40/month high-memory instance. You don't need "more RAM." You need better architecture.

The Proof & Code: https://lnkd.in/gd-FBdvB

DM me: I am conducting 2 architecture audits this week for teams hitting performance walls in their Python pipelines. Let’s translate your latency into balance sheet savings.

#Python #DataEngineering #PerformanceEngineering #CProgramming #SystemsArchitecture #CloudOptimization #Pandas #ZeroLatency
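The Axiom-CSV source is behind the link above. Purely as an illustration of the general technique (memory-map the file and aggregate while streaming over raw bytes, so the CSV is never materialised as per-value Python objects), a rough pure-Python sketch could look like this. It keeps memory flat, but being pure Python it will not approach a C extension's speed; the file name and layout are hypothetical.

import mmap

total = 0.0
count = 0
with open("big.csv", "rb") as f:                         # hypothetical file: id,amount per line
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.readline()                                    # skip the header row
        for line in iter(mm.readline, b""):              # stream line by line off the mapping
            total += float(line.rsplit(b",", 1)[1])      # aggregate the last column on the fly
            count += 1
print(total, count)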
🚀 Day 4/20 – Python for Data Engineering
Reading & Writing Files (CSV / JSON)

In data engineering, data rarely comes clean.
👉 It usually comes from files, logs, exports, and APIs.
So the ability to read and write data is fundamental.

🔹 Why File Handling Matters
We often ingest raw data, process it, and store the cleaned output.
👉 Python helps us do all of this easily.

🔹 Reading a CSV File
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
👉 Loads structured data into a DataFrame

🔹 Reading a JSON File
import json
with open("data.json") as f:
    data = json.load(f)
print(data)
👉 Useful for API responses and semi-structured data

🔹 Writing Data to a File
df.to_csv("output.csv", index=False)
👉 Save processed data for further use

🔹 Where You’ll Use This
Data ingestion pipelines
Data transformation workflows
Exporting results
Logging and backups

💡 Quick Summary
Python allows you to read data from multiple formats, process it, and write it back efficiently.

💡 Something to remember
Data engineering starts with reading data… and ends with writing it in a better form.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
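Putting the pieces above together, a minimal read → clean → write round trip might look like this (file names are hypothetical, and the JSON is assumed to be a list of records):

import json
import pandas as pd

with open("data.json") as f:
    records = json.load(f)               # e.g. a list of dicts exported from an API

df = pd.DataFrame(records)
df = df.drop_duplicates()                # a small cleaning step before saving
df.to_csv("output.csv", index=False)     # cleaned output, ready for the next stage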
Most small businesses lose hours every week updating data manually. ⏳

I recently built a reliable Python pipeline that handles the heavy lifting:
✅ Fetches data directly from APIs
✅ Cleans data & removes duplicates
✅ Stores everything in a structured PostgreSQL database
✅ Updates automatically every day

No more manual copy-paste. No more messy spreadsheets. 🚫📊

This is a game-changer if you deal with:
• Growing Excel files that crash constantly
• API data that needs daily manual updates
• Repetitive, boring reporting tasks

If this sounds familiar, I can help you automate your workflow and reclaim your time. 🚀

Check out the Demo & Code here: 👇
https://lnkd.in/dyXCXSPk

#DataAutomation #Python #ETL #SmallBusiness #Automation
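The actual demo is behind the link above; a minimal sketch of the same flow, with a hypothetical endpoint, key column, and connection string, looks roughly like this:

import pandas as pd
import requests
from sqlalchemy import create_engine

# Fetch from the API (hypothetical endpoint)
resp = requests.get("https://api.example.com/orders", timeout=30)
resp.raise_for_status()

# Clean: assumes the response is a JSON list with an order_id key
df = pd.DataFrame(resp.json())
df = df.drop_duplicates(subset=["order_id"])

# Store in PostgreSQL (placeholder connection string)
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/analytics")
df.to_sql("orders", engine, if_exists="append", index=False)

# Run this script daily with cron, Airflow, or any scheduler to keep the table fresh.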
I still remember the day our backend system crashed under 10 million rows of user data. It was 2 AM. The ETL pipeline was choking. My first instinct? Write more loops in Python. Big mistake.

That's when I learned the hard way: raw Python loops don't scale. But Pandas and NumPy do.

Here's what changed everything: instead of iterating row by row, I switched to vectorized operations with NumPy. What took 45 minutes dropped to under 3 minutes. For data transformations, I started using Pandas apply() with axis parameters and groupby() aggregations instead of nested loops. Memory usage dropped by 60%.

Three practices that saved our backend:
1. Specify dtypes upfront when reading CSVs. Loading int32 instead of int64 cut memory in half for large integer columns.
2. Use chunksize for massive files. Processing 50 million rows in 100k chunks kept our servers stable.
3. Convert categorical columns to the category dtype. This single change reduced memory by 70% on dimension tables.

The result? Our data pipeline now handles 50 million records daily without breaking a sweat.

The lesson: efficient data processing isn't about writing more code. It's about writing smarter code.

What's your go-to optimization trick for handling large datasets?

#Python #BackendDevelopment #DataEngineering #SoftwareEngineering
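A quick sketch of those three practices working together; the file and column names are illustrative, not the real pipeline:

import numpy as np
import pandas as pd

# 1) Declare dtypes up front: narrow ints/floats and category for low-cardinality text
dtypes = {"user_id": "int32", "amount": "float32", "country": "category"}

total = 0.0
# 2) Stream the file in chunks instead of loading it all at once
for chunk in pd.read_csv("events.csv", dtype=dtypes, chunksize=100_000):
    # 3) Vectorised work on each chunk instead of row-by-row loops
    total += np.where(chunk["amount"] > 0, chunk["amount"], 0).sum()

print(total)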
🚀 Day 67 – Project Work | Pandas for Data Handling

Today I worked with Pandas, one of the most important Python libraries for data manipulation in Machine Learning projects 📊🐼

🔹 What I worked on today:
✔️ Loaded the dataset using Pandas
✔️ Cleaned missing values
✔️ Handled duplicates & inconsistencies
✔️ Performed basic data analysis
✔️ Converted data into a model-ready format

🔹 Key concepts I used:
👉 DataFrames & Series
👉 Data cleaning techniques
👉 Filtering & selecting data
👉 Feature preparation

🔹 How it helped my project:
🎯 Improved data quality before prediction
🎯 Made the preprocessing pipeline more efficient
🎯 Better understanding of real-world messy data

🔹 Challenges:
⚡ Handling null values correctly
⚡ Choosing the right preprocessing steps
⚡ Managing large datasets

🔹 What I learned:
💡 Good data = good model performance
💡 Pandas is the backbone of data preprocessing
💡 Small cleaning steps make a big difference

📌 Next step: Integrate Pandas preprocessing directly into my FastAPI pipeline 🚀

#Day67 #Pandas #DataScience #MachineLearning #FastAPI #Python #ProjectWork
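A small illustration of the kind of cleaning steps listed above; the dataset isn't shared in the post, so the file name and columns here are hypothetical:

import pandas as pd

df = pd.read_csv("train.csv")                         # load the dataset
df = df.drop_duplicates()                             # handle duplicates
df["age"] = df["age"].fillna(df["age"].median())      # fill numeric gaps
df = df.dropna(subset=["target"])                     # drop rows missing the label

X = pd.get_dummies(df.drop(columns=["target"]))       # model-ready features
y = df["target"]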
Week 9: The Data Immersed Python Cohort

This week focused on data visualization using Matplotlib, turning raw data into clear, actionable insights.

I built a supply chain dashboard showing:
• Cost trends over time
• Supplier performance (cost & order volume)
• Cost distribution patterns
• Relationship between order quantity and cost

Key insight: Most orders are low-cost, but a few high-value orders drive overall variability. Supplier A stands out as both the most used and the most expensive supplier.

The Data Immersed (TDI) Anne Nnamani Python Software Foundation
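The cohort dataset isn't shown, but a four-panel dashboard along those lines can be sketched with Matplotlib like this (column names are hypothetical):

import matplotlib.pyplot as plt
import pandas as pd

# Assumed columns: order_date, supplier, quantity, cost
df = pd.read_csv("supply_chain.csv", parse_dates=["order_date"])

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df.groupby("order_date")["cost"].sum().plot(ax=axes[0, 0], title="Cost over time")
df.groupby("supplier")["cost"].sum().plot.bar(ax=axes[0, 1], title="Cost by supplier")
df["cost"].plot.hist(ax=axes[1, 0], bins=30, title="Cost distribution")
axes[1, 1].scatter(df["quantity"], df["cost"])
axes[1, 1].set_title("Order quantity vs cost")
plt.tight_layout()
plt.show()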
More from this author
Just added the slide deck and code to my repo. The "show tables" command does not work with the new Fabric Lakehouse Schemas feature; I reported the bug to the product team. Everything else was double-checked before publishing. Hope you enjoyed the talk! https://github.com/JohnMiner3/community-work/tree/master/fabric-python-notebooks