One of the most common questions I get from data teams: "Should we use Python, PySpark, or Power Query for this?"

Wrong question. The right question is: what does your data look like, and who needs the output? Here's how I think about it after years of working across all three 👇

🐍 Python + Pandas — your everyday workhorse
Use it when your dataset fits comfortably in memory (think under 1–2 GB), you need full flexibility for modeling, transformation, or automation, and the output feeds analysts or data pipelines. In my MMM projects, Pandas handles 90% of the data preparation work — cleaning, reshaping, feature engineering. Fast to write, easy to debug, and endlessly flexible.

⚡ PySpark — when the data fights back
Use it when you're dealing with volumes that crash Pandas, processing needs to be distributed, or you're operating in a cloud environment like Databricks. On one retail project, I processed 1TB+ of transaction data across millions of rows. Pandas was simply not an option. PySpark turned a memory problem into a pipeline problem — and pipelines are solvable.

📊 Power Query / Power BI — closer to the business
Use it when business users own the data refresh, the output is a dashboard consumed by non-technical stakeholders, and the transformation logic needs to be auditable without writing code. Power Query sits between Excel and a real ETL layer. It's not for engineers — it's for the business analyst who needs to own their data without depending on a data team every Monday morning.

The honest advice: Don't pick a tool because you know it. Pick it because it fits the scale, the audience, and the maintenance burden.

The best data professionals I've worked with don't defend their favorite tool. They ask: who will maintain this in 6 months? That question alone will save your team from a lot of pain.

What's your go-to tool — and have you ever picked the wrong one? 👇

#DataEngineering #Python #PySpark #PowerBI #DataAnalytics #Analytics
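To make the scale trade-off concrete, here is a minimal side-by-side sketch of the same aggregation in Pandas and PySpark. The file paths and the region/revenue columns are hypothetical placeholders, not from any specific project:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Pandas: fine while the file fits comfortably in memory (hypothetical file)
df = pd.read_csv("transactions.csv")
summary = df.groupby("region")["revenue"].sum().reset_index()

# PySpark: the same logic, distributed across a cluster (hypothetical path)
spark = SparkSession.builder.appName("revenue-by-region").getOrCreate()
sdf = spark.read.csv("s3://bucket/transactions/", header=True, inferSchema=True)
summary_sdf = sdf.groupBy("region").agg(F.sum("revenue").alias("revenue"))
```

The intent is identical in both; the only thing that changes is where the computation runs.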
So you want to get into Data Engineering… but don't know where to start?

I've been there. You hear terms like pipelines, ETL, Spark, Airflow — and suddenly it feels overwhelming. But here's the truth: you don't need to learn everything at once. You just need to start building.

Here's a beginner-friendly way to break into Data Engineering:

🔹 1. Understand what a pipeline really is
At its core, a data pipeline is simple: Collect → Process → Store → Use. That's it. Don't overcomplicate it.

🔹 2. Start small (seriously, tiny projects!)
• Pull data from an API (like weather or stock data)
• Clean it using Python (Pandas is your best friend)
• Store it in a database (MySQL/PostgreSQL)
• Visualize it (Power BI / Tableau)
Boom — you just built your first pipeline. (A minimal end-to-end sketch follows this post.)

🔹 3. Tools you can start with (no need to overlearn):
• Python 🐍
• SQL 📊
• Pandas
• Basic Cloud (AWS/GCP/Azure — pick one)
• Optional later: Airflow, Spark

🔹 4. Focus on consistency > complexity
It's better to build 5 simple pipelines than 1 "perfect" complicated one.

🔹 5. Think like a Data Engineer
Ask yourself:
• Where is the data coming from?
• How often should it update?
• What happens if it fails?
That mindset matters more than tools.

Final tip: Don't just learn. Document your projects. Share them. Break things. Fix them. That's how you grow.

If you're just starting out — you're not behind. You're just at the beginning of something powerful.

#DataEngineering #Beginners #TechJourney #LearningInPublic #DataPipeline #Python #SQL #innove8
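If step 2 sounds abstract, here is a minimal sketch of the whole Collect → Process → Store loop in one script. The API endpoint and the temperature field are hypothetical placeholders:

```python
import sqlite3

import pandas as pd
import requests

# Collect: pull JSON from a public API (hypothetical endpoint and fields)
resp = requests.get("https://api.example.com/weather", params={"city": "London"})
resp.raise_for_status()

# Process: clean it with Pandas
df = pd.DataFrame(resp.json())
df = df.dropna(subset=["temperature"]).drop_duplicates()

# Store: SQLite keeps the sketch self-contained; swap in MySQL/PostgreSQL later
conn = sqlite3.connect("weather.db")
df.to_sql("weather", conn, if_exists="append", index=False)
conn.close()

# Use: from here, point Power BI / Tableau (or pandas itself) at weather.db
```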
Why I'm Focusing on Data Engineering

The more I work with data, the more I realize one important thing:
👉 Data is only valuable when it is clean, reliable, and available at the right time.

Behind every dashboard, report, and business decision, there is a strong data pipeline making it possible. That's one of the biggest reasons I'm focusing deeply on Data Engineering.

Right now, I'm strengthening my skills in:
✅ SQL — querying and transforming data efficiently
✅ Python — automation and data processing
✅ PySpark — handling large-scale distributed data
✅ Databricks — building modern data workflows
✅ Tableau — turning raw data into meaningful insights

What excites me most about Data Engineering is that it is not just about moving data from one system to another. It is about building scalable, reliable, and trusted data systems that help businesses make better decisions.

Going forward, I'll be sharing:
• Practical learnings
• Real-world concepts
• SQL and PySpark tips
• Data Engineering best practices
• Insights from modern data tools

Excited to keep learning, building, and growing in this journey.

#DataEngineering #SQL #Python #PySpark #Databricks #Tableau #DataAnalytics #ETL #BigData
PySpark vs Spark SQL: I don't choose based on preference, I choose based on the job.

It's easy to turn this into a "which is better" debate. In practice, both are useful, just for different reasons.

And one thing is often misunderstood: Spark doesn't execute "Python" or "SQL" the way people think. It executes a logical plan → optimised plan → physical plan. So a lot of the time, the real difference isn't performance, it's how clearly you express intent and how maintainable the pipeline is.

When Spark SQL wins
• The work is mostly select, join, filter, aggregate
• Logic needs to be readable by more people (analysts + engineers)
• I want quick iteration and clear intent
• Performance tuning is easier because the query shape is obvious

When PySpark wins
• I need custom logic that's awkward in SQL
• Complex parsing, nested structures, arrays/maps, JSON-heavy work
• Reusable functions and cleaner code structure (modules, unit tests)
• Integration steps around the transformation (validation, file handling, etc.)

The real trade-off
• SQL usually optimizes for clarity.
• PySpark usually optimizes for flexibility.

The best pattern I've seen
• Use SQL for the core transformations (joins/aggregations)
• Use PySpark for the edges (validation, enrichment, complex business rules)
• Keep one "source of truth" so business logic doesn't get duplicated

Takeaway: Choosing PySpark vs Spark SQL isn't a style choice. It's a maintainability and delivery choice.

Drop your go-to rule for choosing between them in the comments.

#PySpark #SparkSQL #DataEngineering #Databricks #BigData #SQL #AnalyticsEngineering #DataPipelines
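A quick way to convince yourself that both routes land on the same plan: express one transformation both ways and compare the output of explain(). The orders table, its columns, and the path here are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()
orders = spark.read.parquet("s3://bucket/orders/")  # hypothetical path and schema
orders.createOrReplaceTempView("orders")

# Spark SQL: the intent is obvious to anyone who reads SQL
sql_result = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE status = 'complete'
    GROUP BY region
""")

# PySpark DataFrame API: the same logic, easier to modularize and unit-test
df_result = (
    orders.filter(F.col("status") == "complete")
          .groupBy("region")
          .agg(F.sum("amount").alias("total"))
)

# Both compile down to the same optimized physical plan; compare for yourself:
sql_result.explain()
df_result.explain()
```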
SQL has been the backbone of data analysis for decades. But writing SQL is still not natural for most people.

The recent alpha release of ggsql by Posit points toward an interesting shift — making SQL more intuitive, readable, and closer to how analysts actually think. (https://lnkd.in/gEtCmCny)

Instead of writing complex queries from scratch, the idea is to simplify how we express data transformations, especially for those already working in data ecosystems like R or Python.

This reflects a broader trend I've been noticing: the gap between "data tools" and "human thinking" is slowly shrinking. We are moving toward:
• More expressive query layers
• Less boilerplate code
• Faster iteration for analysts

For teams working with large-scale data platforms, this could reduce friction significantly — especially for analysts who spend more time translating business questions into queries than actually analyzing results.

At the same time, it raises an important question: will tools like this abstract too much, or will they actually enable better thinking by removing unnecessary complexity?

In my experience, the real bottleneck in analytics is not writing SQL — it's framing the right question. If tools like ggsql can reduce the effort spent on syntax, it could allow teams to focus more on insights and decision-making.

Curious to hear — do you see this as the future of querying, or just another abstraction layer?

#SQL #DataAnalytics #DataTools #RStats #AnalyticsEngineering #DataStrategy #AI #GenAI
Excited to share one of my recent builds: Unified Project Analytics & Telemetry Platform 🚀

As I worked on multiple personal projects, I noticed each one was generating valuable data such as logs, metrics, predictions, clicks, response times, and usage events. Instead of keeping everything isolated, I built a centralized platform to collect, organize, and analyze all of it in one place.

LINK: https://lnkd.in/gskevhbR

Integrated projects:
• URL Shortener
• Freshness Indicator
• RAG QA System

What the platform does:
• Collects telemetry from multiple applications
• Ingests events through REST APIs
• Runs ETL pipelines for cleaning and aggregation
• Stores structured analytics data in SQL
• Visualizes insights through dashboards

Tech Stack: Python | Pandas | FastAPI | SQL | Power BI | Git | ETL

Key Insights Tracked:
• Click analytics
• Prediction trends
• Response latency
• Usage metrics
• Cross-project performance monitoring

Building this project gave me hands-on experience in centralized observability, analytics pipelines, schema design, backend APIs, and end-to-end data engineering workflows.

Always learning and building. Open to feedback and opportunities in Data Engineering / Backend / Analytics roles.

#DataEngineering #Python #SQL #FastAPI #PowerBI #ETL #Analytics #BackendDevelopment #Projects
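For anyone curious what "ingests events through REST APIs" can look like, here is a minimal FastAPI sketch of such an endpoint. The Event schema, field names, and route are illustrative assumptions, not the platform's actual API:

```python
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Event(BaseModel):
    # Hypothetical schema: adapt to whatever each project actually emits
    project: str      # e.g. "url-shortener", "rag-qa"
    event_type: str   # e.g. "click", "prediction", "latency"
    value: float
    metadata: dict = {}

@app.post("/events")
def ingest_event(event: Event):
    # A real pipeline would write to a queue or staging table here;
    # this sketch just echoes the event back with a server-side timestamp.
    return {"received_at": datetime.now(timezone.utc).isoformat(),
            **event.model_dump()}  # model_dump() assumes Pydantic v2
```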
Grouping data by multiple dimensions is one of the most fundamental operations in data analysis, and pandas' groupby() is the tool that makes it fast, flexible, and readable.

Whether you need total revenue by region and product, average salary by department and job level, or headcount by team and location — the pattern is always the same: pass a list of columns, choose your aggregation, reset the index.

The power comes when you combine it with named aggregation for multiple clean metrics, transform() for group statistics per row, and filter() for removing entire groups.

Master these patterns and transforming raw data into multi-dimensional summaries becomes one of the fastest steps in your analysis workflow.

Read the full post here: https://lnkd.in/eU3Vb3Sz

#Python #Pandas #DataScience #DataAnalysis #DataEngineering #Analytics
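Here are all three patterns on a toy dataset, as a minimal sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 250, 80, 120, 300],
})

# Named aggregation: multiple clean metrics per group
summary = (df.groupby(["region", "product"])
             .agg(total_revenue=("revenue", "sum"),
                  avg_revenue=("revenue", "mean"))
             .reset_index())

# transform(): a group statistic broadcast back to every row
df["region_share"] = df["revenue"] / df.groupby("region")["revenue"].transform("sum")

# filter(): drop entire groups that fail a condition (here, small regions)
big_regions = df.groupby("region").filter(lambda g: g["revenue"].sum() > 400)
```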
Dear aspiring Data Engineers,

Stop collecting PDFs and watching tutorials. Start building projects.

Start here, fundamentals first:

1. Build an end-to-end ETL pipeline using Python + SQL. No shortcuts. Understand every layer.
2. Ingest data from S3 into Snowflake using Python. Schedule it with Airflow. Handle failures.
3. Extract data with Python, dump it into SQL, and build a Power BI report on top of it.
4. Pull live data from a public API (weather, crypto, sports) → raw storage → dbt transformations → analytics layer
5. Build a file ingestion pipeline for CSV/Excel → automate it → log every failure with context
6. Process JSON log data → parse nested fields → flatten → load into Snowflake
7. Replace your full refresh with incremental CDC loading → measure the performance difference yourself
8. Build a real-time streaming pipeline with Kafka → process events → serve analytics-ready data
9. Build a data quality framework from scratch → null checks, duplicate detection, schema validation using Python + dbt tests (a minimal sketch follows this post)
10. Design a proper Star Schema for an e-commerce dataset → fact + dimension tables → connect a BI tool
11. Orchestrate 3+ pipelines in Airflow with real dependencies, retry logic, and Slack alerts
12. Pick any ELT tool like Fivetran or Matillion and build an end-to-end data migration project.
13. Move data from S3 to Snowflake and from Snowflake to Azure Blob. Cross-connect to get comfortable with multiple cloud storage services.
14. Build a full ELT pipeline → raw → staging → marts in dbt → follow the exact patterns used in production
15. Connect your data mart to Power BI or Tableau → build a dashboard a business user can actually use

Don't ask "which tool should I learn next?" Ask "what problem can I solve today?"

Tools are just instruments. Problem-solving is the skill. And the only way to develop it is by building things that break and fixing them yourself.

That's what separates a Data Engineer from a Staff Data Engineer.

#DataEngineering #Snowflake #dbt #Airflow #Python #ETL #DataPipeline #CloudLearningYard
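For project 9 in the list above, here is a minimal sketch of what the first Python checks might look like. The order_id/customer_id/amount schema and the staged file are hypothetical:

```python
import pandas as pd

# Hypothetical business key and required schema for a staged dataset
REQUIRED_COLUMNS = {"order_id", "customer_id", "amount"}

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    failures = []

    # Schema validation: are all expected columns present?
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")

    if "order_id" in df:
        # Null check on the key field
        null_ids = int(df["order_id"].isnull().sum())
        if null_ids:
            failures.append(f"{null_ids} rows with null order_id")

        # Duplicate detection on the business key
        dupes = int(df.duplicated(subset=["order_id"]).sum())
        if dupes:
            failures.append(f"{dupes} duplicate order_id values")

    return failures

checks = run_quality_checks(pd.read_csv("staged_orders.csv"))  # hypothetical file
if checks:
    raise ValueError("Data quality failed: " + "; ".join(checks))
```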
Ever wondered: if code can iterate efficiently, why can't data pipelines? 🤔 Databricks' For Each task answers exactly that.

Simplify repetitive workflows with the For Each task in Databricks Jobs. It lets you loop through a list of inputs — table names, regions, IDs — and run a nested task (notebook, SQL, or Python script) for each item. Each iteration runs independently and can even run in parallel. ⚡

Now you might think, "Creating a loop inside a job must be complex, right?" Not at all — it's actually just 3 simple steps 👇
1️⃣ Create a list of parameters (e.g., countries)
2️⃣ Pass that list to a For Each task
3️⃣ Run one nested notebook that dynamically picks up each value (a minimal sketch follows this post)

✨ Bonus: only failed iterations rerun — no more wasting time reprocessing 10 items when just 2 failed. A huge time-saver! ✅

What makes it great:
→ Enables parallel execution with configurable concurrency (1–100)
→ Retries only failed iterations, saving time and frustration
→ Optimizes cost by eliminating redundant processing

⚠️ Worth knowing:
→ A For Each task can contain only one nested task
→ Nested For Each (loops inside loops) isn't supported
→ Works best with simple lists or flat JSON — deeply nested structures can get tricky

A small feature, but a big step toward more modular and scalable pipelines. 🚀

#DataEngineering #Databricks #DataPipelines #ETL #LearningInPublic
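As a rough illustration of step 3, the nested notebook might look something like this. The parameter name, the table names, and reading the value via a widget are all assumptions; match them to your own job configuration:

```python
# Nested notebook executed once per item by the For Each task.
# `spark` and `dbutils` are predefined in Databricks notebooks.
# Assumption: the loop passes each list value in as a notebook
# parameter named "country" (hypothetical name).
country = dbutils.widgets.get("country")

# Process only this iteration's slice; a failure here reruns independently
orders = spark.table("sales.orders").filter(f"country = '{country}'")  # hypothetical table
orders.write.mode("overwrite").saveAsTable(f"sales.orders_{country.lower()}")
```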
Still using Excel or Google Sheets for daily reporting and data preparation? It might be time to rethink your approach.

While spreadsheets are great for quick analysis, they often fall short when it comes to handling large datasets, repetitive workflows, and scalable ETL (Extract, Transform, Load) processes. Here's where Python steps in 👇

🔹 Data Extraction
With Python libraries like pandas, requests, or database connectors, you can automatically pull data from multiple sources — APIs, databases, CSVs — without manual effort.

🔹 ETL Process (Extract → Transform → Load)
Instead of repetitive Excel formulas and copy-paste steps:
• Clean and transform data programmatically
• Apply complex logic consistently
• Automate recurring workflows

🔹 Structured Data Pipelines
Build a proper, reusable pipeline: Raw Data → Cleaning → Transformation → Validation → Final Output. This ensures consistency, reduces errors, and saves time.

🔹 Handling Large Datasets
Excel and Sheets struggle with scale. Python can efficiently process millions of rows without crashing or slowing down your workflow.

🔹 Automation = Efficiency
Schedule your scripts to run daily reports automatically. No manual intervention. No missed steps.

💡 The result? Faster processing, fewer errors, scalable workflows, and more time to focus on insights instead of manual data prep.

If you're still relying heavily on spreadsheets for ETL, it's worth exploring Python — even small steps can lead to massive productivity gains.

#DataEngineering #Python #ETL #Automation #DataAnalytics #Productivity
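As a minimal sketch of that Raw → Cleaning → Transformation → Validation → Final Output pipeline, with hypothetical file and column names:

```python
import pandas as pd

# Raw: read the same spreadsheet you'd normally open by hand (hypothetical file)
raw = pd.read_excel("daily_report.xlsx", sheet_name="raw_data")

# Cleaning: the steps you'd otherwise repeat manually every day
clean = (raw.dropna(subset=["order_date"])
            .drop_duplicates()
            .rename(columns=str.lower))

# Transformation: apply logic consistently, no copy-paste formulas
clean["order_date"] = pd.to_datetime(clean["order_date"])
daily = clean.groupby(clean["order_date"].dt.date)["amount"].sum().reset_index()

# Validation: fail loudly instead of silently shipping a broken report
assert daily["amount"].ge(0).all(), "negative daily totals found"

# Final output: one file, same shape, every single day
daily.to_csv("daily_summary.csv", index=False)
```

Schedule a script like this with cron, Task Scheduler, or an orchestrator, and the daily report runs itself.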
🚀 Mastering Pandas = Mastering Data Analysis.

If you're stepping into Data Analytics, this cheat sheet is your best friend 💡

Here are some must-know Pandas functions that every analyst should have at their fingertips:

🔹 Data Loading
`read_csv()` | `read_excel()`

🔹 Quick Exploration
`head()` | `info()` | `describe()` | `shape`

🔹 Data Cleaning
`isnull()` | `dropna()` | `fillna()` | `drop_duplicates()`

🔹 Data Transformation
`rename()` | `astype()` | `apply()`

🔹 Data Analysis
`groupby()` | `pivot_table()` | `value_counts()`

🔹 Data Selection
`loc[]` | `iloc[]` | `query()`

🔹 Data Merging
`merge()` | `concat()`

💥 Pro Tip: Don't just memorize; practice on real datasets. That's where real learning happens.

📊 Pandas is not just a library… it's the backbone of modern data analysis. If you're serious about becoming a Data Analyst or Data Engineer, start mastering these today.

👉 Which Pandas function do you use the most? 👇 Drop it in the comments!

🔁 Repost if this helps
👍 Like for more such content
📌 Follow me for daily Data Analytics tips

#Pandas #Python #DataAnalytics #DataScience #Learning #CareerGrowth #DataEngineer #ExcelToPython
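Here is a minimal sketch chaining several of these together on a hypothetical sales.csv, just to show how the pieces combine:

```python
import pandas as pd

df = pd.read_csv("sales.csv")                    # Data Loading (hypothetical file)
df.info()                                        # Quick Exploration

df = df.drop_duplicates()                        # Data Cleaning
df["amount"] = df["amount"].fillna(0)
df = df.rename(columns={"cust": "customer"})     # Data Transformation

# Data Analysis + Data Selection in one readable chain
top = (df.query("amount > 0")
         .groupby("customer")["amount"]
         .sum()
         .sort_values(ascending=False)
         .head(10))
```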