*** 👇 Behind the scenes of write_pandas 👇 ***

Most of us use write_pandas in Snowflake like this:
👉 Pass a DataFrame → data gets loaded into a table

Simple, right?

🔍 What's actually happening behind the scenes
It's not a direct insert. The process is more like a mini pipeline:

1️⃣ DataFrame → file conversion
Your DataFrame is first written out to files (Parquet, in practice)

2️⃣ Upload to stage
Those files are uploaded to a temporary/internal stage

3️⃣ COPY INTO execution
Snowflake runs a COPY INTO command to load the data from the stage into the table

4️⃣ Cleanup
The temporary files are cleaned up after loading

👉 A rough sketch of this flow is below.

🚨 Why this matters
Understanding this helped me debug issues like:
• Permission errors (stage access required)
• Performance bottlenecks
• Unexpected failures during bulk uploads

💡 Key insight
write_pandas is not just a function…
👉 It's an abstraction over a file upload + COPY pipeline

Lesson: When debugging, don't just look at the function…
👉 Look at what's happening underneath.

Have you explored what happens behind the scenes of the tools you use?

#Snowflake #DataEngineering #Python #Learning #Debugging #Cloud
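A minimal sketch of roughly what this flow looks like when spelled out with plain connector calls, assuming snowflake-connector-python and pyarrow are installed; the connection details, MY_TABLE, and the file handling are illustrative placeholders, not the library's exact internals.

```python
import os
import tempfile

import pandas as pd
import snowflake.connector

df = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["a", "b", "c"]})

# Placeholder connection details
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="MY_SCHEMA",
)
cur = conn.cursor()

# 1) DataFrame -> local Parquet file
tmp_dir = tempfile.mkdtemp()
parquet_path = os.path.join(tmp_dir, "chunk_0.parquet")
df.to_parquet(parquet_path, index=False)

# 2) Upload the file to the table's internal stage (@%MY_TABLE)
cur.execute(f"PUT file://{parquet_path} @%MY_TABLE OVERWRITE = TRUE")

# 3) COPY INTO the target table from the stage,
# 4) and purge the staged file afterwards (cleanup)
cur.execute(
    "COPY INTO MY_TABLE "
    "FROM @%MY_TABLE "
    "FILE_FORMAT = (TYPE = PARQUET) "
    "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE "
    "PURGE = TRUE"
)

cur.close()
conn.close()
```

In day-to-day use you would simply call write_pandas(conn, df, "MY_TABLE") and let the connector manage the temporary files and the stage for you; the point of the sketch is only to show where stage permissions and file handling enter the picture.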
Worked on a task recently where I built a simple data pipeline end-to-end 👇

Here's what I did:

🔹 Built a Streamlit app
Allowed users to upload Excel files

🔹 Performed data cleaning
Handled headers, null values, and formatting

🔹 Converted the data to CSV
Prepared it for further processing

🔹 Uploaded the data to AWS S3
Used Python to simulate a real-world data ingestion flow (see the sketch below)

🔹 Loaded the data into Snowflake ❄️
Made it ready for analysis and querying

---

💡 What I learned:
Building even a simple pipeline gives a much better understanding than just learning concepts.
It connects everything — from data ingestion to storage to analysis.

---

Still improving and exploring more real-world use cases 🚀

Would love to know — what kind of data projects are you currently working on? 🤔

#DataEngineering #Snowflake #AWS #Streamlit #LearningInPublic
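A minimal sketch of the cleaning and S3 upload steps described above, assuming boto3 is configured with valid AWS credentials; the Excel file name, bucket, and object key are illustrative placeholders rather than the actual project code.

```python
import io

import boto3
import pandas as pd

# Excel file received from the Streamlit uploader (placeholder path)
df = pd.read_excel("uploaded_file.xlsx")

# Basic cleaning: normalize headers and drop fully empty rows
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
df = df.dropna(how="all")

# Convert to CSV in memory for further processing
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Upload to S3 as the landing zone for the Snowflake load
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-ingestion-bucket",
    Key="landing/cleaned_data.csv",
    Body=buffer.getvalue(),
)
```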
*** How we mitigated write_pandas issues in Snowflake 👇 ***

Faced an issue where write_pandas was not working…
Even with full access on the target table.

🔍 Root cause
write_pandas doesn't insert directly into the table.
👉 It uses an internal flow:
• Upload the data to a stage
• Run COPY INTO
So stage access becomes mandatory.

⚙️ How we mitigated it (a sketch of both options follows below)

✅ Option 1 — Grant the required access (recommended)
Provide the necessary permissions on the stage:
• USAGE
• WRITE
👉 This allows write_pandas to work as expected.

⚠️ Option 2 — Workaround (when access is restricted)
If direct stage access is not allowed due to security constraints:
• Create a separate (dummy) database & schema
• Grant USAGE + WRITE on a stage there
• Create a temporary table
• Upload the data into the temp table using write_pandas
• Use COPY INTO / INSERT to move the data into the target table

💡 Key insight
Sometimes the issue is not the tool…
👉 It's the access layer behind it.

Lesson: Understand the internal workflow before deciding on the solution.

Have you used any alternative approach for bulk uploads in Snowflake?

#Snowflake #DataEngineering #Python #Cloud #Debugging #Learning
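A minimal sketch of the two options, assuming an open connection (conn) and cursor (cur) from snowflake-connector-python; the role, database, schema, stage, and table names are placeholders, and the exact privileges you need depend on whether write_pandas creates its own temporary stage or you point it at a named one.

```python
from snowflake.connector.pandas_tools import write_pandas

# Option 1: grant the privileges write_pandas needs on the staging schema
cur.execute("GRANT USAGE ON DATABASE STAGE_DB TO ROLE LOAD_ROLE")
cur.execute("GRANT USAGE ON SCHEMA STAGE_DB.STAGING TO ROLE LOAD_ROLE")
cur.execute("GRANT CREATE STAGE, CREATE TABLE ON SCHEMA STAGE_DB.STAGING TO ROLE LOAD_ROLE")
# For a named internal stage, READ/WRITE are the relevant stage privileges
cur.execute("GRANT READ, WRITE ON STAGE STAGE_DB.STAGING.LOAD_STAGE TO ROLE LOAD_ROLE")

# Option 2: workaround when the target schema's stage access is restricted.
# Land the data in a schema you do control, then move it with plain SQL.
write_pandas(
    conn, df,
    table_name="TMP_LOAD",
    database="STAGE_DB", schema="STAGING",
    auto_create_table=True,
)
cur.execute(
    "INSERT INTO PROD_DB.CORE.TARGET_TABLE "
    "SELECT * FROM STAGE_DB.STAGING.TMP_LOAD"
)
```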
Why 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗔𝘂𝘁𝗼 𝗟𝗼𝗮𝗱𝗲𝗿 is my preferred choice for scalable data ingestion

When your pipelines deal with millions of files, manually tracking processed data does not scale. It adds complexity, creates fragile workflows, and turns ingestion into a maintenance problem.

That is where Databricks Auto Loader stands out. It is built to automatically detect and ingest new files with minimal setup, whether the source data is CSV, JSON, Parquet, or Avro. Instead of writing custom logic to monitor directories and track file state, you can focus on building reliable pipelines.

A few features I find especially useful (a rough example follows below):

✅ File type filtering
When the source location contains mixed file formats, Auto Loader lets you process only the ones you need. That means less noise and cleaner ingestion.

✅ Glob pattern directory filtering
It can read across multiple subfolders without hardcoding every path, which makes pipelines much easier to maintain as directory structures grow.

✅ cloudFiles.cleanSource options
Managing the landing zone becomes simpler with cleanup options that fit different needs:
• OFF keeps files as they are
• DELETE removes files after the retention period
• MOVE archives files to another location

For large-scale ingestion, this combination of flexibility and automation saves a lot of operational effort.

Have you used Auto Loader in production? What feature or use case has been most valuable for you?

#Databricks #AutoLoader #DataEngineering #BigData #ETL #DataPipelines #CloudEngineering #ApacheSpark #AzureDatabricks #CareerGrowth #TechInterviews #Naukri #sql #python
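A minimal Auto Loader sketch touching the three features above, assuming a Databricks runtime where the cloudFiles source and the cleanSource options are available; the S3 paths, checkpoint location, target table, and the cleanSource sub-option name are assumptions/placeholders to illustrate the shape of the pipeline.

```python
# Incremental ingestion with Auto Loader (cloudFiles)
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                 # file type filtering
    .option("pathGlobFilter", "*.json")                  # ignore other formats in the folder
    .option("cloudFiles.cleanSource", "MOVE")            # OFF | DELETE | MOVE
    .option("cloudFiles.cleanSource.moveDestination", "s3://my-bucket/archive/")
    .load("s3://my-bucket/landing/*/events/")            # glob across subfolders
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)                          # process all new files, then stop
    .toTable("bronze.events")
)
```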
I'm excited to share the updated PySpark Cheatsheet v3.0 ⚡✨
This version is built to make your learning faster and your daily work smoother 🧠💡

---

🔷 What's improved 🔹
→ Clearer examples 📚
→ Practical, real-use references 📌
→ Faster lookup while working 💻
→ Better structure for quick understanding 🔗

---

🔷 Why this matters 🔸
PySpark isn't about memorizing syntax ❌
It's about writing efficient, scalable transformations ⚙️📈
This cheatsheet helps you focus on what actually matters 🎯

---

🔷 How to use it 🪝
→ Keep it open while coding 💻
→ Use it as a quick reference 📜
→ Practice alongside real datasets 📊
→ Revisit it often 🔖

---

Small improvements every day lead to big gains 🚀💫
Use this to speed up your workflow and build confidence 💪✨

---

Save this 🔖📌
Start applying it today 🎯

---

#pyspark #dataengineering #bigdata #spark #databricks #learningjourney #cheatsheet
https://lnkd.in/gGbcr97X
Swipe through the slides first 👉 then read below 👇

🚀 Day 30 of 30 — 30 Days of PySpark. Complete.

30 days ago I couldn't explain what a partition was.
Today I built 3 production-grade pipelines and published them on GitHub.
Here's my honest 30-day review 👇

📅 What I built in 30 days
3 real pipelines:
→ Day 15: Sales analysis pipeline (CSV → Parquet)
→ Day 24: Log file analysis pipeline (raw logs → Delta)
→ Day 29: E-commerce capstone (multi-source → Delta + tests)
30 LinkedIn posts documenting every step publicly.

📊 The full roadmap in one place
Week 1 — Foundations
SparkSession → RDDs → DataFrames → Lazy evaluation
Columns & expressions → Read/write files

Week 2 — Querying
Filtering → Aggregations → Window functions
Joins → Strings & dates → Null handling

Week 3 — Engineering
Mini project → Spark SQL → UDFs
Partitions → Caching → Schema management

Week 4 — Real world
Databases → Delta Lake → Log pipeline
Error handling → Testing → Cloud deployment

✅ My honest reflection
Hardest concept overall → Window functions (Day 10)
Most underrated skill → Error handling (Day 25)
Most career-relevant → Delta Lake (Day 23)
Biggest surprise → How similar local and cloud code is
Time per day → ~1–1.5 hours

What I'd do differently:
→ Build the first pipeline on Day 7, not Day 15
→ Write tests from Day 1, not Day 26
→ Deploy to Databricks earlier in the journey

🚀 What's next
→ Structured Streaming — process data as it arrives
→ MLlib — machine learning at scale with PySpark
→ Databricks Certified Associate Developer exam
→ Building a real-time dashboard on top of Delta Lake

💡 My final takeaway
PySpark isn't just a tool. It's a way of thinking — distributed, lazy, parallel.
Once that mental model clicks, everything else falls into place.

30 posts. 3 pipelines. 0 days wasted.

❓ If you followed this journey — what was the most useful day for you?
Drop it in the comments 👇

Save this post as your complete PySpark reference. 🔖

#PySpark #DataEngineering #BigData #Python #LearnInPublic #30DaysOfPySpark #DataScience #DataEngineer
𝗣𝘆𝗦𝗽𝗮𝗿𝗸 𝘃𝘀 𝗦𝗽𝗮𝗿𝗸 𝗦𝗤𝗟: 𝗜 𝗱𝗼𝗻'𝘁 𝗰𝗵𝗼𝗼𝘀𝗲 𝗯𝗮𝘀𝗲𝗱 𝗼𝗻 𝗽𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲, 𝗜 𝗰𝗵𝗼𝗼𝘀𝗲 𝗯𝗮𝘀𝗲𝗱 𝗼𝗻 𝘁𝗵𝗲 𝗷𝗼𝗯.

It's easy to turn this into a "which is better" debate. In practice, both are useful, just for different reasons.

And one thing is often misunderstood: Spark doesn't execute "Python" or "SQL" the way people think. It executes a 𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗽𝗹𝗮𝗻 -> 𝗼𝗽𝘁𝗶𝗺𝗶𝘀𝗲𝗱 𝗽𝗹𝗮𝗻 -> 𝗽𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗽𝗹𝗮𝗻.

So a lot of the time, the real difference isn't performance, it's 𝗵𝗼𝘄 𝗰𝗹𝗲𝗮𝗿𝗹𝘆 𝘆𝗼𝘂 𝗲𝘅𝗽𝗿𝗲𝘀𝘀 𝗶𝗻𝘁𝗲𝗻𝘁 𝗮𝗻𝗱 𝗵𝗼𝘄 𝗺𝗮𝗶𝗻𝘁𝗮𝗶𝗻𝗮𝗯𝗹𝗲 𝘁𝗵𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗶𝘀.

𝗪𝗵𝗲𝗻 𝗦𝗽𝗮𝗿𝗸 𝗦𝗤𝗟 𝘄𝗶𝗻𝘀
• The work is mostly select, join, filter, aggregate
• The logic needs to be readable by more people (analysts + engineers)
• I want quick iteration and clear intent
• Performance tuning is easier because the query shape is obvious

𝗪𝗵𝗲𝗻 𝗣𝘆𝗦𝗽𝗮𝗿𝗸 𝘄𝗶𝗻𝘀
• I need custom logic that's awkward in SQL
• Complex parsing, nested structures, arrays/maps, JSON-heavy work
• Reusable functions and cleaner code structure (modules, unit tests)
• Integration steps around the transformation (validation, file handling, etc.)

𝗧𝗵𝗲 𝗿𝗲𝗮𝗹 𝘁𝗿𝗮𝗱𝗲 𝗼𝗳𝗳
• SQL usually optimizes for clarity.
• PySpark usually optimizes for flexibility.

𝗧𝗵𝗲 𝗯𝗲𝘀𝘁 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 𝗜'𝘃𝗲 𝘀𝗲𝗲𝗻 (sketched below)
• Use SQL for the core transformations (joins/aggregations)
• Use PySpark for the edges (validation, enrichment, complex business rules)
• Keep one "source of truth" so business logic doesn't get duplicated

Takeaway: Choosing PySpark vs Spark SQL isn't a style choice. It's a maintainability and delivery choice.

Drop your go-to rule for choosing between them in the comments.

#PySpark #SparkSQL #DataEngineering #Databricks #BigData #SQL #AnalyticsEngineering #DataPipelines
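A minimal sketch of that pattern, with placeholder table names and a made-up validation rule: SQL carries the core join/aggregation, PySpark handles the edge logic, and both compile into the same logical → optimized → physical plan.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Core transformation: joins/aggregations expressed as readable SQL.
# Assumes "orders" and "customers" are registered tables or views.
daily_revenue = spark.sql("""
    SELECT o.order_date, c.region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY o.order_date, c.region
""")

# Edges: validation and enrichment in PySpark, where custom logic,
# reusable functions, and unit tests are easier to structure.
validated = (
    daily_revenue
    .filter(F.col("revenue") >= 0)                 # placeholder validation rule
    .withColumn("load_ts", F.current_timestamp())  # enrichment
)

# Same engine underneath: inspect the single optimized plan.
validated.explain()
```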
🚀 Built My First Production-Style Data Pipeline with Apache Airflow

Over the past few days, I've been working on turning a simple data script into something closer to a production-ready ETL pipeline — and I wanted to share what I learned.

🔧 What I built (a rough DAG skeleton follows below):
• File ingestion using the Airflow FileSensor
• Data validation (empty checks + required columns)
• Data cleaning & transformation (postcode extraction + normalization)
• Row-level validation using pandas masks
• Separation of valid vs invalid data
• Logging & tracking of failed records
• Deduplication based on a business key
• Archiving of the original input files after processing

📊 Key takeaway:
It's not just about "making the code work" — it's about making it:
• Reliable
• Traceable
• Maintainable

💡 Some lessons learned:
• Don't trust raw data — always validate
• Never silently drop bad data — log it
• Separate pipeline stages clearly (load → validate → transform → archive)
• Small details (like file paths, logging, and error handling) matter a lot in real systems

Next step for me:
➡️ Integrate this pipeline with AWS (S3) to simulate real-world cloud workflows

Really enjoyed this process — starting to understand how real data engineering workflows are designed.

#DataEngineering #ApacheAirflow #Python #Pandas #ETL #LearningJourney #SoftwareDevelopment
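A minimal sketch of the DAG skeleton this kind of pipeline could use, assuming Airflow 2.4+ with the default "fs_default" filesystem connection; the paths, column names, postcode rule, and business key are illustrative placeholders, not the actual project code.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def validate_and_split(path="/data/incoming/input.csv"):
    df = pd.read_csv(path)
    required = ["id", "postcode"]
    if df.empty or not set(required).issubset(df.columns):
        raise ValueError("empty file or missing required columns")

    # Row-level validation with a pandas mask (placeholder postcode rule)
    valid_mask = df["postcode"].astype(str).str.match(r"^\d{4,6}$", na=False)

    # Keep valid rows, deduplicated on the business key
    df[valid_mask].drop_duplicates(subset="id").to_csv(
        "/data/valid/output.csv", index=False
    )
    # Never silently drop bad data: write rejects to their own file
    df[~valid_mask].to_csv("/data/rejected/failed_rows.csv", index=False)


with DAG(
    dag_id="file_ingestion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Wait for the input file to land (uses the fs_default connection)
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/input.csv",
        poke_interval=60,
    )
    process = PythonOperator(
        task_id="validate_transform",
        python_callable=validate_and_split,
    )

    wait_for_file >> process
```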
🚀 Building Data Pipelines with Apache Airflow + Docker

After working with SQL, Pandas, and PySpark for data processing, I took the next step by orchestrating end-to-end data pipelines using Apache Airflow.

🔧 What I implemented (a minimal example follows below):
• Designed DAGs with task dependencies (start → process → end)
• Executed Python-based data processing scripts using the BashOperator
• Organized projects into pipeline, script, and data layers
• Set up Airflow with the LocalExecutor using Docker (webserver + scheduler)

🧠 Key learnings:
• How Airflow orchestrates end-to-end data workflows
• The role of the Scheduler vs the Webserver in execution
• Docker volume mapping (local vs container paths)
• Debugging real-world issues like a missing scheduler, path mismatches, and log visibility

This wasn't just about writing code — it was about understanding how systems behave in real environments and solving practical issues.

🚀 Next steps:
• Build a complete ETL pipeline (extract → transform → load)
• Integrate PySpark jobs with Airflow

#DataEngineering #Airflow #PySpark #SQL #Docker #Python #ETL #LearningByDoing
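A minimal sketch of the start → process → end pattern, assuming Airflow 2.4+ running in Docker; the script path is a placeholder and must point at a directory that is mapped into the container as a volume.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="bash_processing_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    process = BashOperator(
        task_id="process",
        # /opt/airflow/scripts is the container-side path of a mounted volume
        bash_command="python /opt/airflow/scripts/process_data.py",
    )
    end = EmptyOperator(task_id="end")

    start >> process >> end
```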
Before you scale your stack, fix your foundation.

Learn SQL before Snowflake
Learn Python before Databricks
Learn Data Warehousing before dbt
Learn File Formats before Data Lakes
Learn Batch Processing before Streaming

Because without fundamentals, tools don't make you effective, they make you dependent.

What I see across teams:
* dbt without data warehousing → turns into a black box
* Data lakes without file format understanding → become data swamps
* Snowflake without SQL → just an expensive UI
* Databricks without Python → same story

Reality check: Tools evolve every year. Fundamentals haven't changed in decades.

The teams that win are not tool experts. They are first-principles thinkers who understand how data actually works.

At Epoc Labs, we've seen this firsthand: the biggest breakthroughs don't come from adding new tools. They come from fixing what should've been understood from day one.

If you're building data infrastructure or modernizing reporting, this mindset will save you years of time.
𝐃𝐚𝐲 𝟏𝟕/𝟐𝟓 𝐨𝐟 #𝟐𝟓𝐃𝐚𝐲𝐬𝐎𝐟𝐒𝐩𝐚𝐫𝐤 — 𝐔𝐬𝐞𝐫-𝐃𝐞𝐟𝐢𝐧𝐞𝐝 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬 (𝐔𝐃𝐅𝐬) 𝐢𝐧 𝐏𝐲𝐒𝐩𝐚𝐫𝐤

Today's focus was on extending Spark's capabilities using UDFs when built-in functions just aren't enough. But with great power comes great responsibility…

🔍 𝘾𝙤𝙫𝙚𝙧𝙚𝙙 𝙩𝙤𝙥𝙞𝙘𝙨:
• 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐚 𝐔𝐃𝐅 𝐚𝐧𝐝 𝐰𝐡𝐞𝐧 𝐭𝐨 𝐮𝐬𝐞 𝐢𝐭?
• 𝐃𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧:
  ✅ 𝐒𝐜𝐚𝐥𝐚𝐫 𝐔𝐃𝐅 (𝐫𝐨𝐰-𝐰𝐢𝐬𝐞)
  ✅ 𝐏𝐚𝐧𝐝𝐚𝐬 𝐔𝐃𝐅 (𝐯𝐞𝐜𝐭𝐨𝐫𝐢𝐳𝐞𝐝, 𝐦𝐨𝐫𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐭)
• 𝐖𝐡𝐞𝐧 𝐍𝐎𝐓 𝐭𝐨 𝐮𝐬𝐞 𝐔𝐃𝐅𝐬
• 𝐖𝐡𝐲 𝐔𝐃𝐅𝐬 𝐦𝐢𝐠𝐡𝐭 𝐜𝐚𝐮𝐬𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐢𝐬𝐬𝐮𝐞𝐬
• 𝐇𝐚𝐧𝐝𝐬-𝐨𝐧 𝐜𝐨𝐝𝐞 𝐞𝐱𝐚𝐦𝐩𝐥𝐞𝐬 𝐟𝐨𝐫 𝐞𝐚𝐜𝐡 𝐭𝐲𝐩𝐞 (a sketch follows below)

🧪 Highlight: Using Pandas UDFs can significantly improve performance by leveraging Apache Arrow under the hood! But remember: if your use case can be solved with built-in functions, always prefer those over UDFs.

🔜 Tomorrow (Day 18) we'll explore: Built-in Spark SQL functions, your go-to toolbox before reaching for a UDF.

📥 Want more code snippets, job updates, and premium notes?
📢 𝗪𝗵𝗮𝘁𝘀𝗔𝗽𝗽 𝗖𝗵𝗮𝗻𝗻𝗲𝗹 https://lnkd.in/gh-PArM6
📲 𝗙𝗼𝗿 𝟭:𝟭 𝗰𝗼𝗻𝗻𝗲𝗰𝘁 https://lnkd.in/gi4Nvukd

🔗 𝐋𝐄𝐀𝐑𝐍𝐈𝐍𝐆 𝐑𝐄𝐒𝐎𝐔𝐑𝐂𝐄𝐒
🔥 𝐏𝐲𝐒𝐩𝐚𝐫𝐤 𝐌𝐚𝐬𝐭𝐞𝐫 𝐏𝐚𝐜𝐤 https://lnkd.in/gefBKgq5 (🎟 Code: PYSPARK10)
🔥 𝐂𝐨𝐦𝐩𝐥𝐞𝐭𝐞 𝐒𝐐𝐋 (𝐖𝐢𝐭𝐡 𝐃𝐖 & 𝐃𝐌) 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰 𝐏𝐚𝐜𝐤 https://lnkd.in/gABP4VzP (🎟 Code: SQL10)
🔥 𝐃𝐚𝐭𝐚𝐛𝐫𝐢𝐜𝐤𝐬 𝐄𝐧𝐝-𝐭𝐨-𝐄𝐧𝐝 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐏𝐚𝐜𝐤 https://lnkd.in/gQXyKy8U (🎟 Code: EARLYBIRDS15)
🎯 𝐒𝐐𝐋 + 𝐏𝐲𝐒𝐩𝐚𝐫𝐤 + 𝐃𝐚𝐭𝐚𝐛𝐫𝐢𝐜𝐤𝐬 (𝟑-𝐢𝐧-𝟏 𝐁𝐮𝐧𝐝𝐥𝐞) https://lnkd.in/gy-MziZf (🎟 Code: DATAMASTERY10)
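A minimal sketch contrasting the two UDF styles next to a built-in function, assuming PySpark with pyarrow installed so the vectorized pandas UDF can use Apache Arrow; the data and function names are illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])


# Scalar UDF: called row by row, with Python <-> JVM serialization per row
@udf(returnType=StringType())
def title_case_scalar(name):
    return name.title()


# Pandas UDF: called on whole batches (pandas Series) via Apache Arrow
@pandas_udf(StringType())
def title_case_vectorized(names: pd.Series) -> pd.Series:
    return names.str.title()


# Prefer built-ins when they exist: F.initcap does the same job natively
df.select(
    title_case_scalar("name").alias("scalar_udf"),
    title_case_vectorized("name").alias("pandas_udf"),
    F.initcap("name").alias("built_in"),
).show()
```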