Sessions 1–5 of the Data Analyst Bootcamp with SQL & Python in Google Platform — Done! In the first five sessions of the Data Analyst bootcamp organized by DQLab, I learned SQL fundamentals using Google BigQuery. The topics covered include database basics, core SQL concepts, and the differences between DDL & DML. I also practiced building basic queries (SELECT) and working with clauses such as WHERE, ORDER BY, and GROUP BY, along with aggregation functions and HAVING. Additionally, I explored date functions, number formatting, and advanced techniques like subqueries, CTEs, and JOINs. Excited to continue with the next sessions and keep growing in data analytics 🚀 Huge thanks to Kak Shella Theresya Pandiangan for the guidance and insightful explanations throughout this journey 🙏 #DataAnalyst #SQL #GoogleBigQuery #DQLab #LearningJourney
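As a rough illustration of how those pieces fit together, here is a minimal sketch of one BigQuery query run from Python with the google-cloud-bigquery client. The dataset and table names (shop.orders, shop.customers) and the credentials setup are hypothetical placeholders, not from the bootcamp material.

from google.cloud import bigquery

client = bigquery.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is already configured

query = """
WITH order_totals AS (                    -- CTE: pre-aggregate order values per customer
  SELECT customer_id, SUM(amount) AS total_spent
  FROM `shop.orders`
  WHERE order_date >= '2024-01-01'        -- WHERE filters rows before aggregation
  GROUP BY customer_id
  HAVING SUM(amount) > 100                -- HAVING filters the aggregated groups
)
SELECT c.name, t.total_spent
FROM order_totals AS t
JOIN `shop.customers` AS c                -- JOIN brings in the customer names
  ON c.customer_id = t.customer_id
ORDER BY t.total_spent DESC
"""

for row in client.query(query).result():   # runs the job and iterates the result rows
    print(row.name, row.total_spent)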
🚀 Day 56 – Data Collection in Data Science

Today I focused on one of the most important steps in any Data Science project — Data Collection 📊
💡 Without quality data, even the best model won’t perform well.

🔹 Ways to Collect Data:
✔️ APIs – fetch structured and real-time data from servers using tools like Python requests
✔️ Web Scraping – extract data from websites using BeautifulSoup and Selenium
✔️ Databases – access stored data from SQL (MySQL, PostgreSQL) or NoSQL (MongoDB)
✔️ Open Datasets – use platforms like Kaggle for ready-made datasets
✔️ Surveys & Forms – collect custom data using tools like Google Forms
✔️ Logs & Tracking – analyze user behavior from website/app logs

⚖️ Key Insight:
API → clean & reliable data
Scraping → useful when an API is not available

🔥 What I Realized: Data Collection is not just gathering data — it’s about collecting the right data for your problem.

📈 Next Step: Data Cleaning & Preprocessing

#Day56 #DataScience #DataCollection #Python #MachineLearning #WebScraping #API #LearningJourney #100DaysOfCode
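A minimal sketch of the API route mentioned above, using the requests library. The endpoint URL, the "items" wrapper field, and the function name are made-up placeholders.

import requests

def fetch_products(base_url, page=1):
    """Fetch one page of structured JSON data from a (hypothetical) REST API."""
    response = requests.get(f"{base_url}/products", params={"page": page}, timeout=10)
    response.raise_for_status()          # fail fast on HTTP errors
    return response.json()["items"]      # assumes the API wraps results in an "items" list

if __name__ == "__main__":
    rows = fetch_products("https://api.example.com")
    print(f"Collected {len(rows)} records")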
🔄 From Pandas to PySpark — One Cheat Sheet to Rule Them All!

Navigating between different data tools can be overwhelming, especially when switching between Pandas, Polars, SQL, and PySpark. This handy comparison simplifies everyday data operations like:
✔ Reading data
✔ Filtering & sorting
✔ Joins & aggregations
✔ Handling missing values
✔ Grouping & transformations

💡 Whether you're a beginner in data analytics or transitioning into big data tools, understanding these parallels helps you:
Learn faster 🚀
Work smarter 💡
Adapt across technologies 🔁

In today’s data-driven world, flexibility across tools is a superpower!

📌 Save this for quick reference and level up your data skills.

#DataAnalytics #DataScience #Python #Pandas #PySpark #SQL #Polars #BigData #DataEngineering #Learning #CareerGrowth #AnalyticsJourney #DataTools
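A small side-by-side sketch of the "same operation, different tool" idea for Pandas and PySpark. The DataFrame contents and column names (region, amount) are made up for illustration.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cheatsheet").getOrCreate()

pdf = pd.DataFrame({"region": ["N", "S", "N"], "amount": [10, 20, 30]})
sdf = spark.createDataFrame(pdf)

# Filtering
pdf_filtered = pdf[pdf["amount"] > 15]
sdf_filtered = sdf.filter(F.col("amount") > 15)

# Grouping & aggregation
pdf_grouped = pdf.groupby("region")["amount"].sum()
sdf_grouped = sdf.groupBy("region").agg(F.sum("amount").alias("amount"))

# Sorting
pdf_sorted = pdf.sort_values("amount", ascending=False)
sdf_sorted = sdf.orderBy(F.col("amount").desc())

sdf_grouped.show()   # Spark only executes when an action like show() is called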
Day 65 of my Data Engineering journey 🚀

Today I learned about joining datasets in Spark — a critical operation in real-world data pipelines.

📘 What I learned today (Joins in Apache Spark):
• Performing joins using join()
• Types of joins: inner, left, right, outer
• Joining on one or multiple keys
• Handling duplicate columns after joins
• Understanding shuffle during joins
• Optimizing joins for large datasets
• Broadcast joins for smaller tables
• Comparing Spark joins with SQL joins

In real data systems, data is spread across multiple sources. Joining datasets is how we bring everything together. But in Spark, joins are not just about logic; they are also about performance.

Poor joins = slow pipelines. Optimized joins = scalable systems.

Why I’m learning in public:
• To stay consistent
• To build accountability
• To improve daily

Day 65 done ✅ Next up: Spark optimization techniques 💪

#DataEngineering #BigData #ApacheSpark #Python #LearningInPublic #CareerGrowth #Consistency
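A minimal sketch of the join patterns listed above, including a broadcast join for a small lookup table. The DataFrame contents and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("joins").getOrCreate()

orders = spark.createDataFrame(
    [(1, "C1", 100.0), (2, "C2", 50.0), (3, "C1", 75.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("C1", "Alice"), ("C2", "Bob")],
    ["customer_id", "name"],
)

# Inner join on a single key; passing the column name (instead of a condition)
# avoids a duplicate customer_id column in the result.
joined = orders.join(customers, on="customer_id", how="inner")

# Broadcast join: ship the small customers table to every executor so the
# larger orders table does not need to be shuffled.
joined_broadcast = orders.join(F.broadcast(customers), on="customer_id", how="left")

joined_broadcast.show()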
People ask me why I use PySpark when SQL can do most things.

Honest answer: SQL does do most things. If I can write a clean SQL transformation that runs in under a minute, I'm not reaching for Spark. But there's a point where SQL stops being the right tool, and you feel it immediately. Multi-step transformations where intermediate results are too large to join cleanly. Data quality logic that needs conditional branching across 15 columns. Deduplication across 50 million records where a window function in SQL is timing out. That's when PySpark earns its place.

The thing about PySpark that takes time to actually internalize: it's lazy. You can chain 10 transformations and nothing has run yet. Spark is building a plan. It only executes when you force it - a write, a count, a show.

What this means in practice: the line that throws the error is almost never where the problem is. The error surfaces at execution. The bug is three transformations back. I spent more time than I'd like to admit debugging the wrong line before I understood this properly.

The way I work through it now: add a count() after each major transformation step while debugging. Force execution at each stage. Slow to run, but you isolate the problem in one pass instead of five.

PySpark rewards the engineers who understand what's happening underneath. It punishes the ones who treat it like fast SQL.

#dataengineering #pyspark #spark #python #etl #dataengineer #awsglue
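A minimal sketch of that debugging approach: force execution after each major step with count() so the error surfaces near its cause. The input path and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-debug").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")           # nothing runs yet: Spark only builds a plan

step1 = df.filter(F.col("event_type") == "purchase")
print("after filter:", step1.count())                    # action: forces execution up to this step

step2 = step1.withColumn("amount", F.col("amount").cast("double"))
print("after cast:", step2.count())                      # if the cast is the bug, it shows up here

step3 = step2.dropDuplicates(["order_id"])
print("after dedup:", step3.count())                     # isolate the failing step in one pass

step3.write.mode("overwrite").parquet("s3://bucket/clean/")   # the real pipeline's final action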
🚀 Data Engineering Tip: Data Partition Pruning (Easy Performance Win)

Want faster queries without changing much code? 👉 Learn Partition Pruning 👇

💡 What is it? Only the required partitions are scanned instead of the full dataset.

📊 Example: Table partitioned by date
Query: 👉 Get data for 2026-04-01
❌ Without pruning → scans the full table
✅ With pruning → scans only that date partition

🎯 Why it matters:
✔️ Faster queries
✔️ Lower compute cost
✔️ Better performance in Spark, Hive, Snowflake

🛠️ Where it’s used: Spark | PySpark | Hive | Delta Lake | BigQuery

🚀 Tech Stack: Python | SQL | Spark | PySpark | Delta Lake | Kafka | Airflow

💡 Pro Tip: Always filter on partition columns in your queries

👉 Are you using partition pruning in your queries? Comment “YES” or “LEARNING” 👇

#DataEngineering #BigData #PerformanceTuning #Spark #SQL #DeltaLake #ETL #DataPipelines #TechLearning
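A minimal sketch of partition pruning with PySpark on a date-partitioned Parquet table. The storage paths and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pruning").getOrCreate()

# Write: partition the table by the date column.
events = spark.read.parquet("s3://bucket/raw/events/")
events.write.partitionBy("event_date").mode("overwrite").parquet("s3://bucket/events_partitioned/")

# Read: filtering on the partition column lets Spark scan only that folder
# (event_date=2026-04-01) instead of the full dataset.
one_day = (
    spark.read.parquet("s3://bucket/events_partitioned/")
         .filter(F.col("event_date") == "2026-04-01")
)
one_day.explain()   # the physical plan lists PartitionFilters when pruning applies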
💡 Data Engineering Tip: Idempotency (Build Pipelines That Don’t Break)

One concept that separates beginners from experienced engineers 👇

👉 Idempotency = same result, even if a job runs multiple times

📊 Why this matters:
Pipelines can fail and rerun
Duplicate data can break reports
Real systems need reliability

❌ Without idempotency: duplicate records, inconsistent data
✅ With idempotency: clean, consistent, reliable outputs

🛠️ How to achieve it:
✔️ Use primary keys / unique constraints
✔️ Implement UPSERT (MERGE) logic
✔️ Deduplication strategies
✔️ Checkpoints in streaming (Kafka/Spark)

🚀 Tech Stack: Python | SQL | Spark | PySpark | Kafka | Airflow | Delta Lake | Snowflake

💡 Pro Tip: Always design pipelines assuming they will rerun

👉 Are your pipelines idempotent? Comment “YES” or “LEARNING” 👇

#DataEngineering #BigData #DataPipelines #ETL #Kafka #Spark #SQL #DataQuality #TechLearning #CareerGrowth
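A minimal sketch of an idempotent load using a Delta Lake MERGE (upsert), keyed on a primary key so reruns update rows instead of duplicating them. The paths and schema are hypothetical; it assumes the delta-spark package and an existing Delta table at the target path.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-load").getOrCreate()

updates = spark.read.parquet("s3://bucket/staging/orders/")     # today's batch (may be re-run)

target = DeltaTable.forPath(spark, "s3://bucket/curated/orders/")   # target must already exist

(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")   # match on the primary key
    .whenMatchedUpdateAll()      # a rerun with the same data just rewrites the same rows
    .whenNotMatchedInsertAll()   # genuinely new rows are inserted exactly once
    .execute()
)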
Love this! The data journey is a universal starting point, and everyone’s path evolves. My early wins came from solid SQL, data modeling, and real datasets to learn governance and quality. Don’t skip version control or documenting decisions—they pay off as projects scale. If you’re starting out, tackle small, impactful projects first. What’s your first data project idea? #DataCulture #DataEngineering #SQL #ETL #DataQuality
Almost every Data Engineer starts their journey with the Pandas library.

And you know what? If you can work with Pandas, PySpark (the coding side) is a piece of cake for you, because at the end of the day you’re still working with DataFrames. Most of the transformations are almost the same.

For example ➡️ Filtering:
- Pandas: df[df['age'] > 25]
- PySpark: df.filter(df['age'] > 25)

See, it’s almost the same. (And yes, I’m talking about PySpark here, not the entire Spark ecosystem.)

▶️ The real difference? Pandas is “eager” (does it now), and PySpark is “lazy” (waits until you ask for the result). Once you get that tiny mental shift down, you are basically a PySpark developer.

If you’ve mastered the Pandas DataFrame, don’t let the “Big Data” tag scare you off. You already have the foundation.

📍 Comment “PySpark” if you want a full notebook with all the transformations listed.
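A tiny runnable sketch of that eager-vs-lazy difference; the DataFrame here is made up for illustration.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eager-vs-lazy").getOrCreate()

pdf = pd.DataFrame({"name": ["Ana", "Ben"], "age": [30, 22]})
sdf = spark.createDataFrame(pdf)

adults_pd = pdf[pdf["age"] > 25]            # Pandas: filtering happens immediately (eager)
adults_spark = sdf.filter(sdf["age"] > 25)  # PySpark: only a plan is recorded (lazy)

print(adults_pd)        # result already materialized
adults_spark.show()     # Spark executes the plan only when an action is called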
🚀 Day 6 of My Data Analyst Journey — Understanding Data with Dictionaries

Today I learned one of the most important data structures in Python for data analysis: Dictionaries 📊 This is where data starts to look like real-world information (key → value pairs).

🧩 What I Learned:
🔹 Dictionaries in Python
Creating & accessing key-value pairs
Modifying data inside dictionaries
Using dictionary methods
Iterating over keys & values
🔹 Advanced Concepts
Nested dictionaries (real-world data structure)
Dictionary comprehensions
Transforming and filtering data

💻 What I Practiced: solved 13+ problems based on real data scenarios, including:
Creating dictionaries with numbers & their squares
Accessing specific keys & values
Adding & removing elements
Iterating through key-value pairs
Creating cubes using dictionary comprehension
Merging two dictionaries
Working with nested dictionaries (student data)
Creating dictionaries of lists & tuples
Converting a dictionary → list of tuples
Filtering even keys
Swapping keys & values
Using default dictionaries

⚙️ Key Realization: Dictionaries are the closest thing to real datasets in Python. They help represent:
👉 Student records
👉 Product data
👉 API responses

📈 Growth Check:
Day 1 → Basics
Day 2 → Conditions
Day 3 → Control Flow
Day 4 → Lists
Day 5 → Tuples & Sets
Day 6 → Dictionaries

Now the foundation for data analysis is getting complete 📈

#DataAnalyticsJourney #PythonLearning #Day6 #DataStructures #LearnInPublic #FutureDataAnalyst
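A few of those exercises, sketched with a made-up student record and a small squares dictionary.

squares = {n: n ** 2 for n in range(1, 6)}           # dict comprehension: {1: 1, 2: 4, ...}

student = {"name": "Asha", "scores": {"math": 88, "sql": 92}}   # nested dictionary
student["scores"]["python"] = 95                     # add a value inside the nested dict

even_keys = {k: v for k, v in squares.items() if k % 2 == 0}    # filter even keys
swapped = {v: k for k, v in squares.items()}         # swap keys and values
merged = {**squares, **even_keys}                    # merge two dictionaries
as_tuples = list(student["scores"].items())          # dictionary → list of (key, value) tuples

for subject, score in student["scores"].items():     # iterate key-value pairs
    print(subject, score)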
Day 67 of my Data Engineering journey 🚀

Today I worked with Spark SQL, combining the simplicity of SQL with the power of distributed processing.

📘 What I learned today (Spark SQL with Apache Spark):
• Creating temporary views from DataFrames
• Running SQL queries on large datasets
• Using spark.sql() for querying
• Performing joins and aggregations using SQL
• Comparing the DataFrame API vs the SQL approach
• Writing cleaner and more readable queries
• Leveraging SQL knowledge in Big Data systems
• Choosing between SQL and DataFrame transformations

Spark SQL makes it easier to work with data using familiar SQL syntax, even at scale. Instead of writing complex code, you can express logic in simple SQL queries. Best part? It runs on distributed systems. SQL + Spark = powerful combination.

Why I’m learning in public:
• To stay consistent
• To build accountability
• To improve daily

Day 67 done ✅ Next up: working with different file formats (Parquet, JSON, CSV) in Spark 💪

#DataEngineering #BigData #ApacheSpark #SQL #Python #LearningInPublic #CareerGrowth #Consistency
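A minimal sketch of that workflow: register a DataFrame as a temporary view, then query it with spark.sql(). The data and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

sales = spark.createDataFrame(
    [("N", 100.0), ("S", 50.0), ("N", 75.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")    # expose the DataFrame to SQL

result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""")

result.show()    # same engine and optimizer as the DataFrame API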
Keep up the spirit, Kak ☺️✨