I recently worked on an ETL pipeline built around UK regional carbon intensity data. The pipeline extracts 24-hour regional data from the Carbon Intensity API, transforms the nested JSON response into a structured tabular format, aggregates the 30-minute interval readings into daily regional summaries, and loads the output into PostgreSQL for analysis.

On the transformation side, the workflow flattens both the carbon intensity values and the generation mix data across fuel sources, then uses Pandas to produce daily region-level metrics. On the database side, the final output is stored in PostgreSQL tables designed for reporting, with date-based partitioning applied to the fact tables to support cleaner storage management and better scalability as the data grows.

The result is a query-ready pipeline that turns raw API responses into structured daily carbon intensity and generation mix tables for downstream analysis and reporting.

Tech used: Python, Pandas, PostgreSQL, SQLAlchemy, YAML

#DataEngineering #ETL #Python #PostgreSQL #SQL #DataPipeline #DatabaseDesign #AnalyticsEngineering
ETL Pipeline for UK Carbon Intensity Data with Python and PostgreSQL
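For readers who want a concrete picture of the flow described above, here is a minimal, hypothetical Python sketch of the extract, transform, aggregate, and load steps. The endpoint, JSON field names, table name, and connection string are assumptions for illustration and are not taken from the author's code; the actual Carbon Intensity API response should be checked against its documentation.

```python
# Hypothetical sketch only: endpoint, field names, and credentials are assumed.
import pandas as pd
import requests
from sqlalchemy import create_engine

API_URL = "https://api.carbonintensity.org.uk/regional/intensity/2024-01-01T00:00Z/fw24h"

def extract() -> dict:
    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    return resp.json()

def transform(payload: dict) -> pd.DataFrame:
    # Flatten the nested response: one row per (30-minute interval, region).
    rows = []
    for interval in payload["data"]:
        for region in interval["regions"]:
            row = {
                "settlement_from": interval["from"],
                "region": region["shortname"],
                "forecast_intensity": region["intensity"]["forecast"],
            }
            # One column per fuel source from the generation mix.
            for fuel in region["generationmix"]:
                row[f"pct_{fuel['fuel']}"] = fuel["perc"]
            rows.append(row)
    df = pd.DataFrame(rows)
    df["date"] = pd.to_datetime(df["settlement_from"]).dt.date
    return df

def aggregate_daily(df: pd.DataFrame) -> pd.DataFrame:
    # Collapse the 30-minute readings into daily regional averages.
    return (
        df.drop(columns=["settlement_from"])
          .groupby(["date", "region"], as_index=False)
          .mean(numeric_only=True)
    )

def load(daily: pd.DataFrame) -> None:
    # Placeholder connection string; the target fact table is assumed to exist.
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/carbon")
    daily.to_sql("daily_regional_intensity", engine, if_exists="append", index=False)

if __name__ == "__main__":
    load(aggregate_daily(transform(extract())))
```

Date-based partitioning of the fact table would be handled in the PostgreSQL DDL rather than in a script like this.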
More Relevant Posts
I recently built a data pipeline that automatically tracks and visualizes real-time weather data. The project follows an ELT (Extract, Load, Transform) workflow to keep data moving quickly and accurately from the source to the final dashboard.

𝗛𝗼𝘄 𝗶𝘁 𝘄𝗼𝗿𝗸𝘀:
• 𝗗𝗮𝘁𝗮 𝗖𝗼𝗹𝗹𝗲𝗰𝘁𝗶𝗼𝗻: A Python script pulls live weather data from an API every 5 minutes.
• 𝗦𝘁𝗼𝗿𝗮𝗴𝗲: The raw data is immediately loaded into a PostgreSQL database.
• 𝗖𝗹𝗲𝗮𝗻𝗶𝗻𝗴 𝗮𝗻𝗱 𝗦𝗼𝗿𝘁𝗶𝗻𝗴: I use dbt to transform raw data into structured tables for analysis:
  • 𝘀𝘁𝗴_𝘄𝗲𝗮𝘁𝗵𝗲𝗿_𝗱𝗮𝘁𝗮: The staging table where raw API data is cleaned, validated, and prepared for further processing.
  • 𝘄𝗲𝗮𝘁𝗵𝗲𝗿_𝗿𝗲𝗽𝗼𝗿𝘁: A refined table designed for real-time monitoring with clear, analysis-ready weather insights.
  • 𝗱𝗮𝗶𝗹𝘆_𝗮𝘃𝗲𝗿𝗮𝗴𝗲: An aggregated table that summarizes daily weather metrics to track trends over time.
• 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗼𝗻: Apache Airflow orchestrates the entire process.
• 𝗟𝗶𝘃𝗲 𝗗𝗮𝘀𝗵𝗯𝗼𝗮𝗿𝗱: Apache Superset displays results with a 5-minute auto-refresh.
• 𝗦𝗲𝘁𝘂𝗽: Fully containerized using Docker for easy deployment.

𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀:
• 𝗡𝗲𝗮𝗿-𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲: Data updates every 5 minutes.
• 𝗥𝗲𝗹𝗶𝗮𝗯𝗹𝗲: Prevents duplicates and ensures high-quality data.
• 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁: ELT enables scalable transformations inside the database.

This project helped me build a complete, automated data system from scratch.

#DataEngineering #ELT #Python #SQL #Airflow #Docker #DataPipeline #WeatherUpdate
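As a rough illustration of the extract-and-load step described above (the part a 5-minute Airflow schedule would trigger), here is a hedged Python sketch. The weather API, landing table, and connection details are stand-ins chosen for the example, not the author's actual setup; dbt, Airflow, and Superset handle everything downstream and are not shown.

```python
# Illustrative only: API URL, schema, and credentials are assumptions.
import json

import psycopg2
import requests

API_URL = "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.41&current_weather=true"

def extract_and_load() -> None:
    payload = requests.get(API_URL, timeout=10).json()
    # Assumes a raw landing table: raw_weather(fetched_at timestamptz, payload jsonb).
    conn = psycopg2.connect("dbname=weather user=postgres password=postgres host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO raw_weather (fetched_at, payload) VALUES (now(), %s)",
            (json.dumps(payload),),
        )
    conn.close()

if __name__ == "__main__":
    extract_and_load()  # In the real project, Airflow runs this every 5 minutes.
```

From that raw table, the dbt models (stg_weather_data, weather_report, daily_average) would take over the transformation work inside the database, which is what makes this an ELT rather than ETL design.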
🚀 Built an End-to-End Data Pipeline using API & SQL Server! Excited to share my recent hands-on project where I built a complete data pipeline from scratch 👇

🔹 What I did:
1. Source Database (SQL Server)
2. Create API using FastAPI
3. Expose endpoint (/data)
4. Call API using Python (requests)
5. Get data in JSON format
6. Connect to Target SQL Server
7. Auto-create table (if not exists)
8. Insert data into target table
9. Verify data in SSMS

🔹 Tech Stack: Python | FastAPI | SQL Server | pyodbc | requests

🔹 Key Learnings:
💡 How APIs act as a bridge between systems
💡 Converting JSON data into structured format
💡 Building real-world ETL pipelines
💡 Automating data movement without manual intervention

This project helped me understand how real-world data engineering pipelines work — from data extraction to loading 🚀 Looking forward to building more such projects and improving my skills!

#DataEngineering #Python #FastAPI #SQLServer #ETL #DataPipeline #LearningInPublic #100DaysOfData #BuildingInPublic
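To make the nine steps above more tangible, here is a compressed, hypothetical sketch: a FastAPI endpoint exposing rows from the source SQL Server, and a client function that calls it and loads the JSON into the target server. The table names, columns, and connection strings are invented for the example and are not the author's code.

```python
# Hedged sketch; in practice the API and the loader would live in separate scripts.
import pyodbc
import requests
from fastapi import FastAPI

SOURCE_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=source;DATABASE=src_db;Trusted_Connection=yes"
TARGET_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=target;DATABASE=tgt_db;Trusted_Connection=yes"

app = FastAPI()

@app.get("/data")
def get_data():
    # Steps 1-3: read from the source database and expose it as JSON.
    with pyodbc.connect(SOURCE_CONN) as conn:
        rows = conn.execute("SELECT id, name, amount FROM dbo.sales").fetchall()
    return [{"id": r.id, "name": r.name, "amount": float(r.amount)} for r in rows]

def load_to_target() -> None:
    # Steps 4-8: call the API, auto-create the target table if needed, insert rows.
    records = requests.get("http://localhost:8000/data", timeout=30).json()
    with pyodbc.connect(TARGET_CONN) as conn:
        cur = conn.cursor()
        cur.execute(
            "IF OBJECT_ID('dbo.sales_copy') IS NULL "
            "CREATE TABLE dbo.sales_copy (id INT, name NVARCHAR(100), amount FLOAT)"
        )
        cur.executemany(
            "INSERT INTO dbo.sales_copy (id, name, amount) VALUES (?, ?, ?)",
            [(r["id"], r["name"], r["amount"]) for r in records],
        )
        conn.commit()
```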
🚀 Built an End-to-End Data Pipeline using API, Python & SQL Server! Excited to share a hands-on project where I implemented a complete data pipeline across two systems 💻

🔹 Project Overview:
✔ Extracted data from PostgreSQL (Laptop 1)
✔ Exposed data via Django API (JSON format)
✔ Accessed API from another machine (Laptop 2)
✔ Converted JSON → CSV using Python (pandas)
✔ Dynamically created table (no manual schema!)
✔ Loaded data into SQL Server using pyodbc

🔹 Architecture: PostgreSQL → Django API → JSON → Python → CSV → SQL Server

🔹 Key Learnings:
💡 API as a bridge between systems
💡 Handling JSON data in real-world scenarios
💡 Automating schema creation
💡 Cross-machine data transfer
💡 Building end-to-end ETL pipelines

This project gave me practical exposure to how modern data pipelines work in real-world data engineering 🚀 Looking forward to building more scalable and production-ready pipelines!

#DataEngineering #Python #SQLServer #FastAPI #Django #ETL #DataPipeline #APIs #LearningInPublic #100DaysOfCode
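Here is a small, hypothetical sketch of the consumer side of this architecture (the Laptop 2 script): fetch JSON from the Django API, stage it as CSV with pandas, derive a table schema from the DataFrame dtypes, and load it into SQL Server with pyodbc. The endpoint URL, target table, and type mapping are illustrative assumptions, not the author's implementation.

```python
# Sketch only: URL, table name, and dtype-to-SQL mapping are assumed.
import pandas as pd
import pyodbc
import requests

API_URL = "http://192.168.1.10:8000/api/customers/"   # hypothetical Django endpoint
TARGET_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=stage;Trusted_Connection=yes"

def sql_type(dtype) -> str:
    # Map pandas dtypes to SQL Server column types (simplified).
    if pd.api.types.is_integer_dtype(dtype):
        return "INT"
    if pd.api.types.is_float_dtype(dtype):
        return "FLOAT"
    return "NVARCHAR(255)"

def run() -> None:
    df = pd.DataFrame(requests.get(API_URL, timeout=30).json())
    df.to_csv("customers.csv", index=False)            # intermediate CSV stage

    columns = ", ".join(f"[{c}] {sql_type(t)}" for c, t in df.dtypes.items())
    placeholders = ", ".join("?" for _ in df.columns)

    with pyodbc.connect(TARGET_CONN) as conn:
        cur = conn.cursor()
        cur.execute(f"IF OBJECT_ID('dbo.customers') IS NULL CREATE TABLE dbo.customers ({columns})")
        cur.executemany(
            f"INSERT INTO dbo.customers VALUES ({placeholders})",
            df.astype(object).where(df.notna(), None).values.tolist(),
        )
        conn.commit()

if __name__ == "__main__":
    run()
```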
New project unlocked 🔓 I just finished building a 𝗖𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗟𝗶𝗳𝗲𝘁𝗶𝗺𝗲 𝗩𝗮𝗹𝘂𝗲 (𝗖𝗟𝗩) 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 𝗦𝘆𝘀𝘁𝗲𝗺.

The starting question: 𝘩𝘰𝘸 𝘮𝘶𝘤𝘩 𝘳𝘦𝘷𝘦𝘯𝘶𝘦 𝘸𝘪𝘭𝘭 𝘦𝘢𝘤𝘩 𝘤𝘶𝘴𝘵𝘰𝘮𝘦𝘳 𝘨𝘦𝘯𝘦𝘳𝘢𝘵𝘦 𝘰𝘷𝘦𝘳 𝘵𝘩𝘦𝘪𝘳 𝘭𝘪𝘧𝘦𝘵𝘪𝘮𝘦 𝘪𝘯 𝘰𝘶𝘳 𝘣𝘶𝘴𝘪𝘯𝘦𝘴𝘴?

Using the PostgreSQL DVD Rental dataset, I built an end-to-end pipeline:
- Designed an ETL pipeline that processes ~14,000 transactions from 9 tables into a customer-level OLAP star schema
- Engineered RFM-based features (Recency, Frequency, Monetary) for CLV modeling
- Trained and compared multiple ML models (Linear Regression, Random Forest, Gradient Boosting) using a chronological split and TimeSeriesSplit to avoid data leakage
- Deployed everything into an interactive Django web app with a prediction form and business recommendations
- The final model (Gradient Boosting) achieved strong performance, with R² close to 0.99 and low prediction error

One insight that came out of the analysis: customers who rent frequently, even at lower spend per transaction, often generate more lifetime value than occasional high spenders. Frequency matters more than monetary average!

One limitation is that the dataset is static (historical DVD rental data), so the model reflects past behavior patterns rather than real-time customer activity. Additionally, some features like recency and tenure showed very low importance, likely due to the limited time range of the dataset, but they were still kept to ensure the model remains interpretable, aligned with business logic, and more generalizable to real-world scenarios beyond this dataset.

This project helped me understand how data engineering, machine learning, and business thinking come together in a real system, not just a model.

🖇️ GitHub → https://lnkd.in/g4k7iQuy

Would love any feedback or thoughts! 🖖🏻

#DataAnalytics #MachineLearning #Django #Python #PostgreSQL #PortfolioProject
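As a loose sketch of the modeling idea (not the project's actual code), the snippet below shows how RFM features and a leakage-free target could be built with a chronological cutoff and fed to a Gradient Boosting model. Column names such as customer_id, payment_date, and amount are assumptions; the real project derives them from the DVD Rental star schema and evaluates with TimeSeriesSplit.

```python
# Simplified sketch; column names and the cutoff are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def build_rfm(payments: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    history = payments[payments["payment_date"] < cutoff]
    rfm = history.groupby("customer_id").agg(
        recency_days=("payment_date", lambda s: (cutoff - s.max()).days),
        frequency=("payment_date", "count"),
        monetary=("amount", "mean"),
    )
    # Target = revenue after the cutoff, so the features never see future data.
    future = payments[payments["payment_date"] >= cutoff]
    rfm["future_revenue"] = future.groupby("customer_id")["amount"].sum()
    return rfm.fillna({"future_revenue": 0.0})

def train_clv_model(rfm: pd.DataFrame) -> GradientBoostingRegressor:
    X = rfm[["recency_days", "frequency", "monetary"]]
    y = rfm["future_revenue"]
    # The real project compares several models with TimeSeriesSplit;
    # this just fits the final Gradient Boosting choice.
    return GradientBoostingRegressor(random_state=42).fit(X, y)
```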
💬 SQL Challenge of the Day

📝❓ Question
Using the "Recursive CTEs" topic, write a SQL query to generate a Fibonacci sequence up to the 10th number. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the two preceding numbers.

💡 Answer
```sql
WITH RECURSIVE FibonacciCTE AS (
    -- Base case: n = 0 gives fib = 0; the next value in the sequence is 1
    SELECT 0 AS n, 0 AS fib, 1 AS next_fib
    UNION ALL
    -- Recursive step: shift the pair forward and sum to get the next number
    SELECT n + 1, next_fib, fib + next_fib
    FROM FibonacciCTE
    WHERE n < 9
)
SELECT fib AS Fibonacci_10th_Number
FROM FibonacciCTE
WHERE n = 9;
```

✨ Explanation
This query uses a recursive common table expression (CTE) to generate the Fibonacci sequence. The anchor row seeds the first pair (0 and 1), and each recursive step shifts the pair forward, summing the two previous values to produce the next number. The recursion stops once n reaches 9, and the final SELECT returns the 10th number in the sequence (n = 9 when counting from 0).

🛠️ Example (for ease of understanding)
For the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
The query will output:
```
Fibonacci_10th_Number
34
```

#Hashtags #PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
Data quality checks across 10 columns = 10 queries? You don't need that. STRING_AGG handles everything in a single column 👇

Handling NULLs and inconsistencies can be a BIG headache in large datasets. But it doesn't have to be. Using STRING_AGG + UNION ALL + a subquery, you can have all errors/nulls pointed out in a single column.

It works this way:
🔹 Subquery: You place all the null/sanity checks in the subquery
🔹 UNION ALL: Stacks all the errors by key (customers in the example below). If multiple columns have errors, each customer will have multiple rows in the subquery output
🔹 STRING_AGG: Collapses the errors flagged in the subquery into a single column. If there was only one error column it will bring it; if there was no error, it will be NULL

⚠️ NOTE: STRING_AGG may not work the same across engines. It is supported by engines like PostgreSQL, BigQuery, and Redshift, while SQLite uses GROUP_CONCAT, but the idea is the same.
⚠️ NOTE 2: The order of errors within the cell may vary depending on the database engine.
⚠️ NOTE 3: This solution assumes no duplicate keys in your dataset. If duplicates exist, errors may repeat within the cell. Consider removing duplicates first.

Which tricks do you use to ensure data quality? Leave it in the comments 👇

📌 Save it and never waste time hunting down errors again.

#SQLTips #DataAnalytics #DataScience #SQL #DataPipeline #DataQuality #DataEngineer #Python
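Here is a minimal sketch of that pattern run from Python against PostgreSQL via SQLAlchemy and pandas. The customers table and its checks are hypothetical; each branch of the UNION ALL flags one issue, and STRING_AGG collapses the flags into a single errors column per key.

```python
# Illustrative only: table, columns, and connection string are assumed.
import pandas as pd
from sqlalchemy import create_engine

QUALITY_CHECK_SQL = """
SELECT customer_id,
       STRING_AGG(error, ', ' ORDER BY error) AS errors
FROM (
    SELECT customer_id, 'missing email' AS error
    FROM customers WHERE email IS NULL
    UNION ALL
    SELECT customer_id, 'negative balance' AS error
    FROM customers WHERE balance < 0
    UNION ALL
    SELECT customer_id, 'missing signup date' AS error
    FROM customers WHERE signup_date IS NULL
) AS checks
GROUP BY customer_id;
"""

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")
report = pd.read_sql(QUALITY_CHECK_SQL, engine)  # one row per customer with issues
print(report.head())
```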
I spent the last few weeks building something I'm genuinely proud of. It started with a simple question: what does a production-style data pipeline actually look like when you build it from scratch? So I built one.

𝐎𝐩𝐬𝐏𝐮𝐥𝐬𝐞-𝐍𝐘𝐂-𝐓𝐚𝐱𝐢-𝐏𝐢𝐩𝐞𝐥𝐢𝐧𝐞 — a modular ETL pipeline that pulls NYC Yellow Taxi trip data, cleans it, transforms it, and loads it into a SQL Server database for analysis.

Here's what I learned along the way:
→ Clean architecture isn't optional. When your pipeline breaks at 2am, you'll thank yourself for writing modular code.
→ The pipeline fails loudly, not silently. HTTP errors, missing values, duplicates — nothing slips through quietly. Because bad data that goes unnoticed is worse than a pipeline that stops.
→ Logging is your best friend. If you can't observe it, you can't debug it.
→ A fail-fast strategy saves hours. If extract fails, nothing else runs. Simple. Brutal. Effective.

Tech I used: Python · Pandas · Parquet · MSSQL Server · Requests · Custom logging

The pipeline has 3 stages:
Extract → you enter a month and year, the pipeline fetches the exact Parquet file for that period — no hardcoding, no manual downloads
Transform → deduplicates, cleans nulls, engineers features, aggregates revenue per day
Load → writes structured, clean data directly into MSSQL Server — query-ready from day one

GitHub link in the comments 👇

#DataEngineering #ETL #Datapipeline #Python #MSSQL #DataWarehouse #LearningInPublic
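For context, a hedged sketch of those three stages might look like the snippet below. The TLC download URL pattern, column names, and connection string are assumptions on my part, and the real repository adds custom logging and stricter validation around each stage.

```python
# Sketch only: URL pattern, columns, and credentials are assumed.
import pandas as pd
import requests
from sqlalchemy import create_engine

BASE_URL = "https://d37ci6npvogzzg.cloudfront.net/trip-data/yellow_tripdata_{year}-{month:02d}.parquet"

def extract(year: int, month: int) -> str:
    path = f"yellow_{year}_{month:02d}.parquet"
    resp = requests.get(BASE_URL.format(year=year, month=month), timeout=120)
    resp.raise_for_status()            # fail fast: stop the run on HTTP errors
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

def transform(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    df = df.drop_duplicates().dropna(subset=["tpep_pickup_datetime", "total_amount"])
    df["pickup_date"] = df["tpep_pickup_datetime"].dt.date
    # Aggregate revenue per day, as described in the post.
    return df.groupby("pickup_date", as_index=False)["total_amount"].sum()

def load(daily: pd.DataFrame) -> None:
    engine = create_engine("mssql+pyodbc://user:password@localhost/taxi?driver=ODBC+Driver+17+for+SQL+Server")
    daily.to_sql("daily_revenue", engine, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract(2024, 1)))
```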
Shipped an end-to-end project I’ve been building: Incremental PySpark ETL Pipeline with Audit Logging 🚀 Built this to move beyond one-time ETL scripts and think more in terms of production-style data pipelines. One concept I particularly enjoyed implementing was idempotent incremental loading - detecting already processed user_ids and loading only net-new records during reruns. GitHub Repo: https://lnkd.in/dE5sNvxa Would appreciate feedback from Data Engineers and fellow builders. #DataEngineering #PySpark #ApacheSpark #ETL #Python #DataPipelines #GitHubProjects
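A minimal sketch of the idempotent incremental idea (not the repo's code) could look like this in PySpark: compare incoming user_ids against what already exists in the target and append only the net-new rows, so reruns of the same batch load nothing twice. Paths and column names are assumptions.

```python
# Sketch only: paths and column names are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental_load_sketch").getOrCreate()

incoming = spark.read.parquet("data/incoming_users/")
existing_ids = spark.read.parquet("warehouse/users/").select("user_id")

# Anti-join keeps only user_ids not yet present in the target.
net_new = incoming.join(existing_ids, on="user_id", how="left_anti").cache()

# Materialize before writing back to the same location, and record a simple
# audit count (a stand-in for the project's audit logging).
loaded = net_new.count()
net_new.write.mode("append").parquet("warehouse/users/")
print(f"Loaded {loaded} net-new records")
```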
🚀 Excited to share my recent learning on ETL (Extract, Transform, Load)! Over the past few days, I’ve been exploring how ETL plays a crucial role in data analytics by enabling efficient data integration from multiple sources. ETL involves extracting raw data, transforming it into a clean and structured format, and loading it into systems for analysis and reporting. I also gained hands-on understanding of how ETL processes are implemented using tools and technologies like Python, SQL, and Excel for data cleaning, transformation, and pipeline creation. This process is essential for ensuring data quality, consistency, and reliability in real-world analytics workflows. Looking forward to applying these concepts in building efficient data pipelines and deriving meaningful insights from data. #DataAnalytics #ETL #DataEngineering #Python #SQL #LearningJourney