I finally understand why data scientists say they spend 80% of their time on data. 📊 This week, instead of just reading about the ML lifecycle, I actually did the second step: Data Collection. 🎯 I built my own dataset called "TMDB Top Rated Movies" using their public API. 🎬 It was interesting to see how data can come from different sources: some datasets are already available in formats like CSV and JSON, others can be retrieved from SQL databases, and data can also be collected through APIs or even web scraping, depending on the use case. Nothing fancy. Just: 🐍 Python 📡 A bunch of API calls 🔄 Figuring out how to loop through pages without breaking everything In the end, I pulled together 10,000+ movie records: clean, structured, and ready for actual analysis or ML. 📁✅ This part felt more like real engineering than anything I have done in a notebook. 🛠️ Small step. But it's real. 🚀 Dataset link: https://lnkd.in/dG7EcE5q #MachineLearning #DataScience #Python #LearningByDoing
Building TMDB Top Rated Movies Dataset with Python
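A minimal sketch of the kind of pagination loop the post describes, assuming TMDB's documented /movie/top_rated endpoint; the field names, the polite delay, and the 500-page cap are my assumptions, not the author's actual code:

```python
import time
import requests
import pandas as pd

API_KEY = "YOUR_TMDB_API_KEY"  # placeholder: get a key from themoviedb.org
URL = "https://api.themoviedb.org/3/movie/top_rated"

records = []
page = 1
while True:
    resp = requests.get(URL, params={"api_key": API_KEY, "page": page}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    records.extend(data["results"])                 # each page returns ~20 movies
    if page >= data["total_pages"] or page >= 500:  # TMDB caps paging around 500
        break
    page += 1
    time.sleep(0.3)                                 # be polite to the API

df = pd.DataFrame(records)
df.to_csv("tmdb_top_rated_movies.csv", index=False)
print(df.shape)
```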
More Relevant Posts
📊 Beyond the Bell Curve: Handling "Messy" Data in Python

As data scientists, we often dream of perfect, Gaussian (normal) distributions. But in the real world, especially with variables like car prices or housing data, the data is rarely "normal." I recently worked through a project involving left-skewed and non-parametric data. Here's a breakdown of how I handled it using Python:

1️⃣ Identifying the Shape
Before running any tests, I used Matplotlib to visualize the distribution. A high bin count (150) helped reveal a significant left skew, where the mean was being pulled down by a long tail of lower-priced entries.

plt.hist(prices, bins=150)
plt.show()

2️⃣ The Transformation Strategy
When data is left-skewed, standard parametric tests (like t-tests) can become biased. To pull that tail back toward the center and achieve symmetry, I explored square (x²) and cube (x³) transformations. By stretching the right side of the distribution more than the left, these mathematical shifts can often "normalize" the data, allowing for more powerful statistical modeling.

3️⃣ When to Stay Non-Parametric
If the data is truly non-parametric (multimodal or containing extreme gaps), forcing a transformation isn't the answer. In those cases, I pivot to rank-based tests like:
✅ Mann-Whitney U (instead of the t-test)
✅ Kruskal-Wallis (instead of ANOVA)
✅ Spearman's rank (instead of Pearson correlation)

The takeaway: don't just import your library and hit "run." Understanding the geometry of your data is the difference between a biased model and an accurate insight. 💡

#DataScience #Python #Statistics #MachineLearning #Pandas #DataAnalytics #DataIntegrity
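To make the three steps above concrete, here is a small self-contained sketch on synthetic left-skewed prices; the numbers and column names are invented for illustration and are not from the original project:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic left-skewed "prices": most values high, with a long tail of cheaper entries
rng = np.random.default_rng(42)
prices = np.clip(100_000 - rng.exponential(scale=15_000, size=5_000), 1_000, None)

# 1) Visualize the shape first; a high bin count exposes the skew
plt.hist(prices, bins=150)
plt.xlabel("price")
plt.show()

# 2) Power transforms (x^2, x^3) stretch the upper end and reduce the left skew
print("skew before:", round(stats.skew(prices), 2))
print("skew after x^2:", round(stats.skew(prices ** 2), 2))

# 3) If the shape stays non-normal, reach for rank-based tests instead, e.g. Mann-Whitney U
group_a, group_b = prices[:2_500], prices[2_500:]
u_stat, p_value = stats.mannwhitneyu(group_a, group_b)
print("Mann-Whitney U p-value:", round(p_value, 3))
```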
Most datasets don't fail because of bad models. They fail because the data is messy. This is exactly where Pandas becomes a game changer. Instead of struggling with raw data, you can turn chaos into structure within seconds.

Example:

import pandas as pd

data = {
    "name": ["A", "B", "C"],
    "marks": [85, 90, 78]
}
df = pd.DataFrame(data)
print(df)

Now imagine this with 10,000 rows. Cleaning, filtering, analyzing: all of it becomes manageable.

What makes Pandas powerful?
* Easy handling of tabular data
* Built-in functions for cleaning
* Fast filtering and grouping

Reality check: in Data Science, most of your time is not spent building models. It is spent fixing data. Pandas doesn't just help you analyze data. It helps you prepare it for real impact.

#DataScience #Pandas #Python #DataAnalysis #LearningInPublic
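For a slightly larger taste of the cleaning, filtering, and grouping the post mentions, here is a hedged sketch on a toy frame; the extra columns and imputation choices are illustrative, not from the original post:

```python
import pandas as pd

# Toy frame with a little of the messiness real data brings
df = pd.DataFrame({
    "name":    ["A", "B", "C", "D", None],
    "subject": ["math", "math", "science", "science", "math"],
    "marks":   [85, 90, 78, None, 88],
})

df = df.dropna(subset=["name"])                         # drop rows missing a name
df["marks"] = df["marks"].fillna(df["marks"].median())  # impute missing marks
top = df[df["marks"] >= 80]                             # fast filtering
print(top.groupby("subject")["marks"].mean())           # grouping in one line
```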
🐍 Data Science tip: automate variable type detection before choosing your preprocessing strategy.

One of the most overlooked steps in data preparation is correctly identifying the nature of each variable, because imputation and transformation strategies depend entirely on variable type. Instead of guessing, you can systematically classify variables using simple Python logic:

categorical = df.select_dtypes(include=['object', 'category']).columns
numerical = df.select_dtypes(include=['int64', 'float64']).columns
ordinal = [col for col in numerical if df[col].nunique() < 10]

💡 Then adapt your preprocessing strategy accordingly:
Categorical → mode / encoding
Numerical → mean or median
Ordinal / discrete → careful handling (depends on context)

🔍 Key idea: before choosing how to impute or transform data, you must first understand what type of variable you're working with. Good data science starts with structure, not models.

#Python #DataScience #MachineLearning #DataEngineering #Pandas
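As a rough illustration of adapting imputation to variable type, here is a self-contained sketch that extends the snippet above; the toy columns and the nunique() < 10 threshold are illustrative assumptions, and the threshold only makes sense on realistically sized data:

```python
import pandas as pd

# Hypothetical frame mixing variable types
df = pd.DataFrame({
    "city":   ["Paris", "Lyon", None, "Paris"],
    "income": [32_000.0, None, 41_000.0, 55_000.0],
    "rooms":  [2, 3, None, 2],
})

categorical = df.select_dtypes(include=["object", "category"]).columns
numerical = df.select_dtypes(include=["int64", "float64"]).columns
ordinal = [col for col in numerical if df[col].nunique() < 10]

# Impute according to variable type, as suggested above
for col in categorical:
    df[col] = df[col].fillna(df[col].mode()[0])              # mode for categoricals
for col in numerical:
    if col in ordinal:
        df[col] = df[col].fillna(df[col].median()).round()   # keep discrete values discrete
    else:
        df[col] = df[col].fillna(df[col].median())           # median for continuous
print(df)
```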
Nobody talks about the quiet revolution that already happened in Python data tooling. Pandas was the default for years. Comfortable. Familiar. Everywhere. But in 2024–2025, something shifted. Here's what the modern Python data stack actually looks like now:

→ DuckDB for analytical queries on local files. No server. No setup. Just SQL that runs faster than you expect, directly on CSVs and Parquet files.

→ Polars for dataframe operations. Written in Rust. Built from scratch for multi-core CPUs. Lazy evaluation by default. On large datasets, it's not 2× faster than Pandas; it's often 10–50×.

→ Pandas is still useful, but mostly as a last step for compatibility, not for computation.

The real insight here isn't the tools. It's the mental model. The old stack was: load → transform → analyze (all in Pandas). The new stack is: query first (DuckDB) → transform fast (Polars) → output clean (Pandas if needed).

If you're still running df.groupby() on a 5M-row CSV in Pandas and wondering why your laptop fan is screaming, this is for you.

I wrote a deep dive on exactly this shift, covering benchmarks, real code comparisons, and when to use which tool. Follow for more practical AI & data engineering content.

What's your current go-to for data wrangling? Still Pandas, or have you made the switch? 👇

#Pandas #Python #DataScience #AI #DataCleaning
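Here is a hedged sketch of the query-first flow described above, using a tiny stand-in events.csv with hypothetical user_id and amount columns; it relies on DuckDB's .pl() bridge and Polars' newer group_by API, so exact method names may vary with library versions:

```python
import duckdb
import pandas as pd
import polars as pl

# Tiny stand-in file so the sketch runs end-to-end (hypothetical schema)
pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "amount": [10.0, -2.0, 5.0, 7.5, 3.0],
}).to_csv("events.csv", index=False)

# 1) Query first: DuckDB runs SQL directly on the CSV, no load step
rel = duckdb.sql("SELECT user_id, amount FROM 'events.csv' WHERE amount > 0")

# 2) Transform fast: hand the result to Polars for multi-core, lazy work
summary = (
    rel.pl()
       .lazy()
       .group_by("user_id")
       .agg(pl.col("amount").sum().alias("total_spend"))
       .collect()
)

# 3) Output clean: drop to Pandas only if a downstream library needs it
print(summary.to_pandas())
```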
Nobody warns you about this when you start working with data. I once had a huge dataset with multiple subheaders, inconsistent formatting, and way too much going on. Honestly, I did not even know where to start. I spent so much time just trying to make sense of it before even writing a single line of analysis. And even after cleaning it, the work was not over. Understanding what the data is actually saying, digging through it, and finding meaningful insights... that is a whole different challenge. And it takes time. A lot of it. But when it finally clicked, when the data was clean, the insights made sense, and the dashboard actually came together, it felt like I had moved mountains. That is when I realized that the real work in data is not the fancy visualization at the end. It is everything that comes before it: cleaning, restructuring, understanding, and finding the story hidden in the numbers. That part does not get talked about enough. But honestly, that is where most of the learning happens. #DataAnalytics #Python #Pandas #DataVisualization #DashboardDesign
🎉 Happy Friday everyone! Here is this week's round-up of interesting data analytics news, libraries, articles and papers. Enjoy! #dataanalytics #data #datascience #ai #ml #llm #dataengineering #python #pandas #gis

𝗖𝗵𝗮𝗻𝗴𝗲 𝗗𝗮𝘁𝗮 𝗖𝗮𝗽𝘁𝘂𝗿𝗲: 𝗦𝘁𝗼𝗽 𝗖𝗼𝗽𝘆𝗶𝗻𝗴 𝟱𝟬𝗠 𝗥𝗼𝘄𝘀 𝘁𝗼 𝗠𝗼𝘃𝗲 𝟱𝗞 𝗖𝗵𝗮𝗻𝗴𝗲𝘀 – an excellent comparison of three CDC patterns: timestamps, triggers, and log-based CDC ➡️ https://lnkd.in/gmTb5ftk

𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗠𝗼𝗱𝗲𝗹-𝗗𝗿𝗶𝘃𝗲𝗻 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗖𝗵𝗮𝗻𝗴𝗲 𝗗𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻 𝗶𝗻 𝗥𝗲𝗺𝗼𝘁𝗲 𝗦𝗲𝗻𝘀𝗶𝗻𝗴 𝗜𝗺𝗮𝗴𝗲𝗿𝘆 – an interesting paper using semantic change detection to track changes on the earth's surface ➡️ https://lnkd.in/gsNb6BHE

𝗖𝗹𝗮𝘂𝗱𝗲 𝗖𝗼𝗱𝗲’𝘀 𝗦𝗼𝘂𝗿𝗰𝗲 𝗚𝗼𝘁 𝗟𝗲𝗮𝗸𝗲𝗱. 𝗛𝗲𝗿𝗲’𝘀 𝗪𝗵𝗮𝘁’𝘀 𝗔𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗪𝗼𝗿𝘁𝗵 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 – an interesting look at the 512,000 lines of TypeScript that make up a coding agent like Claude Code ➡️ https://lnkd.in/g-wRgf2W

𝗟𝗟𝗠 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗚𝗮𝗹𝗹𝗲𝗿𝘆 – a collection of architectural diagrams, fact sheets, and technical reports of various LLM architectures ➡️ https://lnkd.in/gTNbgKPw

𝗪𝗵𝗮𝘁'𝘀 𝗻𝗲𝘄 𝗶𝗻 𝗽𝗮𝗻𝗱𝗮𝘀 𝟯 – an explanation of the real-world differences between pandas 3 and pandas 2 ➡️ https://lnkd.in/gW9AFasB
Just finished my first ever data science project through SRM Insider's AI/ML domain and it was a lot more interesting than I expected. The task was predicting hourly bike rental demand in Washington D.C. using 17000+ hours of real weather and calendar data. What surprised me most was that creating the right features mattered way more than picking a fancy model. Giving the model a memory of what happened an hour ago moved the needle more than any tuning did. Final result: 3x better than the baseline, average error down from 132 to 44 on unseen data. Biggest takeaway: EDA and understanding your data before building anything matters more than the model itself. Built with Python, XGBoost, Pandas and a lot of plot staring 😄 GitHub: https://lnkd.in/gP9MNnis #DataScience #MachineLearning #Python #XGBoost #SRMInsider
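Not the project's actual code, but a minimal sketch of the lag-feature idea on a synthetic hourly series; the column names, lags, and hyperparameters are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic hourly demand standing in for the bike-rental data (illustrative only)
rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=2_000, freq="h")
demand = 100 + 50 * np.sin(2 * np.pi * hours.hour / 24) + rng.normal(0, 10, len(hours))
df = pd.DataFrame({"datetime": hours, "count": demand})

# Give the model a "memory" of recent demand: lag features
df["hour"] = df["datetime"].dt.hour
df["count_lag_1h"] = df["count"].shift(1)
df["count_lag_24h"] = df["count"].shift(24)
df = df.dropna()

X = df[["hour", "count_lag_1h", "count_lag_24h"]]
y = df["count"]
# shuffle=False keeps the time order, so the test set is "unseen" future data
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)

model = XGBRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)
print("MAE:", round(mean_absolute_error(y_test, model.predict(X_test)), 1))
```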
What if cleaning messy datasets took seconds instead of hours? 👀

🚀 I built an industrial-grade data cleaning tool that turns messy datasets into ML-ready data in seconds. While working with real-world datasets, I kept facing the same problem:
❌ messy columns
❌ missing values
❌ inconsistent formats
❌ hours wasted before even starting ML

So I built DataForge Pro 👇

⚙️ What it does:
• Auto-cleans datasets (missing values, duplicates, types)
• Detects & handles outliers (IQR / Z-score)
• Converts messy strings like "$1,200" → numeric
• Generates a full visual report (6 charts)
• Gives an ML Readiness Score (0–100)

💡 Why this matters: data scientists spend ~70–80% of their time on cleaning. This tool reduces that to seconds.

🌐 Live Demo: https://lnkd.in/ggr8TjQK
📂 GitHub: https://lnkd.in/g6eSXaz2
📊 Built with: Python • Streamlit • pandas • scikit-learn

This is just v1; planning to add:
→ AI-powered cleaning suggestions
→ Polars for big data
→ REST API version

Would love your feedback 🙌 Open to collaborations & improvements!

#DataScience #Python #Streamlit #MachineLearning #OpenSource #BuildInPublic
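For readers curious what the "$1,200 → numeric" and IQR steps might look like in plain pandas, here is a hedged sketch; it is not DataForge Pro's actual implementation, and the toy column is invented:

```python
import pandas as pd

# Toy column with the kind of messiness the post describes
df = pd.DataFrame({"price": ["$1,200", "950", "$2,050.50", "1,000,000", "80"]})

# Messy strings like "$1,200" -> numeric
df["price"] = (
    df["price"].astype(str)
    .str.replace(r"[^0-9.]", "", regex=True)   # keep digits and the decimal point
    .pipe(pd.to_numeric, errors="coerce")      # anything unparseable becomes NaN
)

# Flag outliers with the IQR rule (1.5 * IQR beyond the quartiles)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
within_bounds = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[within_bounds])
```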
Correlation tells you what moved together. Causal inference tells you what actually caused it. After this, you'll be able to estimate the true causal effect of any intervention, a promo, a product change, a policy shift, from observational data. No A/B test required.

The technique: Propensity Score Matching (PSM) in Python.

𝗦𝘁𝗲𝗽 𝟭: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹

```bash
pip install causalinference
```

𝗦𝘁𝗲𝗽 𝟮: 𝗣𝗿𝗲𝗽𝗮𝗿𝗲 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮
You need three columns: outcome Y, binary treatment D, and confounders X.

```python
import pandas as pd

df = pd.read_csv("observational_data.csv")
Y = df["revenue"].values
D = df["received_promo"].values  # 1 = treated, 0 = control
X = df[["age", "tenure", "spend_last_90d"]].values
```

𝗦𝘁𝗲𝗽 𝟯: 𝗕𝘂𝗶𝗹𝗱 𝗮𝗻𝗱 𝗿𝘂𝗻 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹

```python
from causalinference import CausalModel

model = CausalModel(Y, D, X)
model.est_via_matching()
print(model.estimates)
```

𝗦𝘁𝗲𝗽 𝟰: Read your results
The key output is the ATE (Average Treatment Effect): the estimated causal lift, adjusted for selection bias.

📌 Always run `model.summary_stats` first. If the treated and control groups don't overlap in propensity score distribution, your estimate is invalid; check covariate balance before trusting any number.

The result: instead of "promo users had 23% higher revenue," you can say "the promo caused a £42 average revenue lift, controlling for age and prior spend." That's a claim your finance team can't easily dismiss.

Have you applied causal inference in a real project? What's the hardest part to justify to non-technical stakeholders?

#DataAnalytics #Data #Python #DataScience #Analytics #Statistics #CausalInference #BusinessIntelligence
𝗜𝗳 𝘆𝗼𝘂 𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗱𝗮𝘁𝗮, 𝘆𝗼𝘂 𝗸𝗻𝗼𝘄 𝘁𝗵𝗶𝘀 — 𝗽𝗵𝗼𝗻𝗲 𝗻𝘂𝗺𝗯𝗲𝗿𝘀 𝗮𝗿𝗲 𝗻𝗲𝘃𝗲𝗿 𝗰𝗹𝗲𝗮𝗻

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like "+", "-", or even brackets. And sometimes they even come with .00 at the end because of how the data is stored or exported. If we don't clean them properly, it becomes very difficult to use that data for analysis or communication.

In Pandas, cleaning phone number columns is actually simple once you understand the approach.

First, I usually convert the column to string format. This avoids unexpected issues, especially when numbers are stored as integers, floats, or mixed types.

After that, the main step is removing unwanted characters. Using regular expressions, we can keep only digits and remove everything else, including .00, symbols, and spaces. For example:

df['phone'] = df['phone'].astype(str).str.replace(r'[^0-9]', '', regex=True)

This one line can handle most messy formats.

One important step I always follow is standardizing the final output. No matter how the number comes, I take only the last 10 digits. This helps remove country codes like +91 and keeps the data consistent. Something like:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid. Some may be too short or too long. So I often filter numbers based on length to make sure we only keep meaningful data. If needed, I also format the numbers again in a clean and readable way.

What I learned from this is simple: data cleaning is not about writing complex code, it's about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy. Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
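Putting the steps above together, here is a hedged end-to-end sketch on invented sample values; the extra first step that strips a trailing ".0"/".00" before removing other characters is my addition, so float-exported numbers don't pick up stray zeros before the last-10-digits cut:

```python
import pandas as pd

# Toy examples of the messy formats described above
df = pd.DataFrame({"phone": ["+91 98765-43210", "(987) 654 3210", 9876543210.00, "12345"]})

df["phone"] = (
    df["phone"].astype(str)
    .str.replace(r"\.0+$", "", regex=True)    # drop a trailing ".0"/".00" from float exports first
    .str.replace(r"[^0-9]", "", regex=True)   # keep digits only
    .str[-10:]                                # standardize to the last 10 digits
)

# Validation: keep only numbers that are exactly 10 digits long
df = df[df["phone"].str.len() == 10]
print(df)
```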