🐋 Meet Narwhals: the missing layer in your data stack!

If you've ever struggled with switching between pandas, Polars, PySpark, or DuckDB, this is for you 👉 Narwhals is a lightweight, extensible compatibility layer that lets you write dataframe-agnostic code once, and run it everywhere.

💡 Why it's a game changer:
✅ Write code once, run across multiple dataframe engines
✅ Zero dependencies, stays lightweight
✅ Familiar API inspired by Polars
✅ Works with both eager and lazy execution
✅ Keeps native performance, no heavy abstraction cost

🔥 Imagine building libraries or pipelines that:
• Automatically adapt to pandas, Polars, or Spark
• Scale from local to distributed without rewrites
• Stay clean, typed, and maintainable

This is a big step toward true interoperability in the Python data ecosystem. If you're building data tools, this is definitely worth exploring (a quick sketch of the API is below).

📚 Learn more: https://lnkd.in/esq7vwJi
🔗 GitHub repo: https://lnkd.in/eMBNQ8eQ

#DataEngineering #Python #DataScience #BigData #Analytics #OpenSource #AI #MachineLearning
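For a concrete picture of what dataframe-agnostic code looks like, here is a minimal sketch using Narwhals' `from_native` / `to_native` pattern. The column name `price` and the standardization logic are illustrative assumptions, not from the post:

```python
import narwhals as nw
from narwhals.typing import IntoFrameT


def add_standardized_column(df_native: IntoFrameT) -> IntoFrameT:
    """Standardize the 'price' column, whatever backend df_native comes from."""
    # Wrap the native object (pandas, Polars, ...) once...
    df = nw.from_native(df_native)
    # ...express the logic once, in a Polars-like expression API...
    result = df.with_columns(
        ((nw.col("price") - nw.col("price").mean()) / nw.col("price").std())
        .alias("price_std")
    )
    # ...and hand back the same type of object the caller passed in.
    return result.to_native()
```

Call it with a pandas DataFrame and you get a pandas DataFrame back; call it with a Polars LazyFrame and the computation stays lazy.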
⚡️ Still waiting minutes for Pandas to crunch your data? There's a faster way!

If you're tired of slowdowns in your Python workflow, you'll love our latest breakdown of how **Polars** is turbocharging data handling for teams everywhere. Uncover how you can effortlessly upgrade your old Pandas code, instantly *speed up* analyses, and unlock new productivity.

Ready to make your data work for you, not against you? Read the full article and join the next-gen data revolution: https://lnkd.in/dW3debwY
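As a hedged illustration of how direct the migration can be, here is a pandas aggregation next to its Polars equivalent. The file name and column names are placeholders, not from the article:

```python
import pandas as pd
import polars as pl

# pandas: eager execution, the whole CSV is loaded into memory up front.
pdf = pd.read_csv("sales.csv")
pandas_result = pdf[pdf["amount"] > 0].groupby("region")["amount"].sum()

# Polars: same logic, multi-threaded, and lazy scanning means the query
# engine only reads the columns and rows it actually needs.
polars_result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum())
    .collect()
)
```

(`group_by` is the spelling in recent Polars releases; older versions used `groupby`.)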
🚀 Day 25 of My Data Engineering Learning Journey

Today I focused on strengthening my fundamentals with Pandas and understanding how it compares with PySpark.

🔹 Learned the basics of Pandas: DataFrames, Series, data selection, filtering, and basic transformations. It's very powerful for handling small to medium-sized datasets efficiently.

🔹 Explored the differences between PySpark and Pandas:
- Pandas works on a single machine and is best for smaller datasets.
- PySpark works in a distributed environment and is ideal for handling big data.
- Pandas is faster for small data, while PySpark scales better for large data processing.

💡 Key takeaway: the choice depends on data size and use case. Use Pandas for quick analysis and PySpark for large-scale data processing (a side-by-side sketch is below).

#Day25 #Pandas #PySpark #DataEngineering #BigData #Python #LearningJourney
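To make the comparison concrete, here is the same filter-and-aggregate written in both libraries. The column names and file path are illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Pandas: everything happens in memory on one machine.
pdf = pd.read_csv("orders.csv")
small_result = pdf[pdf["amount"] > 100].groupby("city")["amount"].sum()

# PySpark: identical logic, but the plan executes across a cluster,
# so it scales to data that would never fit on a single machine.
spark = SparkSession.builder.appName("orders").getOrCreate()
sdf = spark.read.csv("orders.csv", header=True, inferSchema=True)
big_result = (
    sdf.filter(F.col("amount") > 100)
    .groupBy("city")
    .agg(F.sum("amount").alias("total_amount"))
)
big_result.show()
```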
📊 Pandas: The Backbone of Data Analysis in Python

From raw data to meaningful insights — that's the real power of Pandas. 🚀

Whether you're cleaning messy datasets, exploring patterns, or building data-driven solutions, Pandas makes everything faster, simpler, and more intuitive.

🔹 Handle missing data effortlessly
🔹 Work with multiple file formats (CSV, Excel, SQL)
🔹 Perform powerful data manipulation & aggregation
🔹 Apply custom functions with ease

💡 What I love most? Turning complex, unstructured data into clean, structured insights that actually drive decisions.

If you're stepping into Data Analytics or Data Science, mastering Pandas is not optional — it's essential.

#DataAnalytics #Python #Pandas #DataScience #LearningJourney #DataVisualization #AI #TechSkills
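Each of those bullets maps to roughly a one-liner. A minimal sketch, with the file and column names as assumptions:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # pandas also offers read_excel() and read_sql()

# Handle missing data effortlessly
df["age"] = df["age"].fillna(df["age"].median())

# Powerful manipulation & aggregation
revenue_by_city = df.groupby("city")["revenue"].sum().sort_values(ascending=False)

# Apply custom functions with ease (assumes 'name' holds strings)
df["name_clean"] = df["name"].apply(lambda s: str(s).strip().title())
```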
I finally understand why data scientists say they spend 80% of their time on data. 📊

This week, instead of just reading about the ML lifecycle, I actually did the second step: Data Collection. 🎯

I built my own dataset called "TMDB Top Rated Movies" using their public API. 🎬

It was interesting to see how data can come from different sources: some datasets are already available in formats like CSV and JSON, while others can be retrieved from SQL databases. I also learned that data can be collected through APIs or even web scraping, depending on the use case.

Nothing fancy. Just:
🐍 Python
📡 A bunch of API calls
🔄 Figuring out how to loop through pages without breaking everything

In the end, I pulled together 10,000+ movie records: clean, structured, and ready for actual analysis or ML. 📁✅

This part felt more like real engineering than anything I have done in a notebook. 🛠️

Small step. But it's real. 🚀

dataset link: https://lnkd.in/dG7EcE5q

#MachineLearning #DataScience #Python #LearningByDoing
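The pagination loop the post describes might look roughly like this. This is a sketch, not the author's actual script; it assumes TMDB's standard `/movie/top_rated` endpoint and a registered API key:

```python
import time

import pandas as pd
import requests

API_KEY = "YOUR_TMDB_API_KEY"  # assumption: you have signed up for a TMDB key
URL = "https://api.themoviedb.org/3/movie/top_rated"

records = []
for page in range(1, 501):  # TMDB caps paginated results at page 500
    resp = requests.get(URL, params={"api_key": API_KEY, "page": page}, timeout=10)
    resp.raise_for_status()  # fail loudly instead of silently collecting junk
    records.extend(resp.json()["results"])
    time.sleep(0.25)  # be polite to the API's rate limit

df = pd.DataFrame(records)
df.to_csv("tmdb_top_rated.csv", index=False)
print(f"Collected {len(df)} movies")
```

At 20 results per page, 500 pages lands right at the 10,000-record mark the post mentions.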
**Most companies store their data the wrong way. Here's why it matters.**

When you work with data in Python, you're likely using pandas. And pandas made a very deliberate choice: it stores data in **columns**, not rows.

This isn't a technical detail. It has real consequences for your team's speed and infrastructure costs.

**Row storage (how JSON works):** Every record is a self-contained dictionary. Great for APIs and transactional systems — you always grab the full object.

**Columnar storage (how pandas works):** Every column is a contiguous list. All ages together. All names together. All cities together.

Why does this matter in practice?

→ **Speed.** When you calculate the average age of your customers, columnar storage loops over a single array of integers in memory. Row storage has to dig into each individual record, one by one. The difference at scale is enormous.

→ **Memory.** In row storage, the key "age" is repeated for every single row. In columnar storage, it's stored once. With millions of records, this adds up fast.

→ **Vectorization.** NumPy can apply operations to an entire column at C-level speed. With row-oriented data, you're stuck with Python-level loops.

→ **Compression.** Columns compress beautifully because similar values live next to each other. This is why formats like Parquet are so efficient for storage and I/O.

The rule of thumb:
- Building APIs or handling transactions? **Row-oriented is fine.**
- Running aggregations, filters, ML pipelines, or any analytical workload? **Columnar is the right tool.**

If you're frequently converting pandas DataFrames back to JSON records (`df.to_dict(orient='records')`), you're often leaving significant performance on the table.

The data format you choose upstream shapes the cost and speed of every analysis downstream. Choose deliberately.

At Arraxis, we help companies make practical decisions about how they store, structure, and use their data.

#DataEngineering #Python #Pandas #DataStrategy #Analytics #BusinessIntelligence
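The speed and vectorization points are easy to see in a few lines. A hedged micro-benchmark sketch (the data is synthetic; exact timings vary by machine):

```python
import numpy as np

n = 1_000_000

# Row-oriented: a list of dicts, like parsed JSON records.
# Note the key "age" is stored n times, once per record.
rows = [{"name": f"user{i}", "age": i % 90, "city": "NYC"} for i in range(n)]

# Column-oriented: one contiguous int32 array for the whole "age" column.
ages = np.array([r["age"] for r in rows], dtype=np.int32)

# Row storage: a Python-level loop that touches every dict.
avg_rows = sum(r["age"] for r in rows) / n

# Columnar storage: one vectorized call at C speed over a single array.
avg_cols = ages.mean()

assert np.isclose(avg_rows, avg_cols)
```

On typical hardware the vectorized mean runs one to two orders of magnitude faster than the dict loop, and the `int32` column uses a fraction of the memory of a million separate Python objects.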
**If you work with data, you know this — phone numbers are never clean.**

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like "+", "-", or even brackets. And sometimes they even come with .00 at the end because of how the data was stored or exported.

If we don't clean them properly, it becomes very difficult to use that data for analysis or communication.

In Pandas, cleaning phone number columns is actually simple once you understand the approach.

First, I usually convert the column to string format. This avoids unexpected issues, especially when numbers are stored as integers, floats, or mixed types.

After that, the main step is removing unwanted characters. Using regular expressions, we can keep only digits and remove everything else — including .00, symbols, and spaces. For example:

df['phone'] = df['phone'].astype(str).str.replace(r'[^0-9]', '', regex=True)

This one line can handle most messy formats.

One important step I always follow is standardizing the final output. No matter how the number comes, I take only the last 10 digits. This helps remove country codes like +91 and keeps the data consistent. Something like:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid. Some may be too short or too long. So I often filter numbers based on length to make sure we only keep meaningful data. If needed, I also format the numbers again in a clean and readable way. (A consolidated sketch follows below.)

What I learned from this is simple — data cleaning is not about writing complex code, it's about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy.

Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
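Putting the post's steps together, here is a minimal end-to-end sketch. The column name and 10-digit rule come from the post; the decimal-suffix guard is an extra assumption worth adding, since stripping the dot from a float-exported "9876543210.0" would otherwise leave a stray trailing zero in the digits:

```python
import pandas as pd

df = pd.DataFrame({"phone": ["+91 98765-43210", "(987) 654 3210", 9876543210.00]})

cleaned = (
    df["phone"]
    .astype(str)
    .str.replace(r"\.0+$", "", regex=True)   # drop float-export suffixes like .0 / .00
    .str.replace(r"[^0-9]", "", regex=True)  # keep digits only
    .str[-10:]                               # drop country codes, keep last 10 digits
)

# Validation: keep only numbers that end up with exactly 10 digits.
df["phone"] = cleaned.where(cleaned.str.len() == 10)
valid = df[df["phone"].notna()]
```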
I used to struggle with Pandas… until I learned these 12 functions.

Now I use them almost daily for:
✔️ Cleaning messy datasets
✔️ Exploring data faster
✔️ Building efficient workflows

If you're working with data, these are NON-NEGOTIABLE:

🔹 read_csv() – Load data instantly
🔹 head() – Quick preview
🔹 info() – Understand structure
🔹 describe() – Summary stats
🔹 isnull() – Find missing values
🔹 dropna() – Remove missing records
🔹 fillna() – Handle nulls
🔹 groupby() – Powerful aggregations
🔹 sort_values() – Organize data
🔹 value_counts() – Frequency analysis
🔹 merge() – Combine datasets
🔹 apply() – Custom logic

I've personally used these while working on data validation & analysis tasks — and they've made everything faster and cleaner. A tiny workflow chaining several of them is below.

Which Pandas function do you use the most? Or which one are you learning next?

📌 Save this post — you'll thank yourself later.

#Python #Pandas #DataAnalysis #DataScience #DataEngineering #Analytics #LearnPython #TechCareers
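As a hedged illustration, here is a mini-workflow touching all twelve. The file names, column names, and join key are made up:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")   # load data
cities = pd.read_csv("cities.csv")

print(orders.head())                 # quick preview
orders.info()                        # structure & dtypes
print(orders.describe())             # summary stats
print(orders.isnull().sum())         # missing values per column

orders["amount"] = orders["amount"].fillna(0)  # handle nulls
orders = orders.dropna(subset=["city_id"])     # drop rows missing the join key

enriched = orders.merge(cities, on="city_id")  # combine datasets
top = (
    enriched.groupby("city_name")["amount"]    # aggregate
    .sum()
    .sort_values(ascending=False)              # organize
)
print(enriched["status"].value_counts())       # frequency analysis
enriched["status"] = enriched["status"].apply(str.lower)  # custom logic
```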
From cleaning messy data 🧹 to filtering 🔍, grouping 📊, and transforming raw information into meaningful insights 📈 — Pandas is truly a powerful tool for every data enthusiast. Every dataset has a story 📖, and Pandas helps us understand it better 💡 Excited to keep learning and growing in Data Analytics & Data Science 🚀 #Python 🐍 #Pandas 🐼 #DataAnalytics 📊 #DataScience 💻 #MachineLearning 🤖 #LearningJourney 🌱
Python libraries: your essential toolkit for modern data science and engineering.

🔹 Core, Wrangling & Visualization: pandas, NumPy, Matplotlib
🌊 Streaming & Parallelism: PySpark, Kafka, Dask, Celery
🔄 Orchestration & ETL: Apache Airflow, dbt
🧠 Modeling & DL: TensorFlow, PyTorch
🚀 APIs & BI: FastAPI, Tableau

Which library is most indispensable for you?

#Python #DataScience #DataEngineering #MachineLearning #ETL
Pandas is the workhorse of EDA, but it's dangerously easy to write bad code.

If your data exploration is slow, crashing your Jupyter notebook, or throwing endless warnings, you might be falling into one of these 5 common traps. Here are the biggest Pandas anti-patterns and how to fix them:

1. The "For-Loop" Trap (df.iterrows)
❌ The Mistake: Looping through rows to apply logic. It is painfully slow because it bypasses Pandas' C backend.
✅ The Fix: Vectorization. Use np.where() or native Pandas math operations. They are optimized and run dramatically faster.

2. The .apply() Bottleneck
❌ The Mistake: Thinking .apply() is fast. It's often just a glorified, hidden for-loop under the hood.
✅ The Fix: Use built-in vectorized string (.str) or datetime (.dt) methods whenever possible.

3. Ignoring Memory Optimization
❌ The Mistake: Using pd.read_csv() on massive datasets without defining data types. Everything loads as float64 or object, eating up your RAM.
✅ The Fix: Downcast your types. Convert strings with low cardinality to category, and float64 to float32.

4. Chained Indexing (SettingWithCopyWarning)
❌ The Mistake: Subsetting data like this: df[df['A'] > 5]['B'] = 10. You don't know if you are modifying a view or a copy.
✅ The Fix: Always use .loc[] for assignments: df.loc[df['A'] > 5, 'B'] = 10.

5. Blindly Dropping Nulls
❌ The Mistake: Slapping .dropna() on your dataframe just to make the code run, destroying valuable data context.
✅ The Fix: Investigate why data is missing. Use .fillna(), interpolation, or treat "missing" as its own valuable category.

(A short sketch contrasting traps 1 and 4 with their fixes follows below.)

Efficiency in EDA isn't just about saving time; it's about writing scalable code that doesn't break in production.

What is your biggest Pandas pet peeve? Let me know below! 👇

#DataScience #Python #Pandas #DataEngineering #MachineLearning #TechTips
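A minimal sketch contrasting traps 1 and 4 with their fixes (synthetic data, illustrative column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(0, 10, 100_000), "B": 0})

# Trap 1 — the for-loop: slow, row-by-row Python iteration.
labels = []
for _, row in df.iterrows():
    labels.append("high" if row["A"] > 5 else "low")
df["label"] = labels

# Fix 1 — vectorization: one C-level pass over the whole column.
df["label"] = np.where(df["A"] > 5, "high", "low")

# Trap 4 — chained indexing: may silently modify a temporary copy.
# df[df["A"] > 5]["B"] = 10   # triggers SettingWithCopyWarning

# Fix 4 — .loc with a row mask and column label in a single indexer.
df.loc[df["A"] > 5, "B"] = 10
```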