🐍 Stop Writing "Spaghetti" Data Science Code

We’ve all been there: a Jupyter Notebook with 47 cells, variables named df2, df_final, and df_final_v2_FIXED, and a loop that takes three hours to run. Data analysis is about insights, but your code quality determines how fast (and how reliably) you get them. Here are 4 Python best practices to move from "it works on my machine" to "production-ready."

1. Embrace Vectorization (Forget the for loops)
If you’re iterating over a Pandas DataFrame with a loop, you’re likely doing it wrong. Python’s numpy and pandas are built on C; let them do the heavy lifting.
Bad: using .iterrows() to calculate a new column.
Good: vectorized operations like df['new_col'] = df['a'] * df['b']. It’s orders of magnitude faster.

2. The Magic of Method Chaining
Clean code is readable code. Instead of creating five intermediate DataFrames, chain your operations. It keeps your namespace clean and your logic linear.

```python
# Instead of multiple assignments, try this:
df_clean = (
    df
    .query('age > 18')
    .assign(name=lambda x: x['name'].str.upper())
    .groupby('region')
    .agg({'salary': 'mean'})
)
```

3. Type Hinting & Docstrings
Data types in Python are flexible, which is a blessing and a curse. Use type hints to tell your team exactly what a function expects:

def process_data(df: pd.DataFrame) -> pd.DataFrame:

It saves hours of debugging when someone tries to pass a list into a function expecting a Series.

4. Memory Management Matters
Working with "big-ish" data? Downcast your numerics (e.g., float64 to float32) and convert object columns with low cardinality to category types. Your RAM (and your IT department) will thank you. Tips 1 and 4 are sketched in the code just below this post.

The Bottom Line: Great data analysis isn't just about model accuracy; it's about the maintainability of the pipeline.

Which Python habit changed your workflow the most? Let’s swap tips in the comments! 👇

#Python #DataScience #Pandas #DataAnalysis #CodingBestPractices #MachineLearning
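To make tips 1 and 4 concrete, here is a minimal sketch. The DataFrame, column names, and sizes are invented for illustration; nothing here comes from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: names and sizes are illustrative only
df = pd.DataFrame({
    "a": np.random.rand(1_000_000),
    "b": np.random.rand(1_000_000),
    "region": np.random.choice(["north", "south", "east", "west"], size=1_000_000),
})

# Tip 1 (vectorization): one C-level operation instead of an .iterrows() loop
df["new_col"] = df["a"] * df["b"]

# Tip 4 (memory): downcast numerics, use category for low-cardinality text
df["a"] = df["a"].astype("float32")
df["region"] = df["region"].astype("category")
print(df.memory_usage(deep=True))
```

On a million rows, the vectorized multiply finishes in milliseconds where an .iterrows() loop takes orders of magnitude longer, and the two dtype conversions shrink those columns substantially: float32 halves the float column, and category stores four unique strings once plus small integer codes.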
Alan Oliveira’s Post
More Relevant Posts
Excited to share my latest article on modern data processing! I recently published "Polars: A High-Performance DataFrame Library in Python", where I dive into how Polars is emerging as a powerful alternative to traditional data manipulation libraries.

As datasets continue to grow in size and complexity, performance becomes critical. In this article, I explore how Polars addresses these challenges with a highly efficient architecture built on Apache Arrow, enabling faster computation and reduced memory usage.

Here’s what I discuss in the article:
▪️ What Polars is and why it’s gaining traction in the data ecosystem
▪️ Its core design principles, including lazy execution, which optimizes queries before execution (see the sketch after this post)
▪️ Built-in parallel processing, allowing operations to run significantly faster than traditional approaches
▪️ How Polars handles large datasets more efficiently with lower memory overhead
▪️ Practical examples showcasing its performance benefits in real-world data workflows

One of the most interesting aspects I found is how Polars shifts the mindset from step-by-step execution to an optimized query plan, making data pipelines not just faster, but smarter.

If you're working in data science, data engineering, or analytics and dealing with performance bottlenecks, Polars is definitely worth exploring.

I’d love to hear your thoughts: have you tried Polars yet? How does it compare with your current tools?

#Python #DataScience #BigData #Analytics #Polars #MachineLearning

Read the full article here:
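For a flavor of the lazy API discussed above, here is a minimal sketch. The file name and column names are hypothetical, and it assumes a recent Polars version (where the grouping method is spelled group_by):

```python
import polars as pl

# Lazy execution: scan_csv builds a query plan; nothing is read yet
lazy = (
    pl.scan_csv("sales.csv")  # hypothetical file
      .filter(pl.col("amount") > 100)
      .group_by("region")
      .agg(pl.col("amount").mean().alias("avg_amount"))
)

# collect() triggers the optimized, parallel execution of the whole plan
df = lazy.collect()
print(df)
```

Because the plan is optimized before anything runs, Polars can push the filter down into the CSV scan instead of loading every row first.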
Unleash the power of data manipulation with Python 🐍📊 Understanding Pandas, the library that makes data analysis easy! 🚀

Pandas is a popular Python library used to manipulate structured data. It provides easy-to-use data structures and functions to work with relational and labeled data. Developers can efficiently clean, transform, and analyze data, making it essential for tasks like data cleaning, exploration, and preparation for machine learning models. 💡

Step 1: Import the Pandas library
Step 2: Read data from a source
Step 3: Perform data manipulation operations like filtering, grouping, and merging
Step 4: Analyze and visualize the data 🖥️

Full code example 👇:

```python
import pandas as pd

data = pd.read_csv('data.csv')
data_filtered = data[data['column'] > 50]
data_grouped = data.groupby('category')['column'].mean()
print(data_filtered)
print(data_grouped)
```

🔍 Pro tip: Use the .loc and .iloc methods for precise data selection (a short sketch follows below).
❌ Common mistake to avoid: forgetting to check for null values before performing operations can lead to errors.
❓ What's your favorite Pandas function for data analysis? Share your thoughts!
🌐 View my full portfolio and more dev resources at tharindunipun.lk

#DataAnalysis #Python #Pandas #DataScience #CodeTips #DataManipulation #DeveloperCommunity #TechTalk #DataAnalytics #DataVisualization
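A short sketch of the pro tip, reusing the hypothetical data.csv and column names from the example above:

```python
import pandas as pd

data = pd.read_csv('data.csv')

# .iloc selects by integer position; .loc selects by label / boolean mask
first_rows = data.iloc[:5]
subset = data.loc[data['column'] > 50, ['column', 'category']]

# And the common mistake: check for nulls before filtering or aggregating
print(data['column'].isna().sum())
```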
Pandas is an open-source Python library used for data manipulation and analysis. It provides high-performance data structures and tools for working with structured (tabular) data, making it a cornerstone for data science and machine learning workflows.

While NumPy arrays are powerhouse tools for numerical computation, they struggle with a core reality of data: real-world data is messy. It has missing values, mixed types (strings next to floats!), and requires complex joins or grouping. Enter **pandas** and the **DataFrame**. 🐼

Why pandas is the "Gold Standard" for flat files:
1. Heterogeneous Data: Unlike matrices, DataFrames handle different data types across columns simultaneously.
2. R-Style Power in Python: As Wes McKinney intended, pandas lets you stay in the Python ecosystem for your entire workflow, from munging to modeling, without switching to domain-specific languages like R.
3. Wrangling at Scale: It's "missing-value friendly." Whether you're dealing with weird comments in a CSV or `NaN` values, pandas handles them gracefully during import (a tiny sketch follows after this post).

# The 3-Line Power Move: Importing a flat file is as simple as:

```python
import pandas as pd

# Load the data
data = pd.read_csv('your_file.csv')

# See the first 5 rows instantly
print(data.head())
```

The Big Takeaway: As Hadley Wickham famously noted: "A matrix has rows and columns. A data frame has observations and variables." In the world of Data Science, we aren't just looking at numbers; we're looking at **observations**.

Using `pd.read_csv()` isn't just a shortcut; it's best practice for building a robust, reproducible data pipeline.

#DataEngineering #Python #Pandas #DataAnalysis #MachineLearning
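A tiny sketch of that messiness in action; the CSV content is invented for illustration:

```python
import pandas as pd
from io import StringIO

# A hypothetical messy CSV: mixed types and missing values
raw = StringIO("id,score,comment\n1,3.5,ok\n2,,needs review\n3,4.0,")
df = pd.read_csv(raw)

print(df.dtypes)        # floats next to strings: heterogeneous columns
print(df.isna().sum())  # the missing cells arrive as NaN, not as errors
```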
Wide format. Long format. If you have worked with data in Python, you have needed to convert between them constantly.

pandas melt() and pivot() are the two functions that handle this, and they are exact opposites of each other.

melt() takes columns and turns them into rows — essential for feeding data into visualization libraries and statistical tools that expect long format.

pivot() takes row values and turns them into columns — essential for building readable summary tables and reports.

Understanding both, knowing when to use each, and knowing when to reach for pivot_table() instead of pivot() are the data wrangling fundamentals that make every downstream analysis cleaner and faster (a quick sketch follows below).

Read the full post here: https://lnkd.in/eGcsiB5C

#Python #Pandas #DataScience #DataAnalysis #DataEngineering #Analytics
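A minimal sketch of the round trip, with invented data:

```python
import pandas as pd

# Hypothetical wide-format data: one column per year
wide = pd.DataFrame({
    "city": ["NYC", "LA"],
    "2023": [100, 80],
    "2024": [110, 90],
})

# melt(): columns become rows (wide -> long)
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# pivot(): row values become columns (long -> wide)
back = long.pivot(index="city", columns="year", values="sales")

# pivot_table(): reach for this when index/column pairs repeat and need aggregating
summary = long.pivot_table(index="city", values="sales", aggfunc="mean")
print(long, back, summary, sep="\n\n")
```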
🚀 Day 4/20 — Python for Data Engineering
Reading & Writing Files (CSV / JSON)

In data engineering, data rarely comes clean. 👉 It usually comes from files, logs, exports, and APIs. So the ability to read and write data is fundamental.

🔹 Why File Handling Matters
We often ingest raw data, process it, and store the cleaned output. 👉 Python helps us do all of this easily.

🔹 Reading a CSV File

```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())
```

👉 Loads structured data into a DataFrame

🔹 Reading a JSON File

```python
import json

with open("data.json") as f:
    data = json.load(f)
print(data)
```

👉 Useful for API responses and semi-structured data

🔹 Writing Data to a File

```python
df.to_csv("output.csv", index=False)
```

👉 Save processed data for further use

🔹 Where You’ll Use This
Data ingestion pipelines, data transformation workflows, exporting results, logging and backups.

💡 Quick Summary
Python allows you to read data from multiple formats, process it, and write it back efficiently (a JSON-writing sketch follows below).

💡 Something to remember: data engineering starts with reading data… and ends with writing it in a better form.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
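The post shows reading JSON but not writing it; here is a small, hypothetical sketch of the writing side to close the loop:

```python
import json
import pandas as pd

df = pd.read_csv("data.csv")             # hypothetical input file
records = df.to_dict(orient="records")   # DataFrame -> list of dicts

# Write the processed data back out as JSON
with open("output.json", "w") as f:
    json.dump(records, f, indent=2)
```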
Learn how to spot hidden bottlenecks in your Pandas code and avoid costly row-wise operations — Ibrahim Salami shows us a better (and faster) way in his new Python tutorial.
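The tutorial itself isn't reproduced here, but as a generic illustration of the row-wise trap it targets (all names invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 7.5], "qty": [3, 1, 12]})

# Row-wise: a hidden bottleneck that grows with the frame
df["total_slow"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: the same result in one C-level operation
df["total"] = df["price"] * df["qty"]

# Even conditional logic can skip the row loop
df["bulk"] = np.where(df["qty"] >= 10, "yes", "no")
```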
Filtering rows in pandas is one of the first skills every data scientist needs to master, and there are more ways to do it than most beginners realize.

Boolean indexing is the foundation. isin() replaces messy OR chains. between() cleans up range filters. loc[] handles filtering and column selection together. query() makes complex conditions readable at a glance.

Each method has its place. Knowing which one to reach for in which situation is what makes your data analysis code clean, efficient, and easy to maintain (a quick sketch of all five follows below).

Read the full post here: https://lnkd.in/eRnVAxN4

#Python #Pandas #DataScience #DataAnalysis #DataEngineering #Analytics
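A compact sketch of all five methods side by side, on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "age": [22, 35, 29, 41],
    "city": ["NYC", "LA", "NYC", "SF"],
})

over_30 = df[df["age"] > 30]                          # boolean indexing
coastal = df[df["city"].isin(["NYC", "LA"])]          # isin() vs OR chains
mid_age = df[df["age"].between(25, 40)]               # inclusive range filter
names = df.loc[df["age"] > 30, ["name", "city"]]      # filter + column select
nyc = df.query("age > 25 and city == 'NYC'")          # readable complex logic
```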
https://lnkd.in/gmFmfZR9

This article raises some great points about how technically correct but inefficient code can keep a program from performing, and reading, the way the programmer envisioned. Instead of your Pandas code looking and feeling as sleek and smooth as a Lambo, it looks and runs like a Ford Pinto in desperate need of repairs.

#Pandas #Python #DataAnalytics #DataAnalysis
A week ago I had the opportunity to give one of my first technical presentations. The topic? Pandas 3.0, a release I was only vaguely familiar with before. I used the prep time to really dive into the details.

For anyone who wants a quick summary of what changed and why in the world’s most popular data analysis library, here are the three big ones:

1️⃣ StringDType: a dedicated, optimized string type that finally replaces object arrays for text.
Why? Previously, string columns were stored as Python objects in memory, which was slow and inefficient. StringDType uses a PyArrow representation (PyArrow being the new dependency), making operations on text data significantly faster and more memory-efficient. An interesting nugget: string columns can now report the exact size of the table (as seen by df.info()), because pandas no longer has to reach into Python memory to fetch each object’s size.

2️⃣ pd.col: a clean way to refer to columns in methods like assign() or groupby().
Why? Before, you had to use string column names or workarounds that could break with complex expressions. One example I gave is the subtle errors that can arise from lambda expressions, since they capture by reference, while pd.col captures by value. pd.col provides a clear, explicit, and IDE-friendly way to reference columns, making code more readable and less error-prone.

3️⃣ Copy-on-Write (CoW): safer and more predictable. Slices no longer silently mutate the original.
Why? Historically, pandas would sometimes modify the original DataFrame when you changed a slice, a common source of subtle bugs and warnings (namely SettingWithCopyWarning). CoW ensures that modifications only affect the intended copy, making code behave more intuitively and eliminating "silent mutation" surprises. Speed is not hindered unnecessarily: copies are created only once a write operation is detected, and reads work with references before that. (A small CoW sketch follows after this post.)

After the presentation, I did something uncomfortable but invaluable: I watched the recording of myself. And that was definitely a wake-up call. It gave me more insight than any external feedback ever could. I took notes on things I want to work on:
- My energy was a bit too playful/cheerful at times, which undercut the technical depth.
- I rushed through the introduction because I thought "everyone knows what Pandas is"; if I decide to include it, I should own it, not skip it.
- A small physical habit (lifting up my glasses) became a distraction on camera.
- I filled every silence with "uhm" when just a pause would have been more confident.

None of this was easy to watch. But it was the most honest feedback I’ll ever get. Presenting is a skill, not a talent. And the only way to improve is to watch yourself do it, cringe, and take notes.

If you’ve never reviewed a recording of yourself presenting, try it. It’s humbling. And incredibly useful.

#PublicSpeaking #Pandas30 #DataScience #Python
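As a small illustration of the CoW behavior described in point 3, here is a sketch. It uses the opt-in flag from pandas 2.x; in 3.0 this behavior is simply the default, so the flag line becomes unnecessary:

```python
import pandas as pd

# Opt in explicitly on pandas 2.x; pandas 3.0 behaves this way by default
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df[df["a"] > 1]   # cheap: shares data with df until a write happens
subset["a"] = 0            # the write triggers the copy; df stays untouched

print(df)  # original unchanged: no silent mutation, no SettingWithCopyWarning
```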
🚀 Day 342 of solving 365 medium questions on LeetCode! 🔥

Today’s challenge: “3653. XOR After Range Multiplication Queries I”

✅ Problem: You are given an integer array nums and a list of queries. Each query provides a starting index l, an ending index r, a step size k, and a multiplier v. For each query, you must multiply the elements in the range from l to r by v (modulo 10^9 + 7), stepping by k each time. Return the final bitwise XOR of all elements in the array after all queries are processed.

✅ Approach (Array Simulation)
Since this is the first version of the problem ("Queries I"), the constraints allow a direct simulation approach!

Apply Queries: I iterate through each query, unpacking the variables l, r, k, and v. I use a nested loop with Python’s built-in range(l, r + 1, k) to handle the specific step logic exactly.

Modulo Math: For each target index i in that hopped sequence, I multiply the current value nums[i] by v and immediately apply the modulo 10^9 + 7 to prevent massive integer overflows during subsequent queries.

The XOR Sum: Once all queries are processed and the array is finalized, I initialize a res = 0 variable. A final, simple pass through the nums array applies the bitwise XOR operator (^=) to accumulate and return the answer.

✅ Key Insight
Python’s range function with a step argument makes array-hopping logic beautifully concise. Instead of a messy while loop that manually tracks and increments the index by k, a single for loop naturally handles the boundaries and the exact hops in one clean, highly readable line!

✅ Complexity
Time: O(Q · N/K + N), where Q is the number of queries, N is the length of the array, and K is the step size. In the worst case we iterate over segmented portions of the array for each query, followed by one final O(N) pass to compute the XOR sum.
Space: O(1). We modify the given nums array strictly in place and use only a single integer variable (res) for the final calculation, requiring zero extra auxiliary data structures.

🔍 Python solution attached! 🔥 Flexing my coding skills until recruiters notice!

#LeetCode365 #Simulation #BitManipulation #Arrays #Python #ProblemSolving #DSA #Coding #SoftwareEngineering
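The attached solution image doesn't survive in text form; the following sketch reconstructs the approach exactly as described above (the function name is mine):

```python
MOD = 10**9 + 7

def xor_after_queries(nums: list[int], queries: list[list[int]]) -> int:
    # Direct simulation: apply every query with a stepped range
    for l, r, k, v in queries:
        for i in range(l, r + 1, k):
            nums[i] = nums[i] * v % MOD
    # Final pass: fold the array into a single XOR accumulator
    res = 0
    for x in nums:
        res ^= x
    return res
```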