Most companies store their data the wrong way. Here's why it matters.

When you work with data in Python, you're likely using pandas. And pandas made a very deliberate choice: it stores data in columns, not rows. This isn't just a technical detail. It has real consequences for your team's speed and infrastructure costs.

Row storage (how JSON works): every record is a self-contained dictionary. Great for APIs and transactional systems, where you always grab the full object.

Columnar storage (how pandas works): every column is a contiguous array. All ages together. All names together. All cities together.

Why does this matter in practice?

→ Speed. When you calculate the average age of your customers, columnar storage loops over a single array of integers in memory. Row storage has to dig into each individual record, one by one. The difference at scale is enormous.
→ Memory. In row storage, the key "age" is repeated for every single row. In columnar storage, it's stored once. With millions of records, this adds up fast.
→ Vectorization. NumPy can apply operations to an entire column at C-level speed. With row-oriented data, you're stuck with Python-level loops.
→ Compression. Columns compress beautifully because similar values live next to each other. This is why formats like Parquet are so efficient for storage and I/O.

The rule of thumb:
- Building APIs or handling transactions? Row-oriented is fine.
- Running aggregations, filters, ML pipelines, or any analytical workload? Columnar is the right tool.

If you're frequently converting pandas DataFrames back to JSON records (df.to_dict(orient='records')), you're often leaving significant performance on the table. The data format you choose upstream shapes the cost and speed of every analysis downstream. Choose deliberately.

At Arraxis, we help companies make practical decisions about how they store, structure, and use their data.

#DataEngineering #Python #Pandas #DataStrategy #Analytics #BusinessIntelligence
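To make that concrete, here is a minimal sketch of the same average-age calculation over row-oriented records versus a pandas column (the sample data is made up; exact timings depend on your machine):

import numpy as np
import pandas as pd

# Row-oriented: a list of dicts, the shape JSON records arrive in
records = [{"name": f"user{i}", "age": 20 + i % 50, "city": "Berlin"} for i in range(100_000)]

# Columnar: the same data as a pandas DataFrame
df = pd.DataFrame(records)

# Row storage: a Python-level loop that touches every record
avg_row = sum(r["age"] for r in records) / len(records)

# Columnar storage: one vectorized pass over a contiguous integer array
avg_col = df["age"].mean()

assert np.isclose(avg_row, avg_col)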
If you work with data, you know this: phone numbers are never clean.

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like "+", "-", or even brackets. And sometimes, they even come with .00 at the end because of how data is stored or exported. And if we don't clean them properly, it becomes very difficult to use that data for analysis or communication.

In Pandas, cleaning phone number columns is actually simple once you understand the approach.

First, I usually convert the column to string format. This avoids unexpected issues, especially when numbers are stored as integers, floats, or mixed types.

After that, the main step is removing unwanted characters. Using regular expressions, we can keep only digits and remove everything else, including .00, symbols, and spaces. For example:

df['phone'] = df['phone'].astype(str).str.replace(r'[^0-9]', '', regex=True)

This one line can handle most messy formats.

One important step I always follow is standardizing the final output. No matter how the number comes, I take only the last 10 digits. This helps remove country codes like +91 and keeps the data consistent. Something like:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid. Some may be too short or too long. So I often filter numbers based on length to make sure we only keep meaningful data. If needed, I also format the numbers again in a clean and readable way.

What I learned from this is simple: data cleaning is not about writing complex code, it's about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy.

Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
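Pulling those steps into one runnable sketch (the column name and sample values are made up). One caveat worth adding: numbers exported as floats need the decimal part stripped before the digits-only regex, otherwise the trailing zeros survive as extra digits.

import pandas as pd

df = pd.DataFrame({"phone": ["+91 98765-43210", "(987) 654 3210", 9876543210.0, None]})

cleaned = (
    df["phone"]
    .astype(str)
    .str.replace(r"\.0+$", "", regex=True)   # drop ".0" / ".00" left over from float exports
    .str.replace(r"[^0-9]", "", regex=True)  # keep digits only
    .str[-10:]                               # last 10 digits drops country codes like 91
)

# Validation: keep only plausible 10-digit numbers, everything else becomes NaN
df["phone_clean"] = cleaned.where(cleaned.str.len() == 10)
print(df)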
Data Engineering starts with robust Data Ingestion. 🕸️

If you are a data analyst relying on pre-packaged Kaggle datasets, you are missing out on the most valuable data available: the live web. However, writing web scrapers from scratch for every project is incredibly frustrating. Between handling messy HTML, managing rate limits, and formatting the output, it's a massive time sink.

I hate manual data entry, so I built a production-ready Python scraping script to automate the collection process. Instead of fighting with boilerplate code, this script handles the heavy lifting and directly exports clean, structured data into CSV or JSON formats, ready to be ingested into a database or analyzed in Pandas.

#Python #DataEngineering #WebScraping #DataAnalytics #Automation
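As a rough illustration of this kind of ingestion step (not the author's actual script; the URL, CSS selectors, and output file are hypothetical), a small sketch using requests and BeautifulSoup:

import csv
import time

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/listings?page={page}"   # hypothetical target
HEADERS = {"User-Agent": "data-ingestion-demo/0.1"}

rows = []
for page in range(1, 4):
    resp = requests.get(URL.format(page=page), headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select("div.listing"):          # hypothetical selector
        rows.append({
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })
    time.sleep(1)  # basic rate limiting between requests

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)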
If Excel feels limiting… Pandas is where data starts to listen to you.

Most professionals know what to analyze, but struggle with how to handle messy data at scale. This visual breaks down why Pandas (Python) is a game-changer:

👉 It's built for data manipulation & analysis
👉 Works across formats (CSV, Excel, SQL)
👉 Handles missing data, transformations, and aggregations seamlessly

And it all revolves around two simple structures:
▸ Series → one-dimensional data
▸ DataFrame → table-like, rows + columns (your Excel on steroids)

💡 What you can actually do with Pandas:
▸ Read data from multiple sources
▸ Explore it quickly (head(), info(), describe())
▸ Filter & select specific rows/columns
▸ Clean messy data (nulls, duplicates)
▸ Aggregate insights (groupby, sum, mean)
▸ Apply custom logic with functions

💡 Key insight: Pandas isn't just a tool, it's a workflow:
Load → Explore → Clean → Analyze → Output
Master this flow, and you can handle almost any dataset.

🔧 Practical takeaway: instead of jumping into dashboards immediately:
▸ Clean your data first
▸ Validate assumptions early
▸ Use Pandas to create a reliable dataset

📊 Real-world impact: better preprocessing = faster dashboards, fewer errors, and stronger insights.

🚀 The best analysts don't just visualize data… they prepare it right before it's seen.

#Python #Pandas #DataAnalytics #DataScience #DataCleaning #BusinessIntelligence #AnalyticsSkills
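A small sketch of that Load → Explore → Clean → Analyze → Output flow (file and column names are made up):

import pandas as pd

# Load
df = pd.read_csv("sales.csv")          # hypothetical file

# Explore
print(df.head())
print(df.describe())

# Clean
df = df.drop_duplicates()
df = df.dropna(subset=["amount"])

# Analyze
summary = df.groupby("region")["amount"].agg(["sum", "mean"])

# Output
summary.to_csv("region_summary.csv")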
Stop wasting 4 hours on EDA. Do it in 4 lines of code. ⏳

Exploratory Data Analysis (EDA) is the most critical step in any data project, but let's be honest: writing the same df.describe(), plt.scatter(), and sns.heatmap() code over and over is a soul-crushing time sink. In the industry, we use AutoEDA libraries to get 80% of the insights with 2% of the effort. 🚀

Here are my top 3 picks for your toolkit:

1️⃣ ydata-profiling (formerly Pandas Profiling): the "Gold Standard." It generates a massive, interactive HTML report covering correlations, missing values, and detailed stats for every column.

2️⃣ Sweetviz: the "Comparison King." Perfect for spotting Data Drift. If you need to see exactly how your Train set differs from your Test set, this is the tool.

3️⃣ AutoViz: the "Speed Demon." It automatically identifies the most important features and selects the best charts (Scatter, Box, Violin) for you. It's incredibly fast, even on larger datasets.

The Reality Check: ⚠️ Are these used for real-time streaming data? Usually, no. They are "batch" tools meant for the initial discovery phase or sanity-checking a new data dump. For live monitoring, you're better off with Grafana or Great Expectations.

But for your next CSV or SQL export? Don't start from scratch. Automate the boring stuff so you can focus on the actual strategy.

Which one is your go-to? Or are you still team Matplotlib/Seaborn for everything? 👇

#DataScience #Python #MachineLearning #Analytics #Efficiency #CodingTips
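For the "4 lines" claim, a minimal sketch of the first two tools (the input file is hypothetical; check each library's docs for current options):

import pandas as pd
import sweetviz as sv
from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")                       # hypothetical input
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# ydata-profiling: one interactive HTML report for the whole dataset
ProfileReport(df, title="EDA Report").to_file("eda_report.html")

# Sweetviz: compare train vs. test to spot drift
sv.compare([train, "Train"], [test, "Test"]).show_html("train_vs_test.html")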
📌 Python list methods: what actually happens behind the scenes.

We think:
→ append() returns a new list ❌
→ copy() creates a deep copy ❌
→ sort() gives a new sorted output ❌

Reality? Completely different. And this is exactly why small mistakes turn into big data bugs. Let's fix that 👇

🔹 append(x) → adds item to the end
💡 Modifies the original list
🚫 Returns: None

🔹 insert(i, x) → adds item at a specific index
💡 Keeps order control
🚫 Returns: None

🔹 extend(iterable) → adds multiple items
💡 Used in merging datasets
🚫 Returns: None

🔹 pop([i]) → removes + returns element
💡 Useful in pipelines & buffering
✅ Returns: removed item

🔹 remove(x) → removes first occurrence
⚠️ Error if not found
🚫 Returns: None

🔹 copy() → creates a shallow copy
⚠️ Nested objects still linked
✅ Returns: new list

🔹 count(x) → counts occurrences
💡 Helpful in validations
✅ Returns: integer

🔹 index(x) → finds position of value
⚠️ Error if not found
✅ Returns: index

🔹 reverse() → reverses list (in-place)
🚫 Returns: None

🔹 sort() → sorts list (in-place)
⚠️ Doesn't return a new list
🚫 Returns: None

• Most list methods modify the original list
• Only a few return values:
👉 pop()
👉 count()
👉 index()
👉 copy()

🔥 If you assume a return value where there is none… your pipeline will silently break.

👉 Which list method confused you the most before this?

#Python #DataEngineering #LearnPython #CodingTips #ETL #DataAnalytics #TechContent
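A quick self-check covering the two traps above (returns-None and shallow copy):

nums = [3, 1, 2]

result = nums.sort()               # sorts in place
print(result)                      # None  <- the classic silent bug
print(nums)                        # [1, 2, 3]
print(sorted(nums, reverse=True))  # sorted() DOES return a new list

last = nums.pop()                  # one of the few methods that returns something
print(last, nums)                  # 3 [1, 2]

a = [[1, 2], [3, 4]]
b = a.copy()                       # shallow copy: the inner lists are still shared
b[0].append(99)
print(a[0])                        # [1, 2, 99]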
Working with large datasets in Pandas taught me one simple lesson: memory matters more than we think.

In the beginning, I used to load dataframes without even thinking about how much memory they consume. Everything looked fine… until one day my script slowed down, and sometimes even crashed. That's when I realized it's not always about the data size, it's about how efficiently we handle it.

One simple habit that changed things for me is checking the memory usage of a dataframe. In Pandas, you can do this very easily:

df.info()

This gives a quick summary of your dataframe, including memory usage. But if you want a more detailed view, you can use:

df.memory_usage(deep=True)

This shows how much memory each column is using. Adding deep=True helps you get accurate results, especially for object-type columns like strings.

What I found interesting is that sometimes a few columns consume most of the memory. Object columns especially: they silently take up a lot of space.

Once you know where the memory is going, you can start optimizing:
* Convert object columns to category if they have repeated values
* Use smaller data types like int32 instead of int64
* Drop unnecessary columns early

These small steps make a big difference, especially when working with large datasets.

For me, this was a small learning, but a very powerful one. Now, before doing any heavy operations, I take a few seconds to check memory usage, and it saves me minutes (sometimes hours) later.

If you're working with Pandas, give this a try. It might look small, but it can completely change how your code performs.

#BigData #Python #Pandas #DataAnalytics
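A small before/after sketch of those optimizations on synthetic data (the exact savings depend on your columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": np.random.choice(["Berlin", "Paris", "Madrid"], size=1_000_000),
    "count": np.random.randint(0, 100, size=1_000_000),
})

print(df.memory_usage(deep=True))           # per-column usage; deep=True measures the strings

df["city"] = df["city"].astype("category")  # repeated strings -> category codes
df["count"] = df["count"].astype("int32")   # int64 -> int32 halves the numeric column

print(df.memory_usage(deep=True))           # usually a large reduction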
Spark doesn't run your code. It builds a plan.

A lot of Spark confusion comes from thinking it executes "line by line" like a normal program. In reality, Spark mostly does this:

Your code -> logical plan -> optimized logical plan -> physical execution plan

So when you write PySpark or Spark SQL, Spark isn't "running Python" or "running SQL". It's building a plan for a distributed engine to execute.

Here's the simplified mental model I use:

1) Transformations build the plan (lazy)
select, filter, join, groupBy... These don't immediately run a job. They describe what should happen.

2) Actions trigger execution
count, show, collect, write… This is when Spark says: "ok, now I need to execute the plan".

3) The optimizer rewrites your work
Before running, Spark tries to make it cheaper:
• push filters earlier
• prune unused columns
• reorder operations
• pick join strategies

4) The physical plan is what actually runs on the cluster
This is where you'll see the real cost drivers:
• join strategy (broadcast vs shuffle)
• number of stages/tasks
• shuffles, scans, exchanges
• partitioning decisions

That's why two bits of Spark code that look similar can behave completely differently.

Takeaway: if you can read the plan, you can explain most performance issues without guessing.

Share your favourite Spark "aha" moment in the comments.

#Spark #PySpark #SparkSQL #DataEngineering #BigData #Databricks #PerformanceTuning #SQL
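A minimal PySpark sketch of that lifecycle (paths and column names are hypothetical): the transformations only build the plan, explain() prints it, and the action at the end triggers execution.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # hypothetical paths
customers = spark.read.parquet("/data/customers")

# Transformations only: no job has run yet, Spark is just building the plan
result = (
    orders.filter(F.col("status") == "PAID")
          .join(customers, "customer_id")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)

# Inspect the logical and physical plans without triggering execution
result.explain(mode="formatted")   # on older Spark versions: result.explain(True)

# The action finally executes the physical plan on the cluster
result.show()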
Pandas is the default. Polars is the shift. The question isn't which is "better", it's which fits your workload.

Pandas has been the default DataFrame library for over a decade. But as datasets grow and pipelines move toward production, its single-threaded, eager execution model starts to show cracks. That's where Polars enters.

Pandas: the familiar default
→ Single-threaded, eager execution: processes data immediately, step by step
→ Massive ecosystem: every tutorial, every library, every StackOverflow answer
→ Ideal for exploration, prototyping, and datasets that fit comfortably in memory
→ Limitation: performance degrades on larger datasets. Memory usage can be 5-10x the raw data size.

Polars: the performance shift
→ Multi-threaded, lazy evaluation: builds a query plan and optimizes before executing
→ Written in Rust: significantly faster on aggregations, joins, and group-bys
→ Native Parquet support and Apache Arrow columnar memory format
→ Limitation: smaller ecosystem. Fewer tutorials. Some libraries still expect Pandas DataFrames.

Where each fits:
→ Exploration and prototyping → Pandas (ecosystem wins)
→ Production transforms on medium-large data → Polars (speed wins)
→ ML workflows with scikit-learn → Pandas (integration wins)
→ CI/CD and automated pipelines → Polars (performance wins)
→ SQL analytics → DuckDB (Ep 29)

The decision rule: the shift isn't "replace Pandas." It's knowing when the workload has outgrown single-threaded, eager execution and choosing the right tool instead of the default one.

Where in your stack are you treating DataFrames like scripts, when they should be treated like query plans?

#DataEngineering #Python #DataArchitecture
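To make the "query plan, not script" framing concrete, a minimal lazy Polars sketch (file and column names are made up; older Polars releases spell group_by as groupby):

import polars as pl

# Lazy: scan_parquet builds a query plan instead of loading the file eagerly
lazy = (
    pl.scan_parquet("events.parquet")
      .filter(pl.col("country") == "DE")
      .group_by("user_id")
      .agg(pl.col("revenue").sum().alias("total_revenue"))
)

print(lazy.explain())   # the optimized plan: filter and column pruning pushed into the scan

df = lazy.collect()     # execution happens here, multi-threaded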
🐋 Meet Narwhals: the missing layer in your data stack!

If you've ever struggled with switching between pandas, Polars, PySpark or DuckDB, this is for you 👉 Narwhals is a lightweight, extensible compatibility layer that lets you write dataframe-agnostic code once, and run it everywhere.

💡 Why it's a game changer:
✅ Write code once, run across multiple dataframe engines
✅ Zero dependencies, stays lightweight
✅ Familiar API inspired by Polars
✅ Works with both eager and lazy execution
✅ Keeps native performance, no heavy abstraction cost

🔥 Imagine building libraries or pipelines that:
• Automatically adapt to pandas, Polars, or Spark
• Scale from local to distributed without rewrites
• Stay clean, typed, and maintainable

This is a big step toward true interoperability in the Python data ecosystem. If you're building data tools, this is definitely worth exploring.

📚 Learn more: https://lnkd.in/esq7vwJi
🔗 GitHub repo: https://lnkd.in/eMBNQ8eQ

#DataEngineering #Python #DataScience #BigData #Analytics #OpenSource #AI #MachineLearning
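A rough sketch of the write-once pattern, based on Narwhals' documented from_native/to_native API (the function, column names, and threshold are made up; check the links above for current details):

import narwhals as nw

def top_spenders(df_native, min_total=100):
    # df_native can be a pandas, Polars, or other supported DataFrame
    df = nw.from_native(df_native)
    result = (
        df.group_by("customer_id")
          .agg(nw.col("amount").sum().alias("total"))
          .filter(nw.col("total") >= min_total)
    )
    return result.to_native()   # hand back the caller's native frame type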
🚀 🔥 Stop struggling with dirty data: master Python data cleaning in minutes (2026)

Most people learn Python… but fail at real data work ❌ because they ignore ONE skill 👇
👉 Data Cleaning ⚡

Here's your cheat sheet to become a PRO:

🧹 Fix Missing Data
df.isnull().sum()
df.ffill()   (fillna(method='ffill') is deprecated in recent pandas)
df.dropna()

🧹 Remove Duplicates
df.drop_duplicates()

🧹 Understand Your Data
df.head()
df.info()
df.describe()

🧹 Clean Columns
df.rename(columns={'old':'new'})
df.astype({'col':'int'})

🧹 Filter Smartly
df.query("salary > 50000")
df[df['role'].isin(['DE','DS'])]

🧹 Merge Like a Pro
pd.merge(df1, df2, on='id')
df.groupby('team').agg({'salary':'mean'})

🎯 Reality Check (2026):
👉 80% of time = cleaning data
👉 20% of time = analysis
If your data is messy, your results are wrong ❌

Be honest: do you enjoy data cleaning or hate it? 😅👇

#Python #Pandas #DataCleaning #DataEngineering #DataScience #MachineLearning #Analytics #LearnPython #TechCareers #Coding #BigData
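The snippets above assume a DataFrame already loaded; here is the same cheat sheet pulled into one small end-to-end sketch (file and column names are made up):

import pandas as pd

df = pd.read_csv("employees.csv")                     # hypothetical file

# Understand your data
print(df.info())
print(df.describe())

# Fix missing data and duplicates
print(df.isnull().sum())
df = df.ffill().dropna().drop_duplicates()

# Clean columns
df = df.rename(columns={"emp_role": "role"}).astype({"salary": "int"})

# Filter and aggregate
high_paid = df.query("salary > 50000")
avg_by_team = df.groupby("team").agg({"salary": "mean"})
print(high_paid.head())
print(avg_by_team)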