📌 𝗣𝘆𝘁𝗵𝗼𝗻 𝗟𝗶𝘀𝘁 𝗠𝗲𝘁𝗵𝗼𝗱𝘀 — 𝘄𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗵𝗮𝗽𝗽𝗲𝗻𝘀 𝗯𝗲𝗵𝗶𝗻𝗱 𝘁𝗵𝗲 𝘀𝗰𝗲𝗻𝗲𝘀

We think:
→ 𝗮𝗽𝗽𝗲𝗻𝗱() returns a new list ❌
→ 𝗰𝗼𝗽𝘆() creates a deep copy ❌
→ 𝘀𝗼𝗿𝘁() gives a new sorted output ❌

𝗥𝗲𝗮𝗹𝗶𝘁𝘆? 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗲𝗹𝘆 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁. And this is exactly why 𝘀𝗺𝗮𝗹𝗹 𝗺𝗶𝘀𝘁𝗮𝗸𝗲𝘀 𝘁𝘂𝗿𝗻 𝗶𝗻𝘁𝗼 𝗯𝗶𝗴 𝗱𝗮𝘁𝗮 𝗯𝘂𝗴𝘀. Let’s fix that 👇

🔹 𝗮𝗽𝗽𝗲𝗻𝗱(x) → Adds item to the end 💡 Modifies original list 🚫 Returns: None
🔹 𝗶𝗻𝘀𝗲𝗿𝘁(i, x) → Adds item at a specific index 💡 Keeps order control 🚫 Returns: None
🔹 𝗲𝘅𝘁𝗲𝗻𝗱(iterable) → Adds multiple items 💡 Used in merging datasets 🚫 Returns: None
🔹 𝗽𝗼𝗽([i]) → Removes + returns element 💡 Useful in pipelines & buffering ✅ Returns: removed item
🔹 𝗿𝗲𝗺𝗼𝘃𝗲(x) → Removes first occurrence ⚠️ Error if not found 🚫 Returns: None
🔹 𝗰𝗼𝗽𝘆() → Creates a shallow copy ⚠️ Nested objects still linked ✅ Returns: new list
🔹 𝗰𝗼𝘂𝗻𝘁(x) → Counts occurrences 💡 Helpful in validations ✅ Returns: integer
🔹 𝗶𝗻𝗱𝗲𝘅(x) → Finds position of value ⚠️ Error if not found ✅ Returns: index
🔹 𝗿𝗲𝘃𝗲𝗿𝘀𝗲() → Reverses list (in-place) 🚫 Returns: None
🔹 𝘀𝗼𝗿𝘁() → Sorts list (in-place) ⚠️ Doesn’t return a new list 🚫 Returns: None

• Most list methods modify the original list
• Only a few return values: 👉 𝗽𝗼𝗽() 👉 𝗰𝗼𝘂𝗻𝘁() 👉 𝗶𝗻𝗱𝗲𝘅() 👉 𝗰𝗼𝗽𝘆()

🔥 If you assume a return value where there is none… your pipeline will silently break.

👉 Which list method confused you the most before this?

#Python #DataEngineering #LearnPython #CodingTips #ETL #DataAnalytics #TechContent
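A minimal sketch of the return-value gotchas above (variable names are just for illustration):

import copy

nums = [3, 1, 2]
result = nums.sort()          # sorts nums in place
print(result)                 # None: sort() does not return the sorted list
print(nums)                   # [1, 2, 3]

sorted_copy = sorted(nums)    # the built-in sorted() is what returns a new list

matrix = [[1, 2], [3, 4]]
shallow = matrix.copy()
shallow[0].append(99)         # the nested list is shared with the original
print(matrix[0])              # [1, 2, 99]: shallow copies keep nested objects linked

deep = copy.deepcopy(matrix)  # use copy.deepcopy() when you need fully independent data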
Python List Methods: What They Do and What They Return
More Relevant Posts
🚀 🔥 𝑺𝒕𝒐𝒑 𝑺𝒕𝒓𝒖𝒈𝒈𝒍𝒊𝒏𝒈 𝒘𝒊𝒕𝒉 𝑫𝒊𝒓𝒕𝒚 𝑫𝒂𝒕𝒂 — 𝑴𝒂𝒔𝒕𝒆𝒓 𝑷𝒚𝒕𝒉𝒐𝒏 𝑫𝒂𝒕𝒂 𝑪𝒍𝒆𝒂𝒏𝒊𝒏𝒈 𝒊𝒏 𝑴𝒊𝒏𝒖𝒕𝒆𝒔 (2026)

Most people learn Python… but fail at real data work ❌
Because they ignore ONE skill 👇
👉 Data Cleaning ⚡

Here’s your cheat sheet to become a PRO:

🧹 Fix Missing Data
df.isnull().sum()
df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
df.dropna()

🧹 Remove Duplicates
df.drop_duplicates()

🧹 Understand Your Data
df.head()
df.info()
df.describe()

🧹 Clean Columns
df.rename(columns={'old': 'new'})
df.astype({'col': 'int'})

🧹 Filter Smartly
df.query("salary > 50000")
df[df['role'].isin(['DE', 'DS'])]

🧹 Merge Like a Pro
pd.merge(df1, df2, on='id')
df.groupby('team').agg({'salary': 'mean'})

🎯 Reality Check (2026):
👉 80% of time = Cleaning data
👉 20% of time = Analysis
If your data is messy → your results are wrong ❌

💬 Be honest — do you enjoy data cleaning or hate it? 😅👇

#Python #Pandas #DataCleaning #DataEngineering #DataScience #MachineLearning #Analytics #LearnPython #TechCareers #Coding #BigData
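A small end-to-end sketch of that cheat sheet in action (the DataFrame and column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'role': ['DE', 'DS', 'DE', None],
    'salary': [60000, 75000, 60000, 52000],
    'team': ['data', 'data', 'data', 'ml'],
})

df = df.drop_duplicates()          # remove exact duplicate rows
df['role'] = df['role'].ffill()    # forward-fill missing roles
high_paid = df.query("salary > 50000")
print(high_paid.groupby('team').agg({'salary': 'mean'}))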
𝗜𝗳 𝘆𝗼𝘂 𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗱𝗮𝘁𝗮, 𝘆𝗼𝘂 𝗸𝗻𝗼𝘄 𝘁𝗵𝗶𝘀 — 𝗽𝗵𝗼𝗻𝗲 𝗻𝘂𝗺𝗯𝗲𝗿𝘀 𝗮𝗿𝗲 𝗻𝗲𝘃𝗲𝗿 𝗰𝗹𝗲𝗮𝗻

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like “+”, “-”, or even brackets. And sometimes they even come with .00 at the end because of how the data is stored or exported. If we don’t clean them properly, it becomes very difficult to use that data for analysis or communication.

In Pandas, cleaning phone number columns is actually simple once you understand the approach.

First, I usually convert the column to string format. This avoids unexpected issues, especially when numbers are stored as integers, floats, or mixed types.

After that, the main step is removing unwanted characters. One caution: strip any trailing decimal part (like .00) first, otherwise those zeros get glued onto the end of the number once the dot is removed. Using regular expressions, we can then keep only digits and remove everything else. For example:

df['phone'] = df['phone'].astype(str).str.replace(r'\.0+$', '', regex=True)
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

These two lines can handle most messy formats.

One important step I always follow is standardizing the final output. No matter how the number comes, I take only the last 10 digits. This helps remove country codes like +91 and keeps the data consistent. Something like:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid. Some may be too short or too long. So I often filter numbers based on length to make sure we only keep meaningful data. If needed, I also format the numbers again in a clean and readable way.

What I learned from this is simple — data cleaning is not about writing complex code, it’s about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy. Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
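A quick sketch of the whole cleanup on a toy column (the sample values are invented; the 10-digit rule assumes Indian-style numbers as in the post):

import pandas as pd

df = pd.DataFrame({'phone': ['+91 98765-43210', '9876543210.00', '(987) 654 3210']})

df['phone'] = df['phone'].astype(str).str.replace(r'\.0+$', '', regex=True)  # drop float artifacts like .00
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)             # keep digits only
df['phone'] = df['phone'].str[-10:]                                          # standardize to the last 10 digits
df = df[df['phone'].str.len() == 10]                                         # basic length validation
print(df)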
I used to struggle with Pandas… until I learned these 12 functions.

Now I use them almost daily for:
✔️ Cleaning messy datasets
✔️ Exploring data faster
✔️ Building efficient workflows

If you’re working with data, these are NON-NEGOTIABLE:

🔹 read_csv() – Load data instantly
🔹 head() – Quick preview
🔹 info() – Understand structure
🔹 describe() – Summary stats
🔹 isnull() – Find missing values
🔹 dropna() – Remove missing records
🔹 fillna() – Handle nulls
🔹 groupby() – Powerful aggregations
🔹 sort_values() – Organize data
🔹 value_counts() – Frequency analysis
🔹 merge() – Combine datasets
🔹 apply() – Custom logic

I’ve personally used these while working on data validation & analysis tasks — and they’ve made everything faster and cleaner.

Which Pandas function do you use the most? Or which one are you learning next?

📌 Save this post — you’ll thank yourself later

#Python #Pandas #DataAnalysis #DataScience #DataEngineering #Analytics #LearnPython #TechCareers
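For illustration, here is how a few of these chain together on a hypothetical employees.csv (the file name and the salary/department columns are assumptions, not from the post):

import pandas as pd

df = pd.read_csv('employees.csv')           # assumed file
df.info()                                    # structure and dtypes
print(df.isnull().sum())                     # missing values per column

df = df.dropna(subset=['salary'])            # assumed column
df = df.sort_values('salary', ascending=False)
print(df.groupby('department')['salary'].mean())   # assumed column
print(df['department'].value_counts())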
𝑴𝒐𝒔𝒕 𝒄𝒐𝒎𝒑𝒂𝒏𝒊𝒆𝒔 𝒔𝒕𝒐𝒓𝒆 𝒕𝒉𝒆𝒊𝒓 𝒅𝒂𝒕𝒂 𝒕𝒉𝒆 𝒘𝒓𝒐𝒏𝒈 𝒘𝒂𝒚. 𝑯𝒆𝒓𝒆'𝒔 𝒘𝒉𝒚 𝒊𝒕 𝒎𝒂𝒕𝒕𝒆𝒓𝒔.

When you work with data in Python, you're likely using pandas. And pandas made a very deliberate choice: it stores data in 𝐜𝐨𝐥𝐮𝐦𝐧𝐬, not rows. This isn't a technical detail. It has real consequences for your team's speed and infrastructure costs.

𝐑𝐨𝐰 𝐬𝐭𝐨𝐫𝐚𝐠𝐞 (𝐡𝐨𝐰 𝐉𝐒𝐎𝐍 𝐰𝐨𝐫𝐤𝐬): Every record is a self-contained dictionary. Great for APIs and transactional systems — you always grab the full object.

𝐂𝐨𝐥𝐮𝐦𝐧𝐚𝐫 𝐬𝐭𝐨𝐫𝐚𝐠𝐞 (𝐡𝐨𝐰 𝐩𝐚𝐧𝐝𝐚𝐬 𝐰𝐨𝐫𝐤𝐬): Every column is a contiguous list. All ages together. All names together. All cities together.

Why does this matter in practice?

→ 𝐒𝐩𝐞𝐞𝐝. When you calculate the average age of your customers, columnar storage loops over a single array of integers in memory. Row storage has to dig into each individual record, one by one. The difference at scale is enormous.

→ 𝐌𝐞𝐦𝐨𝐫𝐲. In row storage, the key "age" is repeated for every single row. In columnar storage, it's stored once. With millions of records, this adds up fast.

→ 𝐕𝐞𝐜𝐭𝐨𝐫𝐢𝐳𝐚𝐭𝐢𝐨𝐧. NumPy can apply operations to an entire column at C-level speed. With row-oriented data, you're stuck with Python-level loops.

→ 𝐂𝐨𝐦𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧. Columns compress beautifully because similar values live next to each other. This is why formats like Parquet are so efficient for storage and I/O.

The rule of thumb:
- Building APIs or handling transactions? 𝐑𝐨𝐰-𝐨𝐫𝐢𝐞𝐧𝐭𝐞𝐝 𝐢𝐬 𝐟𝐢𝐧𝐞.
- Running aggregations, filters, ML pipelines, or any analytical workload? 𝐂𝐨𝐥𝐮𝐦𝐧𝐚𝐫 𝐢𝐬 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐭𝐨𝐨𝐥.

If you're frequently converting pandas DataFrames back to JSON records (𝘥𝘧.𝘵𝘰_𝘥𝘪𝘤𝘵(𝘰𝘳𝘪𝘦𝘯𝘵='𝘳𝘦𝘤𝘰𝘳𝘥𝘴')), you're often leaving significant performance on the table. The data format you choose upstream shapes the cost and speed of every analysis downstream. Choose deliberately.

At Arraxis, we help companies make practical decisions about how they store, structure, and use their data.

#DataEngineering #Python #Pandas #DataStrategy #Analytics #BusinessIntelligence
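A rough sketch of the speed point, comparing a Python-level loop over row records with a vectorized column (timings are illustrative; exact numbers depend on your machine):

import time
import pandas as pd

# Row-oriented: a list of dicts, like JSON records
rows = [{"name": f"user{i}", "age": i % 80, "city": "NYC"} for i in range(1_000_000)]

# Column-oriented: a pandas DataFrame built from the same data
df = pd.DataFrame(rows)

start = time.perf_counter()
avg_row = sum(r["age"] for r in rows) / len(rows)   # Python-level loop over every record
t_rows = time.perf_counter() - start

start = time.perf_counter()
avg_col = df["age"].mean()                           # vectorized over one contiguous column
t_cols = time.perf_counter() - start

print(f"rows: {t_rows:.3f}s  columns: {t_cols:.3f}s")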
If you're still cleaning CSVs by hand in 2026, you're working too hard. The same tasks repeat in every analyst's day, and Python can handle each one in under 10 lines of code. Yet most teams keep grinding through them manually.

Here are 8 Python automation scripts every data analyst should keep in their toolkit:

🔹 𝐀𝐮𝐭𝐨 𝐂𝐥𝐞𝐚𝐧 𝐂𝐒𝐕 𝐅𝐢𝐥𝐞𝐬
Drop duplicates, fill nulls, lowercase columns, and standardize names in 4 lines of pandas.

🔹 𝐌𝐞𝐫𝐠𝐞 𝐌𝐮𝐥𝐭𝐢𝐩𝐥𝐞 𝐂𝐒𝐕𝐬
Combine every CSV in a folder using glob + pd.concat. One script, infinite files.

🔹 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐞 𝐒𝐮𝐦𝐦𝐚𝐫𝐲 𝐑𝐞𝐩𝐨𝐫𝐭
df.describe() exports a full statistical summary in seconds.

🔹 𝐃𝐞𝐭𝐞𝐜𝐭 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐕𝐚𝐥𝐮𝐞𝐬
df.isnull().sum() catches every gap in your dataset, no manual checking.

🔹 𝐂𝐫𝐞𝐚𝐭𝐞 𝐄𝐱𝐜𝐞𝐥 𝐑𝐞𝐩𝐨𝐫𝐭
Group data and write polished Excel sheets with ExcelWriter. No copy-paste.

🔹 𝐀𝐮𝐭𝐨𝐦𝐚𝐭𝐞 𝐃𝐚𝐭𝐚 𝐕𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧
Generate matplotlib charts and save them as PNGs ready for stakeholders.

🔹 𝐒𝐞𝐧𝐝 𝐄𝐦𝐚𝐢𝐥 𝐑𝐞𝐩𝐨𝐫𝐭
smtplib + EmailMessage delivers daily reports straight to your team.

🔹 𝐒𝐜𝐡𝐞𝐝𝐮𝐥𝐞 𝐒𝐜𝐫𝐢𝐩𝐭 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
The schedule library runs scripts on autopilot. Set it once, forget it.

The difference between a good analyst and a great one isn't tools. It's how much they automate. Save this and start replacing one repetitive task at a time.

#Python #DataAnalytics #Pandas #Automation #DataScience
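As a taste, here is a minimal sketch of the "merge multiple CSVs" script (the folder and output file names are assumptions):

import glob
import pandas as pd

# Combine every CSV in a folder into one DataFrame (folder name is illustrative)
files = glob.glob("reports/*.csv")
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("combined_reports.csv", index=False)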
Day 19 — Merging & Joining Data in Pandas

As I continue deepening my understanding of pandas, today’s focus was on something very practical: combining datasets. In real-world scenarios, data rarely comes in a single clean table. You often have multiple datasets that need to be brought together before any meaningful analysis can happen. That’s where pandas functions like merge(), join(), and concat() come in.

Here’s a quick breakdown of what I learned:

🔹 merge()
This is similar to SQL joins. It allows you to combine datasets based on a common column. You can perform:
- Inner joins
- Left joins
- Right joins
- Outer joins
Example: pd.merge(df1, df2, on="id", how="inner")

🔹 join()
Used mainly for combining DataFrames based on their index. It’s a bit more concise when working with indexed data.

🔹 concat()
Used to stack DataFrames either:
- Vertically (adding more rows)
- Horizontally (adding more columns)
Example: pd.concat([df1, df2], axis=0)

💡 Key Insight: Understanding when to use each method is crucial.
- Use merge() when working with relational data
- Use concat() when stacking data
- Use join() for index-based alignment

This concept is especially important in data cleaning and preprocessing, where datasets often come from different sources. Each day, pandas feels less like a tool and more like a language for working with data.

#M4aceLearningChallenge #Day19 #DataScience #MachineLearning #Python #Pandas #DataAnalysis
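A tiny runnable sketch of all three on made-up toy data:

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Ben", "Cy"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

merged = pd.merge(df1, df2, on="id", how="inner")            # SQL-style join on a common column
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)   # stack rows vertically
joined = df1.set_index("id").join(df2.set_index("id"))       # align on the index

print(merged)
print(stacked)
print(joined)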
𝗜 𝗹𝗲𝗮𝗿𝗻𝗲𝗱 𝘀𝗼𝗺𝗲𝘁𝗵𝗶𝗻𝗴 𝘀𝗺𝗮𝗹𝗹 𝗯𝘂𝘁 𝘃𝗲𝗿𝘆 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹 𝘄𝗵𝗶𝗹𝗲 𝘄𝗼𝗿𝗸𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗣𝗮𝗻𝗱𝗮𝘀 𝗺𝗲𝗿𝗴𝗲𝘀 — 𝘂𝘀𝗶𝗻𝗴 𝗶𝗻𝗱𝗶𝗰𝗮𝘁𝗼𝗿=𝗧𝗿𝘂𝗲

At first, I used to merge DataFrames and just trust the result. If the output looked right, I would move on. But many times, hidden issues were there: missing matches, unexpected duplicates, or extra rows.

Then I discovered the indicator=True parameter. When you use it in a merge, Pandas adds a new column called "_merge". This column tells you exactly where each row came from:
* "left_only" → present only in the left DataFrame
* "right_only" → present only in the right DataFrame
* "both" → matched in both

This one column completely changed how I debug merges. Instead of guessing, I can now clearly see:
* Which records didn’t match
* If my join keys are correct
* Whether I’m losing or gaining data unexpectedly

For example, after a merge, I just do a quick check:

df['_merge'].value_counts()

In seconds, I know if something is wrong. This is especially useful in real-world data pipelines where data is messy and assumptions often fail. It’s a small trick, but it gives a lot of confidence in your data.

#DataScience #Python #Pandas #DataEngineering #DataAnalytics
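A minimal sketch with toy data showing what the _merge column looks like:

import pandas as pd

orders = pd.DataFrame({"id": [1, 2, 3]})
customers = pd.DataFrame({"id": [2, 3, 4]})

merged = pd.merge(orders, customers, on="id", how="outer", indicator=True)
print(merged["_merge"].value_counts())
# Expected counts here: both=2, left_only=1, right_only=1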
🐍 Day 3/30 — Python for Data Engineers

Dictionaries & Sets. The tools that make pipelines fast.

Every Data Engineer works with dicts daily — whether parsing API responses, defining schemas, or managing configs. But here's the one that most beginners miss 👇

Sets are basically SQL operations:
A & B → INNER JOIN (intersection)
A | B → FULL OUTER JOIN (union)
A - B → LEFT ANTI JOIN (difference)
A ^ B → schema drift detector 🚨

That last one is genuinely useful in production:

new_cols = incoming_cols - expected_cols
# → {"total"} ← column you didn't expect. Alert!

And remember: dict/set lookup is O(1) — hash table under the hood. List lookup is O(n) — it scans every element. On 10M rows, that difference is seconds vs milliseconds.

📌 Full cheat sheet in the image — methods, comprehensions, real DE patterns.

Day 4 tomorrow: Functions & Lambda 🔧

What's your most-used dict method? .get() or .items()? Drop it below 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #SQL
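A small self-contained sketch of the schema-drift check described above (the column names are invented for illustration):

expected_cols = {"id", "name", "amount"}
incoming_cols = {"id", "name", "amount", "total"}

new_cols = incoming_cols - expected_cols        # columns that arrived but weren't expected
missing_cols = expected_cols - incoming_cols    # expected columns that disappeared
drift = incoming_cols ^ expected_cols           # any difference in either direction

if drift:
    print(f"Schema drift detected! new={new_cols}, missing={missing_cols}")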
𝗪𝗼𝗿𝗸𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗹𝗮𝗿𝗴𝗲 𝗱𝗮𝘁𝗮𝘀𝗲𝘁𝘀 𝗶𝗻 𝗣𝗮𝗻𝗱𝗮𝘀 𝘁𝗮𝘂𝗴𝗵𝘁 𝗺𝗲 𝗼𝗻𝗲 𝘀𝗶𝗺𝗽𝗹𝗲 𝗹𝗲𝘀𝘀𝗼𝗻 — 𝗺𝗲𝗺𝗼𝗿𝘆 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗺𝗼𝗿𝗲 𝘁𝗵𝗮𝗻 𝘄𝗲 𝘁𝗵𝗶𝗻𝗸.

In the beginning, I used to load dataframes without even thinking about how much memory they consume. Everything looked fine… until one day my script slowed down, and sometimes even crashed. That’s when I realized it’s not always about the data size, it’s about how efficiently we handle it.

One simple habit that changed things for me is checking the memory usage of a dataframe. In Pandas, you can do this very easily:

df.info()

This gives a quick summary of your dataframe, including memory usage. But if you want a more detailed view, you can use:

df.memory_usage(deep=True)

This shows how much memory each column is using. Adding deep=True helps you get accurate results, especially for object-type columns like strings.

What I found interesting is that sometimes a few columns consume most of the memory. Object columns in particular silently take up a lot of space.

Once you know where the memory is going, you can start optimizing:
* Convert object columns to category if they have repeated values
* Use smaller data types like int32 instead of int64
* Drop unnecessary columns early

These small steps make a big difference, especially when working with large datasets.

For me, this was a small learning, but very powerful. Now, before doing any heavy operations, I just take a few seconds to check memory usage and it saves me minutes (sometimes hours) later.

If you’re working with Pandas, give this a try. It might look small, but it can completely change how your code performs.

#BigData #Python #Pandas #DataAnalytics
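A quick sketch of the kind of savings this gives (toy data; real numbers depend on your dataset):

import pandas as pd

df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Pune"] * 100_000,   # highly repetitive strings
    "count": range(300_000),
})

print(df.memory_usage(deep=True))            # the object column usually dominates

df["city"] = df["city"].astype("category")   # repeated values → category
df["count"] = df["count"].astype("int32")    # smaller integer type

print(df.memory_usage(deep=True))            # typically a large reduction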
🔶 drop_duplicates() catches exact copies. But real data has a sneakier problem that it completely misses.

Same person. Slightly different entry. Same Employee ID.

Employee_ID: 10234 | Name: John Kamau | Dept: Sales
Employee_ID: 10234 | Name: J. Kamau | Dept: sales

Those look different enough that drop_duplicates() won’t touch them. But they’re the same person entered twice.

Here’s how to catch it:

# Find IDs appearing more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]
print(f"Records with duplicate IDs: {len(duplicate_ids)}")
print(duplicate_ids.sort_values("Employee_ID").head(20))

🔷 This shows every row that shares an ID with another row. Now you can actually investigate instead of guessing.

The fix depends on what you find:

# Keep only the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")

Soft duplicates are dangerous for one reason: your analysis treats one person as two data points. Your model learns from the same person twice. Your headcount reports are wrong from the start. And none of it raises an error. Everything looks fine.

📍 Check for duplicates by key columns, not just identical rows. That extra step catches what the default function misses.

❓ Have you ever found soft duplicates in a dataset? What gave it away?

#DataCleaning #Python #DataScience
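A self-contained toy version of that check (the data and the date_added column are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [10234, 10234, 10567],
    "Name": ["John Kamau", "J. Kamau", "Mary W."],
    "Dept": ["Sales", "sales", "HR"],
    "date_added": pd.to_datetime(["2024-01-10", "2024-03-02", "2024-02-15"]),
})

# Every row whose ID appears more than once; drop_duplicates() alone misses these
dupes = df[df.duplicated(subset=["Employee_ID"], keep=False)]
print(dupes)

# Keep only the most recent record per employee
clean = (df.sort_values("date_added", ascending=False)
           .drop_duplicates(subset=["Employee_ID"], keep="first"))
print(clean)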