𝗜 𝘀𝘁𝗿𝘂𝗴𝗴𝗹𝗲𝗱 𝘄𝗶𝘁𝗵 𝗮 𝗣𝘆𝘁𝗵𝗼𝗻 𝗢𝗢𝗣 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗶𝗻 𝗮𝗻 𝗶𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄.

And that’s when I realised… “Knowing Python” ≠ “Understanding Python deeply.”

Over the last 3 weeks, I went back to basics and rebuilt my Python foundation from scratch — this time with more clarity + practice, not just theory. I didn’t just watch videos; I practised everything hands-on in Google Colab alongside learning.

Here’s what I revised and strengthened:

🔹 Core Basics
• Variables & data types
• Operators (arithmetic, comparison, logical)
• Input/output
• Conditional statements
• Loops (for, while)

🔹 Data Structures
• Lists (indexing, slicing)
• Tuples, sets
• Dictionaries (operations)

🔹 Functions
• Function definitions & returns
• Default / positional / keyword arguments
• *args and **kwargs
• Lambda functions

🔹 Functional Programming
• List comprehensions
• map(), filter(), zip()

🔹 File Handling & Exceptions
• File read/write, with open()
• CSV basics
• try / except / finally
• Handling multiple exceptions

🔹 Iteration & Generators
• Iterables vs iterators
• enumerate()
• Generators & yield

🔹 Python Internals
• f-strings, raw strings
• Dunder variables (__name__, __doc__)
• if __name__ == "__main__"
• Unpacking (*, **, _)
• Escape sequences, docstrings
• Importing libraries

🔹 OOP (Core + Advanced)
• Classes, objects, __init__, self
• Instance / class / static methods
• Encapsulation, inheritance, polymorphism, abstraction
• Private & protected variables
• super()
• Getters, setters, @property

🔹 Decorators
• Wrapper functions
• @ syntax
• Relation with *args, **kwargs

🔹 Coding Practices
• Modular coding
• Pythonic vs traditional coding
• Clean structure

🔹 Time and Space Complexity

🔹 Common Data Libraries
• NumPy → numerical computing
• Pandas → data analysis
• Matplotlib/Seaborn → visualisation

Learning resources:
• Python playlist by Data with Baraa — https://lnkd.in/gdapBd4f
• Visually Explained playlists — https://lnkd.in/g3RuBERm
• Python OOPs by Rishabh Mishra — https://lnkd.in/gvkBZ3Nj
• ChatGPT study mode
• GeeksforGeeks

After this deep dive, I can confidently say: strong fundamentals change how you think, not just how you code.

Next step → diving into Python interview Qs & problem-solving.

Grateful to all the learning resources!!! Happy learning 😀

#Data #DataAnalyst #Python #LearningJourney #InterviewPreparation #DataAnalytics #OOP #Programming #Upskilling #Consistency #Opentowork #India
Behind the Scenes of the .pkl File: How Python "Freezes" Your Data 🥒📦

If you work with Python for Machine Learning, QSAR, or Data Engineering, you’ve definitely seen .pkl files. But have you ever wondered what’s actually happening under the hood when you save one? Unlike a CSV or JSON, which only stores raw text and numbers, a Pickle file stores the soul of your Python object.

🧠 How it Works: The Magic of Serialization
The process behind a .pkl file is called serialization (or "pickling"):
• Memory mapping: when you create a complex model or a chemical database, Python organizes it in your RAM as a sophisticated web of pointers and references.
• The byte stream: the pickle library traverses that complex structure and flattens it into a linear stream of bytes (a sequence of 0s and 1s).
• Perfect reconstruction: when you use pickle.load, Python reads that stream and rebuilds the object with the exact same structure, data types, and attributes it had before.
It’s like disassembling a LEGO castle, labeling every piece, and perfectly reassembling it in a different room.

📁 What does it save that a CSV can't?
While a text file "forgets" the properties of an object, a .pkl preserves:
• Exact typing: if your data was a 64-bit float or a specific NumPy array type, it stays that way.
• Object relationships: if you have a dictionary pointing to a list of SMILES strings, those internal links remain intact.
• Learned parameters: for Machine Learning, it saves the weights and coefficients your algorithm spent hours (or days) learning.

🛠️ The Syntax: "wb" and "rb"
In your code, you will always see these modes:
• 'wb' (write binary): necessary because you aren't writing "text," you are writing raw machine data.
• 'rb' (read binary): necessary to translate those bytes back into a Python object you can interact with.

⚖️ When should you use it?
✅ YES for: saving trained models, pre-computed molecular fingerprints, or the state of a long-running experiment.
❌ NO for: public data sharing (use JSON or Parquet instead — unpickling an untrusted file can execute arbitrary code) or when you need to open the file in another language like R or Julia.

Understanding your file formats is the first step toward building more robust, reproducible research workflows! 🚀

#Python #DataScience #MachineLearning #Pickle #Programming #TechInsights #QSAR #Bioinformatics #CodingTips
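A minimal sketch of the round trip described above. The toy dict of SMILES keys and NumPy "fingerprints" is illustrative, not from the original post:

import pickle
import numpy as np

# A "complex" object: a dict mapping SMILES strings to NumPy fingerprints
fingerprints = {
    "CCO": np.array([1, 0, 1, 1], dtype=np.int64),       # ethanol (toy data)
    "c1ccccc1": np.array([0, 1, 1, 0], dtype=np.int64),   # benzene (toy data)
}

# Serialize ("pickle") the object to disk — note the 'wb' mode
with open("fingerprints.pkl", "wb") as f:
    pickle.dump(fingerprints, f)

# Deserialize ("unpickle") it back — note the 'rb' mode
with open("fingerprints.pkl", "rb") as f:
    restored = pickle.load(f)

# The dtype and the dict structure survive the round trip
print(restored["CCO"].dtype)                                  # int64
print(np.array_equal(restored["CCO"], fingerprints["CCO"]))   # True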
Week 4 hit differently. Not just because the material stacked up fast, but because this week, things started clicking into a much bigger picture. Digital Skola

SESSION 10 — Python is more than a language, it's a way of thinking
We went deep into Python's core data structures: List for ordered, mutable collections. Dictionary for key-value pairs that map attributes to values. Tuple for ordered, immutable sequences when you want data that shouldn't change.
Then came Conditional Statements — if, if-else, if-elif-else, and nested if — giving programs the ability to make decisions. And Loops — for, while, break, continue, pass, nested loops — giving programs the ability to repeat.
Separately, they're just syntax. Together, they're the building blocks of logic.

SESSION 11 — Functions: write once, use everywhere
A function is a reusable block of code built for a specific task. Once defined with def, it can accept parameters, process logic, and return a value — called as many times as needed without rewriting anything.
We also got into scope: local variables live inside the function, global variables are accessible everywhere, and the global keyword bridges the two.
Then Lambda — anonymous one-liner functions inherited from lambda calculus theory (1930s). Short, clean, direct. And Modules and Packages, showing how Python code organizes itself into reusable, importable units at scale.
The benefit isn't just shorter code. It's code that's easier to read, easier to fix, and easier to hand off to someone else.

SESSION 12 — NumPy: where Python becomes serious about data
NumPy (Numerical Python) is the foundational library for scientific computing, linear algebra, and data analysis in Python. Its core is the ndarray — a multidimensional array that's fast, homogeneous, and built for numerical work. 1D arrays as vectors, 2D as matrices, 3D as tensors.
And a full toolkit of operations: reshape, flatten, transpose, concatenate, stack, split, resize, append, insert, delete, unique, and slicing. Each one exists for a reason, and understanding when to use which one is where the real learning lives (a small sketch follows below).

This week made it clear: data science isn't just about knowing tools. It's about understanding what each tool is actually doing to your data.

Group 3 Baby Python is still very much in the game. The material gets heavier every week — and that's exactly the point.

#DigitalSkola #LearningProgressReview #DataScience #Python #NumPy #Bootcamp #BabyPython
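Not from the session material itself — just a minimal sketch of a few of the ndarray operations Session 12 lists, so the vector/matrix framing is concrete:

import numpy as np

a = np.arange(12)            # 1D "vector": [0, 1, ..., 11]
m = a.reshape(3, 4)          # 2D "matrix", 3 rows x 4 columns
t = m.T                      # transpose: 4 x 3 (a view, no copy)

flat = m.flatten()           # back to 1D (this one *is* a copy)
stacked = np.concatenate([m, m], axis=0)   # 6 x 4: stack two matrices row-wise

print(m.shape, t.shape, flat.shape, stacked.shape)
# (3, 4) (4, 3) (12,) (6, 4)

print(np.unique([3, 1, 3, 2, 1]))   # [1 2 3] — duplicates removed, result sorted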
If you have done a little coding, one of the tasks you have probably performed is sorting with sort() or sorted(). Most people think Python’s sort() is just… sorting. But under the hood, it’s running one of the most elegant algorithms ever designed for real-world data.

Python doesn’t use QuickSort. It uses Timsort. And since Python 3.11, it got even better with Powersort.

🔍 What’s actually happening?
Python’s list.sort() and sorted() are powered by Timsort (and now an improved merge strategy via Powersort). Timsort is a hybrid of Merge Sort and Insertion Sort. But here’s the twist 👇
👉 It’s designed for real-world data, not random arrays.

⚡ Key Insight: “Runs”
Timsort scans your data for already sorted chunks (called runs). Example:
[1, 2, 3, 10, 9, 8, 20, 21]
It sees:
• [1, 2, 3, 10] → already sorted
• [9, 8] → reverse run (fixed internally)
• [20, 21] → sorted
Instead of sorting from scratch, it merges these runs efficiently.
👉 That’s why Python sorting can be O(n) in the best cases.

What changed in Python 3.11?
Python introduced Powersort (an improved merge strategy). Still stable ✅ Still adaptive ✅ But closer to optimal merging decisions.
👉 Translation: faster in complex real-world scenarios.

🧠 Stability (this matters more than you think)
Python sorting is stable.
data = [("A", 90), ("B", 90), ("C", 80)]
sorted(data, key=lambda x: x[1])
Output: [('C', 80), ('A', 90), ('B', 90)]
👉 Notice A stays before B (original order preserved).
This is critical in multi-level sorting, ranking systems, and financial data pipelines.

⚙️ Small Data Optimization
For small arrays (< ~64 elements), Python switches to 👉 binary insertion sort. Why? Lower overhead, and faster in practice for small inputs.

🔄 sort() vs sorted()
arr.sort()    # in-place, modifies the original, returns None
sorted(arr)   # returns a new list
👉 Same algorithm, different behavior (see the sketch below).

Python vs Excel
Python → Timsort / Powersort (adaptive, stable). Excel → QuickSort (mostly). QuickSort is fast on random data, but Python wins on partially sorted real-world data.

Python sorting isn’t just fast. It’s adaptive, stable, hybrid, and optimized for real-world data. And that’s why it quietly outperforms “theoretically faster” algorithms in practice.

Sometimes the smartest systems don’t reinvent everything… they just optimize for how data actually behaves.

#Python #Algorithms #SoftwareEngineering #DataStructures #Coding #TechDeepDive
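To make the stability claim concrete, here is a small hypothetical example (names and scores are mine, not from the post) showing the classic two-pass multi-level sort that stability enables, plus the sort()/sorted() contrast:

# Stability in action: sort by the secondary key first, then by the primary key.
# Because Timsort is stable, the earlier ordering survives among equal keys.
records = [("Bob", 90), ("Alice", 90), ("Carol", 80), ("Dave", 80)]

by_name = sorted(records, key=lambda r: r[0])            # secondary key: name
by_score_then_name = sorted(by_name, key=lambda r: r[1]) # primary key: score

print(by_score_then_name)
# [('Carol', 80), ('Dave', 80), ('Alice', 90), ('Bob', 90)]
# -> names stay alphabetical within each score group

# sort() vs sorted(): same algorithm, different behavior
arr = [3, 1, 2]
print(sorted(arr), arr)   # [1, 2, 3] [3, 1, 2]  -> new list, original untouched
print(arr.sort(), arr)    # None [1, 2, 3]       -> in-place, returns None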
Day 12 of My Data Science Journey — Python Lists: Methods, Comprehension & Shallow vs Deep Copy

Today’s focus was on one of the most essential data structures in Python — lists. From data storage to manipulation, lists are used everywhere in real-world applications and data science workflows.

𝐖𝐡𝐚𝐭 𝐈 𝐋𝐞𝐚𝐫𝐧𝐞𝐝:
• List properties — ordered, mutable, allows duplicates, and supports mixed data types
• Accessing elements — used indexing, negative indexing, slicing, and stride for flexible data access
• List methods — append(), extend(), insert() for adding elements; remove(), pop() for deletion; sort(), reverse() for ordering; count(), index() for searching and analysis
• Shallow vs deep copy — understood that direct assignment does not create a new copy; used copy(), list(), and slicing for safe duplication; learned the importance of copying, especially with nested data (see the sketch below)
• List comprehension — wrote concise and efficient code using list comprehension; combined loops and conditions in a single readable line
• Built-in functions — used sum(), len(), min(), max() for quick data insights
• Additional useful methods — clear(), sorted(), zip(), filter(), map(), any(), all()

𝐊𝐞𝐲 𝐈𝐧𝐬𝐢𝐠𝐡𝐭:
Understanding how lists work — especially copying and comprehension — is critical for writing efficient and bug-free Python code. Lists are not just a data structure; they are a core tool for solving real-world problems.

Read the full breakdown with examples on Medium 👇
https://lnkd.in/gFp-nHzd

#DataScienceJourney #Python #Lists #Programming
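A quick sketch of the shallow-vs-deep-copy pitfall the post mentions (the nested list here is just an illustrative example):

import copy

nested = [[1, 2], [3, 4]]

alias   = nested                  # direct assignment: NOT a copy, same object
shallow = nested.copy()           # new outer list, but the inner lists are shared
deep    = copy.deepcopy(nested)   # fully independent copy, inner lists included

nested[0].append(99)

print(alias)    # [[1, 2, 99], [3, 4]]  -> sees the change (same object)
print(shallow)  # [[1, 2, 99], [3, 4]]  -> inner list is shared, so it changes too
print(deep)     # [[1, 2], [3, 4]]      -> untouched

# List comprehension: loop + condition in one readable line
squares_of_evens = [x * x for x in range(10) if x % 2 == 0]
print(squares_of_evens)   # [0, 4, 16, 36, 64]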
My Data Science Journey — Python Tuple, Set, Dictionary & the Collections Library

Today’s focus was on Python’s core data structures — tuples, sets, and dictionaries — along with the powerful collections module that enhances their functionality for real-world use cases.

𝐖𝐡𝐚𝐭 𝐈 𝐋𝐞𝐚𝐫𝐧𝐞𝐝:

Tuple
– Ordered, immutable, allows duplicates
– Single-element tuples require a trailing comma → ("cat",)
– Supports packing and unpacking → x, y = 10, 30
– Cannot be modified after creation (TypeError by design)
– Faster than lists in certain operations
– Used in scenarios like geographic coordinates and fixed records
– Can be used as dictionary keys (unlike lists)

Set
– Unordered, mutable, stores unique elements only
– No indexing or slicing support
– An empty set must be created using set() ({} creates a dict)
– .remove() raises KeyError if the element is not found
– .discard() removes safely without error
– Supports operations like union, intersection, difference, symmetric_difference
– Methods like issubset(), issuperset(), isdisjoint() help in set comparisons
– frozenset provides an immutable version of a set
– Offers O(1) average time complexity for membership checks

Dictionary
– Key-value pair structure, ordered (insertion order, since Python 3.7), mutable, and keys must be unique
– Built on hash tables for fast lookups
– user["key"] → raises KeyError if missing
– user.get("key", default) → safe access with a fallback
– Methods: keys(), values(), items() for iteration
– pop(), popitem(), update(), clear(), del for modifications
– Widely used in real-world data like APIs and JSON responses
– Common pattern: a list of dictionaries for structured datasets

Collections Library
– namedtuple → tuple with named fields for better readability
– deque → efficient queue with O(1) operations on both ends
– ChainMap → combines multiple dictionaries without merging copies
– OrderedDict → maintains order with additional utilities like move_to_end()
– UserDict, UserList, UserString → useful for customizing built-in behaviors with validation and extensions

Performance Insight (average membership/lookup)
– List → O(n)
– Tuple → O(n)
– Set → O(1)
– Dictionary → O(1)

𝐊𝐞𝐲 𝐈𝐧𝐬𝐢𝐠𝐡𝐭:
Understanding when to use each data structure — and how collections enhances them — is crucial for writing efficient, scalable, and clean Python code. A small sketch follows below.

Read the full breakdown with examples on Medium 👇
https://lnkd.in/gvv5ZBDM

#DataScienceJourney #Python #Tuple #Set #Dictionary #Collections #Programming #DataStructures
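A self-contained sketch (names and values are illustrative, not from the post) showing a few of the structures and collections helpers listed above:

from collections import namedtuple, deque, ChainMap

# namedtuple: a tuple with named fields
Point = namedtuple("Point", ["lat", "lon"])
home = Point(12.97, 77.59)
print(home.lat, home.lon)          # 12.97 77.59

# deque: O(1) appends/pops at both ends
q = deque([1, 2, 3])
q.appendleft(0)
q.append(4)
print(q.popleft(), q.pop(), q)     # 0 4 deque([1, 2, 3])

# ChainMap: layered lookups without merging the dicts
defaults = {"theme": "light", "lang": "en"}
user_cfg = {"theme": "dark"}
cfg = ChainMap(user_cfg, defaults)
print(cfg["theme"], cfg["lang"])   # dark en

# dict.get vs []: safe access with a fallback
user = {"name": "Asha"}
print(user.get("age", "unknown"))  # unknown (no KeyError)

# set membership is O(1) on average
allowed = {"read", "write"}
print("write" in allowed)          # True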
ANATOMY OF NUMPY

NumPy is often described as "fast arrays for Python," but architecturally it is a typed contiguous memory layout with vectorized C operations and broadcasting rules that let you compute on millions of elements without writing a single loop. Understanding its anatomy means knowing why np.dot beats a Python for-loop by 100×, and when it silently returns the wrong answer.

1. THE NDARRAY
The core object: a contiguous block of memory plus metadata.
• dtype: fixed data type (float32, int64, bool). Chosen once, applied to every element. The wrong dtype silently loses precision.
• shape: tuple of dimensions. A (3, 4) array has 12 elements. Operations must agree on shape or broadcast.
• strides: bytes to jump when moving along each axis. Transposes and slices just update strides, never copy data.

2. BROADCASTING
NumPy aligns shapes from the right. Size-1 dimensions stretch to match.
• Rule 1: if shapes have different numbers of dimensions, prepend 1s to the shorter one.
• Rule 2: size 1 stretches to match. Anything else must agree, or broadcasting fails.
Why it matters: it avoids allocating huge intermediate arrays. x[:, None] - y[None, :] computes pairwise diffs with no loop.

3. VECTORIZATION
Every operation is implemented in C with SIMD. Python loops are 50–200× slower.
• Ufuncs: np.sin, np.exp, np.add. Elementwise with no Python overhead.
• Reductions: sum, mean, argmax. Accept an axis argument to reduce along specific dimensions.
• Linear algebra: np.dot, np.linalg.solve, the @ operator. Calls BLAS/LAPACK under the hood.

4. THE TRAPS
Four foot-guns that bite every data scientist at least once (see the sketch below).
• Views vs copies: slicing returns a view. Modifying a slice modifies the original. Use .copy() when you need independence.
• Integer overflow: int32 silently wraps at ±2.1 billion. Sum large integer arrays as int64 or float64.
• Shape mismatch: (n,) is not the same as (n, 1). The first is 1D, the second is 2D. Many bugs live in this gap.
• Hidden copies: fancy indexing with a boolean or integer array always returns a copy, even for a read.

🔥 THE BOTTOM LINE: The anatomy of NumPy is a balance between structure (dtype, shape, strides), math (vectorized C and BLAS calls), and iteration (broadcasting to avoid explicit loops). Master these three, and every Python data library built on NumPy (pandas, scikit-learn, PyTorch) starts to make sense at a lower level.

#Python #NumPy #DataScience #MachineLearning #MLEngineering #AI
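A minimal sketch (toy arrays, chosen by me) of the broadcasting trick and two of the traps listed above:

import numpy as np

# Broadcasting: pairwise differences without a loop or a huge intermediate array
x = np.array([1.0, 2.0, 3.0])          # shape (3,)
y = np.array([10.0, 20.0])             # shape (2,)
diffs = x[:, None] - y[None, :]        # shapes (3, 1) and (1, 2) -> result (3, 2)
print(diffs.shape)                     # (3, 2)

# Views vs copies: a slice is a view, so writing to it mutates the original
a = np.arange(6)
s = a[2:5]          # view
s[0] = 99
print(a)            # [ 0  1 99  3  4  5]  -> the original changed!

b = a[2:5].copy()   # independent copy
b[0] = -1
print(a[2])         # 99 -> untouched this time

# (n,) vs (n, 1): not the same thing
v = np.ones(3)          # shape (3,)
col = np.ones((3, 1))   # shape (3, 1)
print((v + col).shape)  # (3, 3) — broadcasting quietly produced a matrix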
******Step-by-Step: How to Build a Simple AI Agent from Scratch Using an IDE (Beginner-Friendly Technical Guide)******

Many people talk about AI Agents. Very few explain how to actually build one from zero. Here’s a complete hands-on example using Python + the OpenAI API where we create a simple AI Agent that reads a text file and generates action items automatically. No frameworks. No shortcuts. Pure fundamentals.

What This Agent Will Do
Input → Read meeting_notes.txt
Process → Understand the content
Output → Generate structured action items

Step 1: Install Required Tools
Install: Python (3.10 or higher), VS Code (or IntelliJ / PyCharm), and the OpenAI Python SDK.
Run this in the terminal:
pip install openai python-dotenv

Step 2: Create the Project Structure
Create a folder: ai-agent-demo
Inside it create: main.py, agent.py, meeting_notes.txt, .env

Step 3: Add the OpenAI API Key
Open the .env file and paste:
OPENAI_API_KEY=your_api_key_here
Save it.

Step 4: Add a Sample Input File
Open meeting_notes.txt and paste:
Rahul will prepare sprint report by Monday
Ankur will review automation failures
Team will finalize regression scope tomorrow
Save it.

Step 5: Create the Agent Logic File
Open agent.py and paste this code:

from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_action_items(text):
    prompt = f"""
    Extract action items from the following meeting notes.
    Return output as bullet points.

    Meeting Notes:
    {text}
    """
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Step 6: Create the Main Execution File
Open main.py and paste this code:

from agent import generate_action_items

def read_notes():
    with open("meeting_notes.txt", "r") as file:
        return file.read()

def run_agent():
    notes = read_notes()
    output = generate_action_items(notes)
    print("\nGenerated Action Items:\n")
    print(output)

if __name__ == "__main__":
    run_agent()

Step 7: Understand the Prompt Used
This is the intelligence layer of your agent:
"Extract action items from the following meeting notes. Return output as bullet points."
Prompt = behavior. Model = brain. Code = execution pipeline.
Change the prompt → the agent changes capability. Example variations: summarize notes, create Jira tickets, generate test cases, extract risks, create an email summary.

Step 8: Run the AI Agent
Open a terminal inside the project folder and run:
python main.py
The output appears like:
• Rahul prepares sprint report by Monday
• Ankur reviews automation failures
• Team finalizes regression scope tomorrow
Agent working successfully.

What Makes This an AI Agent?
Because it takes input, applies reasoning using an LLM, executes instructions via a prompt, and produces structured output.

#ArtificialIntelligence #GenerativeAI #LLM #OpenAI #PromptEngineering #AIEngineering
Python Pro Tips 🚀: What I learned building my first real Python dashboard (no one talks about this)

Everyone talks about:
👉 Pandas
👉 Plotly
👉 AI tools
But the real breakthrough for me was NOT libraries… it was understanding how Python actually thinks:

💡 1. Indentation = Logic
Not formatting. Not style.
👉 One wrong space = broken program.
GOLDEN RULE 👉 Every block must follow:
if something:
    code (4 spaces)
        deeper code (8 spaces)
→ if file: sits at 0 spaces; code inside it sits at 4 spaces; code inside if len(...) sits at 8 spaces.

🔒 BRACKETS
Every ( must close with ), every [ with ], every { with }.

WHY 0 / 4 / 8 SPACES?
Python doesn’t use { } like JS… 👉 it uses INDENTATION = STRUCTURE.

🔥 MY 3 if STATEMENTS (EXPLAINED CLEAN)
✅ LEVEL 1 (0 spaces): if file:
👉 TOP LEVEL — “only run everything below IF a file is uploaded.”
✅ LEVEL 2 (4 spaces): if categorical_cols:
👉 INSIDE if file — “only run this IF the file exists AND categorical columns exist.”
✅ LEVEL 3 (8 spaces): if selected_click != "None":
👉 INSIDE both of the above — “only run if the file exists, categorical columns exist, AND the user selected something.”

🎯 SIMPLE WAY TO SEE IT
IF file exists:
    DO A
    IF categorical exists:
        DO B
        IF user clicked:
            DO C

📊 VISUAL TREE (VERY IMPORTANT) — a runnable sketch follows below
if file:                               ← level 0
    df = ...
    if categorical_cols:               ← level 4
        selected_click = ...
        if selected_click != "None":   ← level 8
            filtered_df = ...

💡 2. Execution Order matters
👉 You can’t use a variable before it exists.
👉 “Create → then use” (simple, but critical).

💡 3. Scope is everything
If something is defined inside if file:, 👉 it ONLY exists there.

💡 4. Structure > Syntax
You can know all the libraries… but without structure, nothing works.

💡 5. Build → Break → Fix → Repeat
That’s the real learning loop. Not tutorials.

🔥 After struggling with these, I built:
✔ Dynamic dashboard (multi-KPI)
✔ Power BI–style filtering
✔ Multi-chart engine
✔ RCA (insight layer coming next)

📌 Biggest takeaway: stop chasing tools. Learn how things flow.
Everyone learns in a different way — for me it’s outside-in: first understand the whole picture (structure), then see which section goes where, what scripts each section contains, and how they are structured (spacing…).
That’s when everything clicks.

Having difficulty understanding it? I can explain it with a mental model: “Python Rooms & Floors.” (Only the first section of the image — getting the spacing right — kept me confused for a while :-))

Next step: 📡 turning this into 💎 a commercial-grade telecom intelligence platform. Let’s see where this goes 🚀

Thanks for your impressions, feedback & comments.
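A self-contained sketch of the 0/4/8-space nesting described above. The file, rows, and column values here are stand-ins I made up (the original dashboard code isn't in the post) — the point is only the indentation levels:

file = {"name": "sales.csv"}                 # pretend an upload succeeded
categorical_cols = ["region", "product"]     # pretend these columns were detected
selected_click = "region"                    # pretend the user picked one

if file:                                     # level 0: runs only if a file was uploaded
    rows = [{"region": "North", "sales": 10},
            {"region": "South", "sales": 7}]
    if categorical_cols:                     # level 4: runs only if file AND categorical columns exist
        if selected_click != "None":         # level 8: runs only if the user also selected something
            filtered = [r for r in rows if r["region"] == "North"]
            print(filtered)                  # [{'region': 'North', 'sales': 10}]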
7 Days of Advanced Python — Learning Beyond Basics
Day 3 — Making output readable and data reliable

Over the last two days, I improved how I set up projects and how I write/debug code. But today I noticed something else: even when the code is correct, understanding the output and managing data properly is still a challenge. Unstructured prints, messy logs, and loosely defined data can quickly make even simple projects harder to maintain.

So today I explored three things that changed how I think about output and data handling: Rich, Pydantic, and structured outputs (Instructor-style approach). A small combined sketch follows below.

Rich — Making the terminal actually readable
Before this, I mostly relied on print statements or basic logs. The problem is not just debugging — it’s readability. Rich transforms the terminal into something much more expressive. With minimal effort, you get:
• Beautiful formatted output
• Highlighted logs and errors
• Tables, JSON formatting, and better tracebacks
Compared to plain print:
• More readable output
• Better debugging clarity
• Faster understanding of program state
Documentation: https://lnkd.in/d457WDAA

Pydantic — Making data structured and reliable
Earlier, I passed data around as dictionaries without strict validation. It works… until it doesn’t. Pydantic introduces structure: you define what your data should look like, and it ensures correctness automatically.
What stood out:
• Data validation by default
• Clear structure using models
• Type safety improves reliability
Compared to raw dictionaries:
• Fewer runtime errors
• Cleaner and predictable data flow
• Easier to scale into larger systems

Structured Outputs — Thinking beyond scripts
This is where things started to feel more “production-level”. Instead of handling loose outputs, I explored structured outputs — where responses follow a defined schema. This is especially useful when working with APIs or AI systems.
Why it matters:
• Consistent outputs
• Easier parsing and integration
• Reduces ambiguity in responses
This approach shifts thinking from “just returning data” → “returning well-defined data”.
Learn more: https://lnkd.in/dU4AAPaJ

What changed for me today: I stopped focusing only on writing code that works. Instead, I started focusing on writing code that is easy to read, easy to debug, and easy to trust. Because in real systems, clarity and structure matter just as much as correctness.

Curious — do you focus more on writing code, or on making your output and data clean as well?

#Python #AdvancedPython #CleanCode #Pydantic #Rich #StructuredData #LearningInPublic
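Not the author's code — just a minimal combined sketch (assuming pydantic and rich are installed; the ActionItem fields are made up) of how the two pieces fit together:

from pydantic import BaseModel, ValidationError
from rich.console import Console
from rich.table import Table

console = Console()

# Pydantic: declare what the data must look like
class ActionItem(BaseModel):
    owner: str
    task: str
    done: bool = False

items = [
    ActionItem(owner="Rahul", task="Prepare sprint report"),
    ActionItem(owner="Ankur", task="Review automation failures", done=True),
]

# Rich: render the data as a readable table instead of a raw print
table = Table(title="Action Items")
table.add_column("Owner")
table.add_column("Task")
table.add_column("Done")
for item in items:
    table.add_row(item.owner, item.task, "✅" if item.done else "❌")
console.print(table)

# Validation by default: bad data fails loudly instead of propagating silently
try:
    ActionItem(owner="Team", task=123, done="maybe")
except ValidationError as e:
    console.print(e)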
Spark Day 4: Understanding RDD Transformations

🎯 New RDD creation: transformations are operations that take an existing RDD as input and produce a brand-new RDD as output.
🎯 Immutability: RDDs are immutable, meaning once an RDD is created, its data cannot be modified or altered.
🎯 Lazy evaluation: transformations are performed lazily. Spark simply records the sequence of operations (the lineage) and does not execute them immediately.
🎯 Triggered by actions: the entire chain of transformations is only processed and executed when an action (like collect() or count()) is explicitly called — see the small sketch after this reference list.

🎯 Common Transformations Reference

1. map — transforms each element of the original RDD by applying a specific function to it, returning a new RDD.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
new_rdd = rdd.map(lambda x: x * 2)

2. filter — returns a new RDD containing only the elements that satisfy a specified boolean condition.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
filter_rdd = rdd.filter(lambda x: x > 3)

3. flatMap — similar to map, but each input item can be mapped to zero or more output items. It effectively "flattens" the results into a single sequence.
rdd = spark.sparkContext.parallelize(["hello world", "goodbye world"])
words_rdd = rdd.flatMap(lambda line: line.split(" "))

4. groupByKey — groups the values associated with each unique key in the RDD, returning a new RDD of key-value pairs where the value is an iterable of the grouped items.
rdd = spark.sparkContext.parallelize([(25, 50000), (30, 70000), (25, 60000), (35, 90000), (30, 80000)])
grouped_rdd = rdd.groupByKey()
# Example of processing the grouped values:
grouped_rdd1 = grouped_rdd.mapValues(lambda x: sum(x))

5. reduceByKey — aggregates the values for each key using a provided function. It is generally more efficient than groupByKey because it performs a local merge on each partition before sending data across the network.
rdd = spark.sparkContext.parallelize([(25, 50000), (30, 70000), (25, 60000), (35, 90000), (30, 80000)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)

6. join — performs an inner join between two RDDs based on their matching keys, returning a new RDD of key-value pairs.
rdd1 = sc.parallelize([(1, ('Alice', 25)), (2, ('Bob', 30)), (3, ('Charlie', 35))])
rdd2 = sc.parallelize([(1, ('New York', 'Engineer')), (2, ('San Francisco', 'Artist')), (3, ('Boston', 'Doctor'))])
joined_rdd = rdd1.join(rdd2)

7. distinct — removes duplicate elements, returning a new RDD containing only the unique elements from the original RDD.
rdd = sc.parallelize([1, 2, 3, 4, 3, 2, 1, 5])
distinct_rdd = rdd.distinct()

➡️➡️➡️ P.T.O 🚪

#SQL #pyspark #dataengineering #dataanalyst #database #Bigdata #DataVisualization #BusinessIntelligence
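A hedged sketch (assuming a local SparkSession, not taken from the post) showing that transformations only build lineage until an action such as collect() or count() triggers execution:

# Transformations are lazy; only an action runs the job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)  # nothing executed yet

print(doubled_evens.collect())   # [4, 8, 12] — collect() is the action that triggers execution
print(doubled_evens.count())     # 3

spark.stop()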