𝗗𝗮𝘆 𝟮 𝗼𝗳 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗣𝘆𝘁𝗵𝗼𝗻 🐍
𝗗𝗮𝘁𝗮 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝘀 𝗔𝗿𝗲 𝗧𝗵𝗲 𝗥𝗲𝗮𝗹 𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻

Yesterday was about syntax and conditionals. Today was about understanding how data is actually stored and organized.

𝗧𝗼𝗱𝗮𝘆 𝗜 𝗖𝗼𝘃𝗲𝗿𝗲𝗱:
• Dictionaries
• Dictionary methods
• Nested dictionaries
• Sets
• Set operations

𝗗𝗶𝗰𝘁𝗶𝗼𝗻𝗮𝗿𝗶𝗲𝘀 — 𝗞𝗲𝘆-𝗩𝗮𝗹𝘂𝗲 𝗣𝗼𝘄𝗲𝗿
Dictionaries are not like lists: they store data as key-value pairs.

Example:
{ "name": "Yash", "age": 22, "skills": ["Python", "JavaScript"] }

Key learnings:
✔ Keys must be immutable
✔ Accessing a missing key raises a KeyError
✔ .get() is safer than direct access when a key may be absent
✔ Dictionaries are extremely powerful for real-world data modeling

I practiced:
• Creating dictionaries
• Accessing and updating values
• Adding new key-value pairs
• Removing elements
• Looping through keys and values
• Creating nested structures

𝗦𝗲𝘁𝘀 — 𝗨𝗻𝗶𝗾𝘂𝗲𝗻𝗲𝘀𝘀 𝗠𝗮𝘁𝘁𝗲𝗿𝘀
Sets automatically remove duplicates. That alone makes them powerful.

Example:
{1, 2, 2, 3, 4} → {1, 2, 3, 4}

I learned:
✔ Sets are unordered
✔ No duplicate values
✔ Fast membership checking
✔ Union, intersection, and difference operations

Set operations feel very mathematical — and extremely useful.

𝗣𝗿𝗼𝗯𝗹𝗲𝗺𝘀 𝗜 𝗦𝗼𝗹𝘃𝗲𝗱 𝗧𝗼𝗱𝗮𝘆
1️⃣ Count the frequency of each character in a string using a dictionary
2️⃣ Store student data (name, marks) and calculate the average
3️⃣ Merge two dictionaries
4️⃣ Remove duplicates from a list using a set
5️⃣ Find common elements between two lists using set intersection
6️⃣ Check if two strings are anagrams using dictionary counting
7️⃣ Create a simple phonebook using a dictionary
(Two of these are sketched in code right after this post.)

𝗥𝗲𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗧𝗼𝗱𝗮𝘆
If you don’t understand dictionaries properly, you can’t build:
• APIs
• Backend systems
• JSON handling
• AI data pipelines
• Configuration systems

Most real-world applications rely heavily on key-value structures.

𝗗𝗮𝘆 𝟮 𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻
Python is not about writing print statements. It’s about structuring data correctly.

Tomorrow: loops and functions — where logic becomes real.

Consistency > Motivation.

#Python #DataStructures #DeveloperJourney #BackendDevelopment #BuildInPublic #100DaysOfCode
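Here is a minimal sketch of two of today's problems: character frequency counting with a dictionary, and common elements via set intersection. Variable names and sample inputs are illustrative, not taken from the original exercises.

```python
# Problem 1: count the frequency of each character in a string with a dict
def char_frequency(text: str) -> dict:
    counts = {}
    for ch in text:
        # .get() avoids a KeyError the first time a character is seen
        counts[ch] = counts.get(ch, 0) + 1
    return counts

# Problem 5: common elements between two lists via set intersection
def common_elements(a: list, b: list) -> set:
    return set(a) & set(b)

print(char_frequency("banana"))               # {'b': 1, 'a': 3, 'n': 2}
print(common_elements([1, 2, 3], [2, 3, 4]))  # {2, 3}
```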
More Relevant Posts
🚀 𝐏𝐲𝐭𝐡𝐨𝐧 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐉𝐨𝐮𝐫𝐧𝐞𝐲
Today’s focus was understanding Python concepts with real-world application 👇

🔹 Tuple (Immutable)
• Cannot be modified after creation
• Faster than lists
• Methods: count(), index()
• Supports slicing, concatenation (+), repetition (*)
• Tuple packing → a = 10, 20, 30
💼 Use: fixed data (invoices, records)

🔹 Dictionary (Key-Value Mapping)
• Access via keys (not index)
• Add/update: dict[key] = value
• Methods: get(), keys(), values(), items(), update()
• Removal: pop(), popitem(), del, clear()
💼 Use: supplier / business data

🔹 Set (Unique & Unordered)
• No duplicates, no indexing
• Key operations:
✔ Union | → combine
✔ Intersection & → common
✔ Difference - → pending
✔ Symmetric difference ^ → mismatch
💼 Use: remove duplicate customers, data cleaning

🔹 Data Structures
List [] (mutable) | Tuple () (immutable)
Set {} (unique) | Dict {k: v} (mapping)

🔹 Static vs Dynamic Coding
Static → fixed values
Dynamic → user input

🔹 Input & Type Casting
• input() always returns a string
• Convert with int(), float()
• eval() evaluates the input as a Python expression (syntax matters ⚠️)
• A string is iterable → list(input()) splits it into characters

🔹 print() & Output
• print("Hello", name) (best practice)
• Concatenation needs matching types
❌ string + int → TypeError
✔ Fix: str() or format()

🔹 String Formatting & Errors
• {} auto-numbered | {0} manual (don’t mix ❌)
• Errors learned: TypeError, IndexError, ValueError

🔹 Other Concepts
• Multiple results can be returned as a tuple (A-B, A*B, A/B)
• len(), type()
• del → delete a variable
• \n → new line | r"" → raw string

💼 Business Insight:
Set → remove duplicates
Dict → manage structured data
Tuple → store fixed data

👉 The right data structure = better performance & decisions
Python is not just coding — it’s about solving real business problems logically.

#Python #DataAnalytics #BusinessAnalytics #LearningJourney
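As a companion to the list above, here is a minimal runnable sketch of the tuple, dict, and set behaviours described; the sample values and names are illustrative only.

```python
# Tuple: fixed record that cannot be modified after creation
invoice = ("INV-001", 2500.0, "2024-01-15")   # tuple packing also works: a = 10, 20, 30

# Dictionary: key-value access and update
supplier = {"name": "Acme", "city": "Pune"}
supplier["phone"] = "555-0101"                 # add/update via dict[key] = value
print(supplier.get("gst", "not provided"))     # .get() with a default avoids KeyError

# Set: duplicates disappear automatically
customers_jan = {"asha", "ravi", "meena"}
customers_feb = {"ravi", "meena", "john"}
print(customers_jan | customers_feb)           # union: all customers
print(customers_jan & customers_feb)           # intersection: repeat customers
print(customers_jan - customers_feb)           # difference: January-only customers

# Input & type casting: input() returns a string, so convert explicitly
# age = int(input("Age: "))                    # uncomment to try interactively
```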
Automate the boring stuff: Link Edition.

Stop managing your bookmarks and start automating your workflow. Here is a 30-second automation tip using Python: if you have a list of links in a CSV, let AI sort them and insert them into a formatted HTML report automatically.

```python
links = ["link1.com", "link2.com", "link3.com"]

# Build the report as a simple HTML string
html_report = "<h1>AI Curated Links</h1>\n<ul>\n"
for link in links:
    # Automation logic: insert each link as a list item
    html_report += f'<li><a href="http://{link}">{link}</a></li>\n'
html_report += "</ul>"

# Save the file automatically
with open("report.html", "w") as file:
    file.write(html_report)

print("Your link report has been generated by AI!")
```

Output: a ready-to-share HTML document.

This is just the tip of the iceberg. What repetitive "link" task do you want to automate?

#AIAutomation #PythonCoding #Efficiency #TechTips
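Since the post starts from "a list of links in a CSV", here is one possible extension using only the standard library; the links.csv filename and the one-URL-per-row layout are assumptions for illustration.

```python
import csv

# Assumes a links.csv with one URL per row in the first column
links = []
with open("links.csv", newline="") as f:
    for row in csv.reader(f):
        if row:                          # skip blank lines
            links.append(row[0].strip())

print(f"Loaded {len(links)} links from links.csv")
```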
From Messy Data to Beautiful Documents: A Python Data Transformation Story!

Ever wondered how to turn inconsistent, messy data into clean, professional documents? Here's a real-world example that shows the power of robust data parsing.

The Challenge:
Raw data comes in different formats. Take staff data, for example:
-> Pipe-delimited: "John Doe | Manager | NYC | 555-123-1234 | john@email.com"
-> Mixed formats with phone numbers scattered
-> Sometimes just plain text paragraphs
-> Inconsistent spacing and structure

The Solution:
A parser built in Python can handle ANY of these formats:
1. Multi-strategy parsing: try pipe-delimited first, then regex patterns, and finally fall back to plain text.
2. Data validation: extract phones/emails using regex patterns and validate the structure.
3. Flexible output: structure what we can, preserve what we can't as paragraphs.

The Magic Moment:
All of this messy data is transformed into one JSON format:
[
  {"name": "John Doe, Manager", "location": "NYC", "contact": "555-123-1234 | john@email.com"},
  ... and so on
]

Document Generation:
The Word template simply loops through the clean data:
{% for item in STAFF_ROWS %}
{{item.name}} {{item.location}} {{item.contact}}
{% endfor %}

Key Takeaways:
1. Plan for data consistency from day one.
2. Use multiple parsing strategies with graceful fallbacks.
3. Preserve the original data when parsing fails.
4. Separate data transformation from presentation logic.

This approach transforms 4 different data formats into ONE consistent output format. The result? Professional documents generated automatically, no matter how messy the source data!

What's your experience with messy data transformations? Share your war stories below!

#Python #Jinja2 #docxtemplate #DataTransformation #DataEngineering #Automation
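The post doesn't include the parser itself, so here is a minimal sketch of the multi-strategy idea under my own assumptions; the field order, regex patterns, and fallback shape are illustrative, not the author's code.

```python
import re

PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def parse_staff_line(raw: str) -> dict:
    """Try pipe-delimited first, then regex extraction, then plain-text fallback."""
    # Strategy 1: pipe-delimited "name | role | location | phone | email"
    parts = [p.strip() for p in raw.split("|")]
    if len(parts) == 5:
        name, role, location, phone, email = parts
        return {"name": f"{name}, {role}", "location": location,
                "contact": f"{phone} | {email}"}

    # Strategy 2: pull out whatever phone/email we can find with regex
    phone = PHONE_RE.search(raw)
    email = EMAIL_RE.search(raw)
    if phone or email:
        contact = " | ".join(m.group() for m in (phone, email) if m)
        return {"name": raw.split(",")[0].strip(), "location": "", "contact": contact}

    # Strategy 3: preserve the original text as a paragraph
    return {"paragraph": raw.strip()}

print(parse_staff_line("John Doe | Manager | NYC | 555-123-1234 | john@email.com"))
```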
LLMs are smart. But they're clueless about YOUR data.

You've got 3 options:
→ Long context window — $13M/year
→ Fine-tuning — weeks of work
→ RAG — $11K. 50 lines. Done.

RAG wins. Every time.

Think of it like a library:
Without RAG → AI guesses → WRONG
With RAG → AI searches your docs → grabs the relevant pages → reads them → CORRECT

The entire pipeline is 5 steps:
1. LOAD your documents
2. CHUNK them into pieces
3. EMBED — convert to vectors
4. STORE in a vector database
5. RETRIEVE + GENERATE the answer

That's it. ~50 lines of Python.
Cost per query: $0.0001
Response time: ~1 second

Stack: Python + LangChain + OpenAI + ChromaDB

The carousel breaks down every step — with code snippets and visual diagrams. Full source code links in slide 9.

What document would you RAG first?

#RAG #Python #LLMs #BuildWithAI #AIEngineering

### Sources
- [HuggingFace — Code a simple RAG from scratch](https://lnkd.in/gMYWhrRA)
- [FreeCodeCamp — Learn RAG from Scratch (Python AI Tutorial)](https://lnkd.in/gmfecNfb)
- [KDnuggets — 7 Steps to Build a Simple RAG System](https://lnkd.in/grFQwBDH)
- [GitHub — awesome-llm-apps (60K+ stars, 100+ production-ready AI apps)](https://lnkd.in/gAQp3WRY)
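The carousel's own code isn't reproduced here, so this is a minimal sketch of the 5 steps under my own assumptions: it uses ChromaDB's default embedding model and the OpenAI chat API directly rather than the full LangChain stack named in the post, and the file name, collection name, and model name are placeholders.

```python
# Minimal RAG sketch: load -> chunk -> embed/store (Chroma) -> retrieve -> generate (OpenAI)
import chromadb
from openai import OpenAI

# 1. LOAD + 2. CHUNK: naive fixed-size character chunks
text = open("my_docs.txt", encoding="utf-8").read()
chunks = [text[i:i + 800] for i in range(0, len(text), 800)]

# 3. EMBED + 4. STORE: Chroma embeds with its default model when none is supplied
db = chromadb.Client()
collection = db.create_collection("docs")
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 5. RETRIEVE + GENERATE
question = "What does the document say about refunds?"
hits = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(hits["documents"][0])

client = OpenAI()  # needs OPENAI_API_KEY in the environment
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```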
“Data cleaning is where real data science begins.”

Today I spent time working on a real-world CSV dataset using Pandas in Python—and it turned out to be a great reminder that data rarely comes in a “ready-to-use” format.

At first glance, everything looked fine after loading it with read_csv(). But as I started exploring the dataset more deeply using functions like info(), describe(), and isnull().sum(), a different story emerged:
• Missing values across multiple columns
• Inconsistent data formats
• Some columns that added little to no analytical value
• A few unexpected duplicates

Instead of rushing into model building, I focused on understanding and preparing the data:
• Dropped irrelevant columns using drop()
• Handled missing values (both removal and basic imputation)
• Checked for duplicate records and removed them
• Standardized column formats where needed
• Took time to actually understand what each feature represents

One key realization from this exercise: good models don't come from complex algorithms alone—they come from clean, meaningful, and well-prepared data. It's easy to get excited about machine learning models, but the real impact lies in the quality of the data you feed them.

Data cleaning may not be the most glamorous part of the workflow, but it's definitely one of the most critical.

Grateful for the guidance and support from teacher Mohit Payasi sir throughout this learning process—having the right direction makes a huge difference when building strong fundamentals. 🙏🏻🌟

Strong foundations today lead to better, more reliable models tomorrow.

Would love to learn from others—what are your must-do steps when working with messy, real-world datasets?

#DataScience #Python #Pandas #DataCleaning #MachineLearning #DataAnalytics #LearningJourney #Programming
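For readers following along, here is a minimal Pandas cleaning sketch covering the same steps; the file name, column names, and imputation choices are illustrative, not from the actual dataset.

```python
import pandas as pd

# File and column names here are placeholders, not the original dataset
df = pd.read_csv("raw_dataset.csv")

# Explore first
print(df.info())
print(df.describe())
print(df.isnull().sum())

# Drop columns that add little analytical value
df = df.drop(columns=["unnamed_0", "internal_notes"], errors="ignore")

# Handle missing values: drop rows missing the target, impute a numeric feature
df = df.dropna(subset=["price"])
df["age"] = df["age"].fillna(df["age"].median())

# Remove duplicate records and standardize a text column
df = df.drop_duplicates()
df["city"] = df["city"].str.strip().str.title()

print(f"Clean shape: {df.shape}")
```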
Porting code is easy. Redesigning for downstream trust is where the real value lives. 💡

I recently migrated a legacy Azure Synapse pipeline over to Python. The easy path would have been a straight 1:1 translation: keeping it as a procedural script that spits out a dynamic dictionary. Instead, I took a step back and rebuilt it as a modular Python package.

Here are three architectural shifts I made to ensure "audit-grade" integrity for downstream consumers:

1️⃣ Stop trusting exit codes: I stopped relying on the script "finishing." By anchoring the extraction on strict START and FINISH markers inside the plain-text logs, silent partial failures now immediately halt the pipeline.

2️⃣ Dictionaries are a trap: Downstream systems (from CFO dashboards to AI agents) need absolute certainty. I swapped out dynamic dictionaries for rigidly typed Pydantic objects. Giving consumers a reliable, unchanging contract saves everyone headaches.

3️⃣ Modularize the messy stuff: By separating the raw text parsing from the ever-changing business rules, the output became a clean, reusable object. The downstream APIs shouldn't have to deal with the reality of the source logs.

This project was a great reminder of why I love the "translation" seat. It's not just about writing Python—it's about Stakeholder Engineering: understanding who consumes the data and building an architecture they can actually trust.

If you love nerding out over data architecture or building enterprise integrations, let's connect and chat! ☕️

#SolutionsArchitecture #DataEngineering #Python #Pydantic #StakeholderEngineering #EnterpriseTech
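The author doesn't share code, so here is a minimal sketch of the first two ideas (marker-anchored extraction plus a typed Pydantic contract); the marker strings, field names, and log format are assumptions invented for the example.

```python
from pydantic import BaseModel

class ExtractionResult(BaseModel):
    """Rigid contract for downstream consumers (field names are illustrative)."""
    job_id: str
    rows_processed: int
    status: str

def parse_log(log_text: str) -> ExtractionResult:
    # Anchor on explicit markers so a silent partial run fails loudly
    if "START" not in log_text or "FINISH" not in log_text:
        raise RuntimeError("Log is missing START/FINISH markers: aborting pipeline")

    body = log_text.split("START", 1)[1].split("FINISH", 1)[0]

    # Toy parsing of "key=value" pairs between the markers
    fields = dict(pair.split("=", 1) for pair in body.split() if "=" in pair)

    # Pydantic validates types and rejects anything that breaks the contract
    return ExtractionResult(**fields)

log = "... START job_id=nightly-42 rows_processed=1048576 status=ok FINISH ..."
print(parse_log(log))
```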
I pulled 2 million rows into Python to do a GROUP BY. My senior analyst saw it and said, "That's just a query."

That moment changed how I think about data tools forever. Most people jump to Python when they should be writing SQL. Here's the honest breakdown of what I mean:

USE SQL WHEN:
→ Your data lives in a database
→ You need to filter, join, group, aggregate
→ You're touching millions of rows
→ Someone else has to maintain your logic
→ Speed matters — let the server do the work

USE PYTHON WHEN:
→ You're building ML models, NLP, statistical tests
→ Complex reshaping that Pandas handles in 3 lines
→ Automating multi-step workflows
→ Building APIs, pipelines, or models
→ Visualizations beyond what BI tools can do

The mistake I used to make, and I see others constantly making: pulling 500K–2M rows into a DataFrame… to do something SQL does natively in seconds.

If you can't write the SQL version first, you don't fully understand what your Python code is doing.

Both tools are powerful. The skill is knowing which one to reach for.
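A small illustration of the point: the same aggregation pushed to the database versus pulled into pandas. The table and column names (and the use of SQLite as a stand-in database) are invented for the example.

```python
import pandas as pd
import sqlite3  # stand-in for any SQL database; table/column names are illustrative

conn = sqlite3.connect("sales.db")

# Let the database do the work: only the aggregated rows come back
agg = pd.read_sql_query(
    """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    """,
    conn,
)

# The anti-pattern: pull every row across the wire, then group in pandas
# df = pd.read_sql_query("SELECT * FROM orders", conn)
# agg = df.groupby("region", as_index=False)["amount"].sum()

print(agg)
```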
We built a new programming language for AI & Data - 'ThinkingLanguage' - in 5 days, capable of transferring 1 billion rows in 3 seconds.

Every data team runs the same stack: Python for glue code, SQL for transforms, Spark or dbt for scale, YAML for orchestration. Four languages, four mental models, four places for bugs. What if one language could do it all?

ThinkingLanguage (TL) is a purpose-built language for Data Engineering and AI. The pipe operator is a first-class citizen. Tables, schemas, filters, joins, and aggregations are native - not library calls.

```
let users = read_csv("users.csv")

users
  |> filter(age > 30)
  |> join(orders, on: id == user_id)
  |> aggregate(by: name, total: sum(amount))
  |> sort(total, "desc")
  |> show()
```

What's under the hood:
- Apache Arrow columnar format
- DataFusion query engine with lazy evaluation and automatic optimization
- CSV, Parquet, and PostgreSQL connectors
- 1M rows filtered + aggregated + sorted in 0.3 ms
- Written in Rust

Also includes a JIT compiler (Cranelift/LLVM), native AI/ML operations (train, predict, embed), streaming pipelines with Kafka, GPU support (CUDA, ROCm), a Python FFI bridge (run/call Python libraries), and a full ecosystem with notebooks and a package registry. Download via npx, ssh native installer, crates, or GitHub.

This is open source (Apache Licence). If you're a data engineer tired of context-switching between five tools, or a Rust developer who wants to contribute to something new - check it out. (link below)

#DataEngineering #OpenSource #Rust #Programming #ApacheArrow #ThinkingLanguage #ThinkingDBx #Data #AI #Python #DataFusion #1BRC
🚀 𝐏𝐨𝐥𝐚𝐫𝐬 𝐯𝐬 𝐏𝐚𝐧𝐝𝐚𝐬: 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 & 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐄𝐱𝐩𝐥𝐚𝐢𝐧𝐞𝐝

When it comes to data analytics in Python, Pandas has long been the go-to library. But as datasets grow larger and performance becomes critical, Polars is emerging as a powerful alternative. 💡

𝐋𝐞𝐭’𝐬 𝐛𝐫𝐞𝐚𝐤 𝐢𝐭 𝐝𝐨𝐰𝐧:

🔹 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞
Pandas is built on NumPy and primarily uses a single-threaded execution model. This makes it simple and flexible—but slower with large datasets.
Polars, on the other hand, is built on Apache Arrow and designed for modern hardware. It uses a columnar memory format and supports multi-threaded execution by default.

🔹 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞
Pandas performs well for small to medium datasets, but can struggle with speed and memory usage at scale.
Polars is optimized for high performance:
✔ Faster query execution
✔ Lower memory usage
✔ Built-in parallel processing

🔹 𝐋𝐚𝐳𝐲 𝐯𝐬 𝐄𝐚𝐠𝐞𝐫 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Pandas follows an eager execution model—operations run immediately.
Polars introduces lazy execution, where queries are optimized before execution, leading to significant performance gains.

🔹 𝐖𝐡𝐞𝐧 𝐭𝐨 𝐔𝐬𝐞 𝐖𝐡𝐚𝐭?
👉 Use Pandas for quick analysis, prototyping, and smaller datasets
👉 Use Polars for large-scale data processing, analytics pipelines, and performance-critical tasks

𝐓𝐡𝐞 𝐁𝐨𝐭𝐭𝐨𝐦 𝐋𝐢𝐧𝐞:
Polars is not just a replacement—it's a next-gen data processing engine built for speed, scalability, and efficiency. As data grows, choosing the right tool can make all the difference.

👉 Follow AVA® - An Orange Education Label for more deep dives into modern data tools and technologies.

#AVA #Python #Polars #Pandas #DataAnalytics #BigData #DataEngineering #TechTrends #Performance #ApacheArrow #Learning #OrangeEducation #Publishers #Technology #itandsoftware
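To make the eager-vs-lazy distinction concrete, here is a minimal sketch assuming a sales.csv with region and amount columns (invented for the example); the Polars method names reflect recent versions of the library, where the lazy query only runs when .collect() is called.

```python
import pandas as pd
import polars as pl

# Eager (pandas): every step runs immediately on the full DataFrame
pdf = pd.read_csv("sales.csv")
result_pd = (
    pdf[pdf["amount"] > 100]
    .groupby("region", as_index=False)["amount"]
    .sum()
)

# Lazy (polars): scan_csv builds a query plan; nothing executes until .collect(),
# so the engine can push the filter down and read only the columns it needs
result_pl = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum())
    .collect()
)

print(result_pd)
print(result_pl)
```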
Extracting structured tables from PDFs is harder than it looks.

PDF files do not store tables as structured data. Instead, they position text at specific coordinates on the page. Table extraction tools must reconstruct the structure by determining which values belong in which rows and columns. The problem becomes even harder when tables include multi-level headers, merged cells, or complex layouts.

To explore this problem, I experimented with three tools designed for PDF table extraction: LlamaParse, Marker, and Docling. Each tool takes a different approach.

Performance overview:
• Docling: Fastest local option, but struggles with complex tables
• Marker: Handles complex layouts well and runs locally, but is much slower
• LlamaParse: Most accurate on complex tables and fastest overall, but requires a cloud API

In this article, I share the code, examples, and results from testing each tool.

🚀 Full article: https://bit.ly/40SZ0CC

#PDFExtraction #Python #DataEngineering
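The article's own code isn't included here. As a small illustration of the coordinate-reconstruction problem the post describes, here is a sketch using pdfplumber, which is a different tool from the three compared above and is chosen only because its API is compact; the file name and page index are placeholders.

```python
import pdfplumber  # not one of the three tools in the post; used only to illustrate the problem

# pdfplumber reconstructs tables from character and line coordinates on the page,
# which is exactly the step that breaks down on merged cells and multi-level headers
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # list of rows, each a list of cell strings (or None)

if table:
    header, *rows = table
    print(header)
    for row in rows[:5]:
        print(row)
else:
    print("No table detected on this page")
```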