Datacmp v3.0.0 (Now Live on PyPI)

I'm excited to announce the release of Datacmp v3.0.0, a major new version of my open-source Python library for data cleaning and exploratory data analysis (EDA). Since its first release, Datacmp has surpassed 2,000 downloads on PyPI.

This update marks a huge milestone in my journey as an AI developer, transforming Datacmp from a lightweight utility into a comprehensive, production-ready framework built for data scientists, analysts, and machine learning engineers.

What's New in v3.0.0
• OOP + functional APIs with method chaining for modern workflows
• Command-line interface (CLI) for running complete data pipelines
• Intelligent data cleaning for missing values, outliers, and duplicates
• Comprehensive profiling with detailed statistics and correlation analysis
• Beautiful visualizations and interactive HTML/TXT reports
• YAML-based configuration for fully reproducible setups

Explore Datacmp
• PyPI: https://lnkd.in/diJced5z
• GitHub: https://lnkd.in/dZXJD6K6

This release reflects months of continuous improvement, testing, and learning, driven by my passion for building powerful, open-source AI tools that simplify real-world data workflows. I'm grateful for how far this project has come, and I'm even more excited for what's next, including new integrations, advanced visual analytics, and AI-powered automation features in upcoming versions.

Thank you to everyone who's supported and followed my journey so far. If you find Datacmp useful, I'd love your feedback or a ⭐️ on GitHub; it helps fuel the next evolution.

#Python #DataScience #MachineLearning #OpenSource #AI #SoftwareDevelopment #EDA #DataCleaning #DataAnalysis #PyPI #Datacmp #MoustafaMohamed #MoustafaMohamed01
Datacmp v3.0.0: Open-Source Python Library for Data Cleaning and EDA
-
🔥 “𝗧𝗮𝗹𝗸 𝗧𝗼 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗶𝗻 𝗣𝗹𝗮𝗶𝗻 𝗘𝗻𝗴𝗹𝗶𝘀𝗵 𝗧𝗲𝘅𝘁” 𝗠𝘆 𝗡𝗲𝘄 𝗔𝗜 𝗣𝗿𝗼𝗷𝗲𝗰𝘁! 💬💻

Ever wished you could 𝗾𝘂𝗲𝗿𝘆 𝗮 𝗱𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗷𝘂𝘀𝘁 𝗯𝘆 𝘁𝗮𝗹𝗸𝗶𝗻𝗴 𝘁𝗼 𝗶𝘁? No SQL. No complex syntax. Just plain English and boom 💥 instant results!

Introducing my latest project: 🚀 𝗧𝗮𝗹𝗸 𝗧𝗼 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗶𝗻 𝗣𝗹𝗮𝗶𝗻 𝗘𝗻𝗴𝗹𝗶𝘀𝗵 𝗧𝗲𝘅𝘁

Now you can simply ask: 👉 “Who has the highest marks?”
And it instantly runs:
SELECT * FROM STUDENT WHERE MARKS = (SELECT MAX(MARKS) FROM STUDENT);
then shows the real data automatically! 🤯

💡 𝗕𝘂𝗶𝗹𝘁 𝗨𝘀𝗶𝗻𝗴:
🧠 𝗟𝗮𝗻𝗴𝗖𝗵𝗮𝗶𝗻 + 𝗢𝗽𝗲𝗻𝗔𝗜 (𝗚𝗣𝗧-𝟯.𝟱-𝗧𝘂𝗿𝗯𝗼) -> natural language to SQL conversion
⚡ 𝗦𝘁𝗿𝗲𝗮𝗺𝗹𝗶𝘁 -> an interactive and clean user interface
🗄️ 𝗦𝗤𝗟𝗶𝘁𝗲 -> database storage and queries
🐍 𝗣𝘆𝘁𝗵𝗼𝗻 -> tying it all together seamlessly

🔐 𝗣𝗿𝗶𝘃𝗮𝗰𝘆 𝗙𝗶𝗿𝘀𝘁: Your database stays completely local; only the natural-language question is sent to OpenAI, not your data files.

🎥 𝗜’𝘃𝗲 𝘂𝗽𝗹𝗼𝗮𝗱𝗲𝗱 𝗮 𝗳𝘂𝗹𝗹 𝗱𝗲𝗺𝗼 𝘃𝗶𝗱𝗲𝗼 𝗯𝗲𝗹𝗼𝘄! Come see it in action; I'm sure you'll love how easy querying data becomes.

💻 𝗖𝗵𝗲𝗰𝗸 𝗼𝘂𝘁 𝘁𝗵𝗲 𝗳𝘂𝗹𝗹 𝗰𝗼𝗱𝗲 𝗼𝗻 𝗚𝗶𝘁𝗛𝘂𝗯: 👉 https://lnkd.in/ey8xMi-E

✨ Let me know your thoughts: how cool would it be if every database could talk back?

#AI #LangChain #OpenAI #Streamlit #SQL #DataScience #Python #MachineLearning #Chatbot #Innovation #ProjectShowcase #AbdullahProjects
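For a sense of how the pieces fit together, here is a minimal sketch of the NL-to-SQL flow, not the project's actual code: the table schema, database file name, and prompt wording are my own illustrative assumptions.

import sqlite3
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Illustrative schema; the real project defines its own STUDENT table.
question = "Who has the highest marks?"
prompt = (
    "Write one SQLite query for the table STUDENT(NAME TEXT, MARKS INT). "
    "Return only the SQL.\n"
    f"Question: {question}"
)
sql = llm.invoke(prompt).content.strip()

# Only the question and schema go to the API; the data itself stays local.
with sqlite3.connect("student.db") as conn:
    for row in conn.execute(sql):
        print(row)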
-
I've just published langgraph-crosschain v0.1.2 on PyPI! It's a framework that addresses a specific limitation in LangGraph: out of the box, nodes in different chains cannot communicate directly with each other.

What it enables:
- Direct peer-to-peer node communication between chains
- Shared state management across multiple agents
- Event-driven architecture with pub/sub patterns
- Thread-safe operations with comprehensive error handling
- Production-ready with extensive test coverage

Practical applications:
✅ Distributed AI agent systems
✅ Complex workflow orchestration
✅ Microservices-style AI architectures
✅ Map-reduce patterns for agent collaboration

Install: pip install langgraph-crosschain
PyPI: https://lnkd.in/g3SDuWmJ
GitHub: https://lnkd.in/gKXtk492

Feedback and contributions are welcome!

#Python #AI #MachineLearning #LangChain #OpenSource #MultiAgentSystems #ArtificialIntelligence #PyPI #SoftwareEngineering #Innovation #AgenticAI #LLM
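To make the pub/sub idea concrete, here is a generic, thread-safe event-bus sketch in plain Python. It only illustrates the pattern the package describes; it is not langgraph-crosschain's actual API, which you should take from the project's docs.

import threading
from collections import defaultdict
from typing import Any, Callable

class EventBus:
    """Minimal thread-safe pub/sub: one chain publishes, other chains' nodes subscribe."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        with self._lock:
            self._subs[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        with self._lock:
            handlers = list(self._subs[topic])  # copy so handlers may resubscribe safely
        for handler in handlers:
            handler(payload)

bus = EventBus()
bus.subscribe("chain_a.done", lambda result: print("chain B received:", result))
bus.publish("chain_a.done", {"answer": 42})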
-
🧠 Excited to share my latest open-source project: RAGenius - A Production-Ready RAG System!

After experimenting with Retrieval-Augmented Generation, I built a system that actually works in production. Here's what makes it different:
✅ Multi-format document support (PDF, Excel, JSON, DOCX, CSV)
✅ Real-time streaming responses for better UX
✅ Incremental vector database updates (no rebuilding!)
✅ REST API built with FastAPI
✅ Persistent vector storage with ChromaDB

The Tech Stack:
🐍 Python + FastAPI
🤖 Azure OpenAI (GPT-4 + Embeddings)
🗄️ ChromaDB for vector storage
🔗 LangChain for document processing

Why RAG? Traditional LLMs are limited to their training data. RAG combines LLMs with YOUR documents, reducing hallucinations and providing accurate, contextual answers based on your domain knowledge.

Key Features:
→ Upload documents via API
→ Query with streaming or basic mode
→ Smart chunking with overlap for better context
→ Async operations for scalability
→ Production-ready error handling

I've documented everything in detail on my blog, and the entire codebase is open-source on GitHub. Would love to hear your thoughts on RAG systems and how you're using them in production! 💬

#Python #MachineLearning #AI #OpenSource #FastAPI #RAG #LLM #AzureOpenAI #SoftwareEngineering #DataScience

🔗 GitHub: https://lnkd.in/gqrdK_n5
📝 Blog Post: https://lnkd.in/gH7KE4Zu
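As a taste of the persistent, incrementally updated vector store described above, here is a minimal ChromaDB sketch. It is illustrative only, not RAGenius's code, and it uses Chroma's built-in default embedder instead of Azure OpenAI so it stays self-contained; the collection name and chunk IDs are my own placeholders.

import chromadb

client = chromadb.PersistentClient(path="./vector_store")  # survives restarts
docs = client.get_or_create_collection("docs")

# Incremental update: add only the new chunks, no rebuild of the collection.
docs.add(
    ids=["doc1-chunk0", "doc1-chunk1"],
    documents=[
        "RAG pairs an LLM with your own documents.",
        "Chunks overlap so context is not cut mid-sentence.",
    ],
)

# Retrieval step: the nearest chunks become the LLM's context.
hits = docs.query(query_texts=["How does RAG reduce hallucinations?"], n_results=2)
print(hits["documents"])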
-
At TAMU Datathon 2025, I built MCP Context Engine, a FastAPI-based system that connects real-world APIs like Google Calendar and GitHub to an intelligent reasoning layer. It interprets natural-language queries such as “Am I free tomorrow afternoon?” and produces structured, AI-ready context for assistants and scheduling agents.

The goal was to reimagine how AI models access and reason over real-world data, making context retrieval modular, explainable, and language-aware. Inspired by the Model Context Protocol (MCP), this project shows how backend design and AI reasoning can intersect in powerful ways.

Code: https://lnkd.in/gtsHwahn

#TAMUDatathon #AI #SoftwareEngineering #FastAPI #Backend #Python #LLM #Hackathon #SystemDesign #FAANG #OpenSource #ContextEngine #MachineLearning
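A hypothetical sketch of the kind of endpoint described above; the route, model fields, and canned response are my illustrative assumptions, not the Datathon code.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ContextRequest(BaseModel):
    query: str    # e.g. "Am I free tomorrow afternoon?"

class ContextResponse(BaseModel):
    source: str   # which upstream connector answered (calendar, github, ...)
    context: dict # structured, AI-ready context for the assistant

@app.post("/context", response_model=ContextResponse)
def build_context(req: ContextRequest) -> ContextResponse:
    # The real version would route to Google Calendar / GitHub connectors;
    # here we just return a canned calendar snippet to show the shape.
    return ContextResponse(
        source="calendar",
        context={"query": req.query, "busy_slots": ["2025-10-12T13:00/14:00"]},
    )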
-
Series Post 7: Pydantic - Data Integrity in the World of AI

Once LLM orchestration was achieved using LangChain and LangGraph, the next big challenge was ensuring data integrity. When you rely on an LLM to generate structured data (like business rules or query parameters), you must assume the output is chaos. This is where Pydantic played a huge role and helped make my Python services enterprise-grade.

Pydantic: The Contract Enforcer ✍️

Guaranteed Schema: Pydantic models define a strict schema using Python type hints. This guarantees that any JSON or dictionary passed to my system is immediately validated, preventing garbage data from entering my pipeline.

LLM Output Control: I used Pydantic to generate a JSON Schema for the LLM to adhere to. I forced my Llama 3 agent to output JSON that strictly conformed to this schema, dramatically improving the reliability of the structured output.

Rule Engine Integrity: For my dynamic rules engine, Pydantic models defined the structure of the executable logic (e.g., IF {field} THEN {action}). This ensured the rules were always correct before they were used to generate dynamic SQL.

Inter-Service Trust: Pydantic acts as a trust layer between my chaotic data ingestion services (e.g., OCR, web scraper) and my stable core services (Java/Spring Boot), ensuring only clean data is ever transferred.

#Pydantic #Python #DataValidation #LLM #SoftwareEngineering #Architecture #PassionProject
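A minimal sketch of the "contract enforcer" idea, assuming Pydantic v2; the Rule model and the example payload are illustrative stand-ins, not the author's actual schema.

from pydantic import BaseModel, ValidationError

class Rule(BaseModel):
    field: str
    operator: str   # e.g. "==", ">", "<"
    action: str     # e.g. "flag_for_review"

# 1. Hand the JSON Schema to the LLM so it knows the exact output contract.
schema_for_prompt = Rule.model_json_schema()

# 2. Validate whatever the LLM returns before it reaches the rules engine.
raw_llm_output = '{"field": "amount", "operator": ">", "action": "flag_for_review"}'
try:
    rule = Rule.model_validate_json(raw_llm_output)
    print("Accepted rule:", rule)
except ValidationError as err:
    print("Rejected malformed LLM output:", err)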
-
💡 LeetCode Learning Log: Two Pointer Pattern in Action

🔗 Problems:
• https://lnkd.in/d3z_GFHR
• https://lnkd.in/d3p8Y5pj

Today, I explored how the 𝗧𝘄𝗼 𝗣𝗼𝗶𝗻𝘁𝗲𝗿 𝗣𝗮𝘁𝘁𝗲𝗿𝗻 can be applied to very different problems, one involving 𝗻𝘂𝗺𝗯𝗲𝗿𝘀 and the other 𝘀𝘁𝗿𝗶𝗻𝗴𝘀, both teaching key ideas for efficient data transformations.

1️⃣ Squares of a Sorted Array
• 𝗕𝗿𝘂𝘁𝗲 𝗙𝗼𝗿𝗰𝗲: Square each number, then sort the list.
  ⏱ Time: O(n log n) 💾 Space: O(n)
  Works fine, but wastes time re-sorting an already sorted list.
• 𝗢𝗽𝘁𝗶𝗺𝗮𝗹 (𝗧𝘄𝗼 𝗣𝗼𝗶𝗻𝘁𝗲𝗿𝘀): Start pointers from both ends, compare absolute values, and fill the result in reverse (full sketch below this post).
  ⏱ Time: O(n) 💾 Space: O(n)
  Clean, fast, and avoids extra computation.
  ✅ Runtime: 11 ms 💾 Memory: 14.38 MB

2️⃣ Reverse String
• Used 𝗧𝘄𝗼 𝗣𝗼𝗶𝗻𝘁𝗲𝗿𝘀 from start and end to swap characters in-place.
• Eliminated the need for extra space or Python's slicing tricks.
• ⏱ Time: O(n) 💾 Space: O(1)
• A simple problem, but great for understanding 𝗶𝗻-𝗽𝗹𝗮𝗰𝗲 𝗱𝗮𝘁𝗮 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻.

🧠 Key Learnings
• 𝗧𝘄𝗼 𝗣𝗼𝗶𝗻𝘁𝗲𝗿𝘀 = 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆. Whether working with text or numbers, it saves time and memory.
• 𝗜𝗻-𝗽𝗹𝗮𝗰𝗲 𝘂𝗽𝗱𝗮𝘁𝗲𝘀 matter when handling large data; the same concept applies in data pipelines.
• These problems build intuition for 𝗱𝗮𝘁𝗮 𝗺𝗲𝗿𝗴𝗶𝗻𝗴, 𝗰𝗹𝗲𝗮𝗻𝗶𝗻𝗴, and 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻, core skills for any 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿.
• When datasets are sorted, smart pointer movement often beats brute-force iteration.

⚙️ Data Engineering Perspective
• “Reverse String” reflects 𝗶𝗻-𝗽𝗹𝗮𝗰𝗲 𝗱𝗮𝘁𝗮 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻, useful in ETL cleaning and masking tasks.
• “Squares of a Sorted Array” mirrors 𝗺𝗲𝗿𝗴𝗲 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀 where sorted numeric data is processed in one pass.

#LeetCode #DSA #Python #TwoPointers #ProblemSolving #CodingJourney #DataEngineer #100DaysOfCode #LearningEveryday #Efficiency #ETL
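Here is the two-pointer pass for "Squares of a Sorted Array" as I would sketch it in Python; the variable names are mine and not tied to any particular submission.

def sorted_squares(nums: list[int]) -> list[int]:
    left, right = 0, len(nums) - 1
    out = [0] * len(nums)
    for i in range(len(nums) - 1, -1, -1):   # fill the largest square first
        if abs(nums[left]) > abs(nums[right]):
            out[i] = nums[left] ** 2
            left += 1
        else:
            out[i] = nums[right] ** 2
            right -= 1
    return out

print(sorted_squares([-4, -1, 0, 3, 10]))  # [0, 1, 9, 16, 100]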
-
Today in data engineering, the 'E' (Extract) phase was completed with a secure connection to the database. The focus now shifts to the 'T' (Transform) phase: cleaning, shaping, and validating the raw data using tools like Pandas 🛠️.

Key Learning: Introduction to Pandas
Pandas is the pivotal tool for data transformation in Python, with robust capabilities for streamlining the process.

The Challenge: Dealing with Messy Data
Raw data arrives with missing values, inconsistent formats (e.g., 'USA' vs. 'United States'), and duplicates, making it unsuitable for analytical insights or AI model training.

The Solution: The Power of Pandas
Pandas introduces the DataFrame, a highly efficient data structure for fast, spreadsheet-like operations through Python commands.

Today's Task: Basic Pandas Operations
Practical exercises included reading a Parquet file, pinpointing missing values, and using simple functions for row filtering and cleansing (a short sketch follows this post).

Insight from Engineering
A crucial realization: transformation extends beyond mere cleaning to enforcing data contracts. Each line of code in the 'T' phase dictates the quality and structure of the final output. Just as structure enhances usability in design, structured data fosters reliability and trust. Mastering Pandas means setting the quality benchmark for the data pipeline.

#DataEngineering #Pandas #ETL #Python #DataTransformation #TechSkills
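A minimal sketch of today's exercises; the file name and column names are illustrative assumptions, not the actual dataset.

import pandas as pd

df = pd.read_parquet("orders.parquet")

print(df.isna().sum())                        # missing values per column
df["country"] = df["country"].replace({"USA": "United States"})  # normalize format
df = df.drop_duplicates().dropna(subset=["order_id"])            # clean rows
print(df[df["amount"] > 0].head())            # simple row filtering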
-
Stop sending JSON to your LLMs and use TOON instead. Here's why:

We're all fighting the context window bottleneck, but we're still sending data in one of the most token-inefficient formats possible. Just look at the difference.

JSON format (~20 tokens):
[{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]

TOON (Token-Oriented Object Notation) (~10 tokens):
[2]{id,name}:
  1,A
  2,B

That's a 50% token savings. But the "aha!" moment isn't just about saving money. It's that you can now fit twice the data into the same context window. This means:
- Denser, more accurate RAG context
- More examples for few-shot learning
- Complex agent tool schemas that don't eat the entire prompt

We optimize retrievers and prompt chains, but we're still shipping our data to the model like it's 2005. It's late 2025; let's enhance our prompting style just like the LLMs are enhancing themselves. In an era of token-based computing, data verbosity is a bug. Fix it.

For all the Python users (a minimal usage sketch follows):
pip3 install python-toon
from toon import encode

"Happy tooning"

#GenAI #LLM #PromptEngineering #TOON #RAG #AI #Python #DataScience
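A minimal usage sketch based on the import the post gives; I'm assuming encode() accepts a plain Python object and returns the TOON string, so check the python-toon docs for the exact signature before relying on it.

from toon import encode  # from the python-toon package named above

rows = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
print(encode(rows))
# Expected shape, per the example in the post:
# [2]{id,name}:
#   1,A
#   2,B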
-
🚀 Just built a Data Agent for real-time dataset access from Data.gov!

I recently worked on a small but exciting project: a Data Agent powered by Python that connects directly to the Data.gov API to fetch, analyze, and visualize real-time datasets.

💡 Example shown: a petroleum consumption dataset, fetched using a Resource ID and visualized interactively.

🔍 Tech Stack & Flow:
- Python + Requests + Pandas
- .env configuration for secure API key management
- Modular design (data_gov_tool.py + app.py)
- Example: real-time petroleum data fetched and displayed from data.gov.in

📂 Source code: https://lnkd.in/dq7ihs54

This project shows how easy it is to connect open government data with AI and automation, unlocking insights for developers, researchers, and analysts.

#OpenData #Python #DataScience #APIs #DataAnalytics #AI #Innovation #DataEngineering #GovTech
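To give a flavor of the fetch step, here is a hypothetical sketch: the resource ID is a placeholder, the API key is read from the environment (loaded from .env in the real project), and the data.gov.in endpoint shape should be verified against its docs.

import os

import pandas as pd
import requests

API_KEY = os.environ["DATA_GOV_API_KEY"]   # kept out of source via .env
RESOURCE_ID = "your-resource-id-here"      # placeholder, not the project's value

resp = requests.get(
    f"https://api.data.gov.in/resource/{RESOURCE_ID}",
    params={"api-key": API_KEY, "format": "json", "limit": 100},
    timeout=30,
)
resp.raise_for_status()

df = pd.DataFrame(resp.json().get("records", []))
print(df.head())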