🎉 New DPK Release: Version 1.1.7

We're excited to share the latest release of Data Prep Kit, packed with fresh transform capabilities, performance boosts, and expanded compatibility. Here's a look at what's new in v1.1.7:

⚙️ Enhancements
- Python 3.13 Compatibility - Expanded version compatibility to support Python 3.13.
- Faster Installation with uv - Migrated the repo to use uv, significantly speeding up environment setup and dependency installation.
- Rich Logging - A new Rich-based log handler offers cleaner, colorized, and more structured console output.

🔁 Transform Updates
- Folder-to-Parquet Transform - A brand new transform that converts an entire folder of files into a unified Parquet dataset, making it easier to batch-process large document collections.
- Text Encoder Upgrade - The Text Encoder now uses LanceDB for improved vector storage and retrieval performance.
- Spark Support for docling2parquet and doc_quality - Both transforms now support Spark execution, enabling scalable distributed processing.

📄 Explore the full release notes: 👉 https://lnkd.in/eZufxzv4
⭐ Support the project by starring the repo and following our updates!

#DataPrepKit #OpenSource #Python #MLOps #RAG #LLM #DataEngineering #AItools
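For readers who haven't used Rich-style logging before, here is a minimal, generic sketch built on Rich's own `RichHandler`. It illustrates the kind of colorized, structured console output involved; it is not DPK's actual handler.

```python
# Generic Rich logging setup (illustrative only, not DPK's handler).
import logging
from rich.logging import RichHandler

logging.basicConfig(
    level="INFO",
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()],  # colorized levels, timestamps, and tracebacks
)

logging.getLogger("demo").info("Hello from a Rich-formatted logger")
```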
More Relevant Posts
-
Today I built and deployed a basic ML API using Python and FastAPI. The focus was not just training a model, but understanding how ML works inside real backend systems.

I implemented:
- Request–response flow
- Input validation
- Model loading at startup
- Error handling
- SQLite database logging
- Clean architecture (API → Service → DB → Model)
- Deployment to a public server

This helped me understand that ML in production is more about system design and integration than just model accuracy.

You can check out the project here: https://lnkd.in/gn_S46VY

Small step, but meaningful progress.

#MachineLearning #Backend #FastAPI #LearningInPublic
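A minimal sketch of the pattern the post describes, assuming a scikit-learn-style model saved as "model.joblib"; the file names and table schema are placeholders, not the author's actual code.

```python
# Sketch: load the model once at startup, validate input with Pydantic,
# and log each prediction to SQLite. All names here are illustrative.
import sqlite3
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup instead of on every request.
    state["model"] = joblib.load("model.joblib")
    with sqlite3.connect("predictions.db") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS logs (input TEXT, output TEXT)")
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    features: list[float]  # input validation happens at the boundary

@app.post("/predict")
def predict(req: PredictRequest):
    model = state.get("model")
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    prediction = float(model.predict([req.features])[0])  # assumes numeric output
    with sqlite3.connect("predictions.db") as conn:  # log the call
        conn.execute("INSERT INTO logs VALUES (?, ?)",
                     (str(req.features), str(prediction)))
    return {"prediction": prediction}
```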
-
🚀 Day 5/100 — Working with Persistent Storage 🧠

“Persistence transforms execution into continuity.” Systems become meaningful when they retain and retrieve information reliably.

Today, I learned how Python interacts with files to store and retrieve persistent data. ⚙️

🔧 Today’s focus areas:
📂 File Reading — Accessing stored data
📝 File Writing — Persisting new information
🔄 File Modes — Managing read and write operations
🎯 Data Persistence — Ensuring continuity across executions

The objective was to enable programs to maintain state beyond runtime.

✅ Day 5 complete: Persistent data handling established.
▶️ Day 6: Strengthening reliability through exception handling.

Step by step. The system evolves. 🏗️

#Python #BackendDevelopment #100DaysOfCode #SoftwareEngineering
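A small illustration of the file modes covered above; the file name and keys are made up for the example.

```python
# "w" creates or overwrites, "a" appends, "r" reads; the context manager
# guarantees the file is closed, so state survives across runs.
with open("state.txt", "w") as f:   # write: persist fresh state
    f.write("run_count=1\n")

with open("state.txt", "a") as f:   # append: add new information
    f.write("last_status=ok\n")

with open("state.txt", "r") as f:   # read: retrieve stored data
    for line in f:
        key, value = line.strip().split("=")
        print(key, value)
```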
-
Polars is quietly becoming one of the most exciting tools in the modern Python data stack. Most of us have hit the limits of traditional DataFrame workflows: slow group‑bys, memory issues with medium‑large datasets, and complex pipelines that are hard to optimize. Polars tackles all of that head‑on with a fresh design. Docs: https://docs.pola.rs/
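A quick taste of the lazy, optimizer-driven style Polars is known for; "sales.csv" and the column names are hypothetical.

```python
import polars as pl

result = (
    pl.scan_csv("sales.csv")                  # lazy: nothing is read yet
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()                                # optimizer runs the whole plan at once
)
print(result)
```

Because the pipeline is lazy, Polars can push the filter down and parallelize the group-by before any data is materialized, which is where much of the speedup over eager DataFrame workflows comes from.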
-
🔥 Day 4 – Pandas Selection & Production-Style Filtering

Today I focused on strengthening my data selection and filtering skills using Pandas — but doing it the right way. Instead of just filtering rows, I practiced production-style defensive programming.

Here’s what I worked on:
✅ Column & row selection using .loc and .iloc
✅ Boolean filtering with multiple conditions
✅ Cleaning messy CSV column names
✅ Safe numeric conversion using pd.to_numeric()
✅ Writing a custom function to parse "HH:MM" delay values into proper Timedelta objects
✅ Handling invalid values using pd.NaT
✅ Preventing runtime errors with defensive filtering logic

Built a workflow that:
• Filters orders with Miles ≤ 30
• Converts delay strings into real time objects
• Filters delays ≤ 30 minutes
• Ensures no invalid comparisons occur

Real-world data is messy. Learning how to clean, validate, and safely filter it is what turns simple analysis into production-ready logic.

📂 GitHub Repository: https://lnkd.in/gNWeQ5KE

On to Day 5 🚀

#Python #Pandas #DataEngineering #Analytics #LearningInPublic #100DaysOfCode
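A sketch of the defensive parsing and filtering steps listed above; the column names ("Miles", "Delay") and file name are assumptions based on the post, not the repository's actual code.

```python
import pandas as pd

df = pd.read_csv("orders.csv")
df.columns = df.columns.str.strip()                        # clean messy headers

df["Miles"] = pd.to_numeric(df["Miles"], errors="coerce")  # invalid -> NaN

def parse_delay(value):
    """Turn an 'HH:MM' string into a Timedelta, or NaT if malformed."""
    try:
        hours, minutes = str(value).split(":")
        return pd.Timedelta(hours=int(hours), minutes=int(minutes))
    except (ValueError, AttributeError):
        return pd.NaT

df["Delay"] = df["Delay"].apply(parse_delay)

# Defensive filter: NaN/NaT rows simply evaluate False instead of raising.
mask = (df["Miles"] <= 30) & (df["Delay"] <= pd.Timedelta(minutes=30))
filtered = df[mask]
```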
-
Real-world data is messy. Your models shouldn't be.

CSV files, external APIs, user input - they all deliver junk like "N/A", "unknown", or empty strings. Technically valid. Logically useless.

Instead of scattering cleanup logic across your codebase, normalize once - at the model level. Downstream code stops worrying about edge cases. Business logic gets simpler.

That's the real value of structured models: a hard boundary between messy input and reliable internal state.

This pattern - and dozens like it - is covered in Practical Pydantic. https://lnkd.in/eaASBPzP

Clean code starts with clean data.

#Python #Pydantic #CleanData
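A minimal sketch of the "normalize once at the model boundary" idea, using a Pydantic v2 before-validator; the field names and junk values are illustrative, not taken from the book.

```python
from pydantic import BaseModel, field_validator

JUNK = {"", "n/a", "na", "unknown", "null", "none"}

class Customer(BaseModel):
    name: str
    phone: str | None = None

    @field_validator("phone", mode="before")
    @classmethod
    def junk_to_none(cls, v):
        # Map "technically valid, logically useless" strings to a real None.
        if isinstance(v, str) and v.strip().lower() in JUNK:
            return None
        return v

print(Customer(name="Ada", phone="N/A"))  # phone=None everywhere downstream
```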
-
Unpopular opinion: Is Jupyter *really* the best tool for *everything* in your data science workflow? 🤔

While notebooks are great for exploration, let's talk about building robust, maintainable projects. I'm advocating for a move towards:

* Modular Code (.py files): For better organization and reusability.
* Git Versioning: Because "final_version_v2_FINAL.ipynb" gives me nightmares.
* Unit Testing: Catching bugs before they become full-blown crises.

Are we over-relying on notebooks? What are your thoughts on moving towards more structured approaches in data science? Share your experiences in the comments! 👇

#DataScience #MachineLearning #Python #SoftwareEngineering #CodeQuality
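As a hypothetical illustration of the structure being advocated: logic pulled out of a notebook cell into a plain module, with a pytest test alongside it. The function and file names are made up.

```python
# src/cleaning.py
def drop_outliers(values, limit):
    """Pure function extracted from a notebook cell so it can be tested."""
    return [v for v in values if abs(v) <= limit]


# tests/test_cleaning.py (run with: pytest)
def test_drop_outliers():
    assert drop_outliers([1, 100, -3], limit=10) == [1, -3]
```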
-
Built a real-time network traffic dashboard called NetAnlyzer pro.

The dashboard monitors live internet traffic on a network and breaks it down visually — showing what type of data is moving, which devices are the most active, how traffic behaves over time, and automatically flagging anything that looks suspicious or unusual. It's essentially a live window into what's happening inside a network at any given second. The kind of tool that data and security teams use daily to keep systems running clean.

Tools used and what each one did:
🐍 Python — the core language that runs everything
🐼 Pandas — organised and processed the live network data into clean tables
📊 Plotly — turned that data into the interactive charts and graphs
⚡ Dash — built the live web dashboard that updates every second
🖥️ psutil — pulled real-time network stats directly from the system

Still learning.

#DataAnalytics #Python #NetworkSecurity
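A minimal sketch of the live-update loop such a dashboard needs (not the NetAnlyzer pro source): psutil samples the system's network counters and a Dash Interval refreshes the chart every second.

```python
import psutil
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
samples = []  # bytes received, one sample per tick

app.layout = html.Div([
    dcc.Graph(id="traffic"),
    dcc.Interval(id="tick", interval=1000),  # fire every second
])

@app.callback(Output("traffic", "figure"), Input("tick", "n_intervals"))
def update(_):
    counters = psutil.net_io_counters()      # real-time stats from the system
    samples.append(counters.bytes_recv)
    return go.Figure(go.Scatter(y=samples, mode="lines", name="bytes received"))

if __name__ == "__main__":
    app.run(debug=True)
```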
-
Knwler is now a proper Python package and available via pipx. No more cloning the repo to get started: you can generate documents with just two CLI commands.

In addition:
- Complete refactor: the monolithic script became a well-structured package with a clean CLI, making it easier to extend and integrate.
- Graph database integrations: import scripts for Neo4j, SurrealDB, and HelixDB are now included out of the box. Your extracted graph can land directly in your database of choice.
- Stability fixes: template rendering issues resolved, packaging corrected to ensure all assets ship with the wheel.

If you're working with unstructured text and want to turn it into structured knowledge — entities, relationships, communities — knwler does that in a few lines of Python.

https://knwler.com

Next, I will create exports for RDF (Neptune and Qlever).

#knowledgegraph #graphdb
-
HoloViz MCP now ships a CLI.

The same tools that power AI assistants through the Model Context Protocol — semantic documentation search, component introspection, best-practice skills — are now available directly in your terminal.

The namespaces mirror Python imports: `pn`, `hv`, `hvplot`. If you know `import panel as pn`, you already know the CLI.

Every command supports three output formats:
- `--output pretty` — Rich tables for terminal use (default)
- `--output markdown` — for piping into LLMs or documentation
- `--output json` — for scripting and automation

$ pip install holoviz-mcp