Emerging Open Source Database Technologies

Summary

Emerging open source database technologies are transforming how companies manage, access, and scale their data by offering flexible, community-driven alternatives to traditional proprietary systems. These tools span file and table formats, platforms, and features that support advanced workloads such as AI, and they give developers greater freedom and interoperability.

  • Explore modern options: Consider open source database solutions such as PostgreSQL, DuckDB, and DocumentDB to reduce costs and avoid vendor lock-in.
  • Build for interoperability: Choose tools and formats that integrate easily with other platforms, ensuring your data ecosystem remains adaptable as new technologies emerge.
  • Invest in developer flexibility: Adopt features like forkable infrastructure, open APIs, and semantic search capabilities to empower your team with safer testing, smarter data access, and improved workflows.
  • Last week, I came across two new table format projects — and it made me pause. So I went back to revisit the “open data stack” diagram I had once built live with another OSS founder, to see how things have evolved… and what’s still painful. Here’s what it looks like today 👇

    🗂️ File formats: We often talk about Parquet and ORC as the backbone of the modern data lake. But the reality is that many teams still ingest and store JSON, CSVs, and other semi-structured formats. And now, with AI workloads rising, we’re seeing new columnar and hybrid formats like Lance, Nimble, and Vortex — built for wide tables, random access, and unstructured data that Parquet can’t serve well.

    📦 Table formats: Beyond the “big three” — Apache Hudi, Apache Iceberg, Delta Lake — we’re seeing experimentation again. Paimon brings an LSM-style storage layout; Lance blends blobs and vectors. New entrants like Microsoft’s Amudai (already storing exabytes internally) and Capital One’s IndexTables, focused on search and indexing, validate that there’s still a lot of ground to cover here.

    🔄 Interoperability: This is where the real work lies. Unification to “one format to rule them all” is a pipe dream — but interoperability is achievable. Projects like Apache XTable (Incubating) and Delta UniForm help bridge open formats and sync them into commercial engines (Onehouse, Databricks, EMR) and closed warehouses (Snowflake, Redshift, BigQuery). Let’s not forget: 70–80% of enterprise data still sits in closed formats today.

    🧭 Open metastores: We’re finally seeing viable Hive replacements — Polaris, Unity Catalog OSS, Gravitino, LakeKeeper. But there’s an important distinction: open APIs (e.g. Iceberg REST) ≠ open metastore servers (operational catalogs). Most projects/vendors are building their own APIs to cover gaps, while exposing their tables as Iceberg using tools like XTable. We’re heading towards a messy N×M world of catalogs vs. formats.

    🚧 Open problems:
    1. Many Iceberg REST APIs are format-specific; we need multi-format APIs to future-proof interoperability.
    2. Even beyond table formats, catalog data (policies, permissions, relationships) is still stored in proprietary databases — we need at least an open interchange format here.
    3. The ecosystem still lacks vendor-neutral, meritocratic bodies like the IETF or the JCP that can steward truly open data standards.

    So… what’s the takeaway?
    👉 Unification is a myth.
    👉 Interoperability is the path forward.
    👉 OSS and users alike have their work cut out — to build a flexible, sustainable data ecosystem that allows innovation without fragmentation.

    #OpenData #DataLakehouse #ApacheHudi #XTable #Iceberg #DeltaLake #DataEngineering #OSS #Interoperability #TableFormats #Metastore #Data #BigData
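The file-format point above is easiest to see in code. Below is a minimal sketch using plain pyarrow and Parquet (none of the newer formats mentioned): column pruning is what columnar formats are built for, while point lookups bottom out at row-group granularity, which is the gap formats like Lance target. The file name and sizes are made up.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small two-column table split into 10 row groups.
table = pa.table({"id": list(range(10_000)), "payload": ["x" * 100] * 10_000})
pq.write_table(table, "events.parquet", row_group_size=1_000)

# Column pruning: read only "id", skipping every "payload" byte on disk.
ids = pq.read_table("events.parquet", columns=["id"])

# The closest Parquet gets to random access is fetching a whole row group:
# to read row 4_321 you still decode its entire 1_000-row group.
pf = pq.ParquetFile("events.parquet")
group = pf.read_row_group(4)  # rows 4_000..4_999

print(pf.num_row_groups, ids.num_rows, group.num_rows)
```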

  • Pedram Navid · Education @ Anthropic

    Open Source is Eating the Data Stack. What's Replacing Microsoft & Informatica Tools?

    I've been reading a great discussion about replacing traditional proprietary data tools with open-source alternatives. Companies are increasingly worried about vendor lock-in, rising costs, and scalability limitations with tools like SQL Server, SSIS, and Power BI. The consensus is clear: open source is winning in modern data engineering.

    💡 What's particularly interesting is the emerging standard stack that data teams are gravitating toward:
    • PostgreSQL or DuckDB for warehousing
    • dbt or SQLMesh for transformations
    • Dagster or Airflow for orchestration
    • Superset, Metabase, or Lightdash for visualization
    • Airbyte or dlt for ingestion

    As one data engineer noted, "Your best hedge against vendor lock-in is having a warehouse and a business-facing data model worked out. It's hard work but keeping that layer allows you to change tools, mix tools, lower maintenance by implementing business logic in a sharable way."

    I see this shift every day. Teams want the flexibility to choose best-of-breed tools while maintaining unified control and visibility across their entire data platform. That's exactly why you should be building your data platform on top of tooling that integrates with your favorite tools rather than trying to replace them. Vertical integration sounds great, if you enjoy vendor lock-in, slow velocity, and rising costs.

    Python-based, code-first approaches are replacing visual drag-and-drop ETL tools. We all know SSIS is horrible to debug, slow, and outdated. The modern data engineer wants software engineering practices like version control, testing, and modularity. The real value isn't just cost savings - it's improved developer experience, better reliability, and the freedom to adapt as technology evolves.

    For those considering this transition, start small. Replace one component at a time and build your skills. Remember that open source requires investment in engineering capabilities - but that investment pays dividends in flexibility and innovation.

    Where do you stand on the proprietary vs. open source debate? And if you've made the switch, what benefits have you seen?

    #DataEngineering #OpenSource #ModernDataStack #Dagster #dbt #DataOrchestration #DataMesh
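As a concrete taste of the stack above, here is a minimal sketch of the warehouse layer using DuckDB's Python API. It assumes a local orders.csv with order_date and amount columns; the file and table names are illustrative.

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # a single-file local warehouse

# Ingestion: load a raw CSV straight into a warehouse table.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transformation: a business-facing model, the layer the quoted engineer
# recommends owning regardless of which tools sit around it.
con.execute("""
    CREATE OR REPLACE VIEW daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date
""")

print(con.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall())
```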

  • Anil Inamdar · Executive Data Services Leader Specialized in Data Strategy, Operations, & Digital Transformations

    🧠 Postgres + Vectors: Building the Brain Inside Your Database

    Postgres isn’t just surviving the AI wave — it’s leading it. For 30+ years, Postgres has outlived every trend:
    🔢 Relational
    🔄 NoSQL
    ☁️ Cloud
    🤖 And now… AI-native data architectures

    Today, the magic comes from pgvector, a lightweight yet powerful extension that lets Postgres store and query vector embeddings — mathematical representations of meaning.

    🧩 Why vector embeddings matter: embeddings turn your data into semantic fingerprints — numbers that capture context, not just text. 📌 Instead of searching exact words, you search similar meaning. Try asking Postgres: “🔍 Find tickets similar to this customer complaint.” No keyword hacks. No Boolean gymnastics. Just pure semantic similarity.

    Behind the scenes, Postgres compares vectors and returns conceptually related matches — powering:
    ✨ Recommendation engines
    ✨ Semantic search
    ✨ AI copilots
    ✨ RAG (Retrieval-Augmented Generation) pipelines
    ✨ Intelligent CRM + support workflows
    All using plain SQL.

    🧠 Your database just got a brain. The best part? No separate vector DB. No new infra. No integration glue. Your existing Postgres cluster becomes the context engine of your AI stack.

    For startups → 🚀 Huge cost and ops savings
    For enterprises → 🏗️ Architectural simplicity and compliance confidence
    For the Postgres community → 💙 Proof that open source doesn’t follow trends… it shapes them

    🌐 The database that powered the web is now powering intelligence. If you’re building AI-native apps, Postgres + pgvector is no longer optional — it’s foundational.

    🔖 #Postgres #pgvector #AI #SemanticSearch #RAG #OpenSource #Database
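A minimal sketch of the pattern the post describes, using the psycopg driver and the pgvector Python helper. The DSN is a placeholder, and the 3-dimensional toy embeddings stand in for real model outputs with hundreds of dimensions.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app user=app")  # hypothetical DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teaches the driver about the vector type

conn.execute(
    "CREATE TABLE IF NOT EXISTS tickets "
    "(id bigserial PRIMARY KEY, body text, embedding vector(3))"
)
conn.execute(
    "INSERT INTO tickets (body, embedding) VALUES (%s, %s), (%s, %s)",
    ("refund not processed", np.array([0.1, 0.9, 0.0]),
     "app crashes on login", np.array([0.8, 0.1, 0.2])),
)

# "Find tickets similar to this complaint": nearest neighbors by
# cosine distance (pgvector's <=> operator), no keywords involved.
query = np.array([0.2, 0.8, 0.1])  # embedding of the incoming complaint
rows = conn.execute(
    "SELECT body FROM tickets ORDER BY embedding <=> %s LIMIT 5", (query,)
).fetchall()
print(rows)
```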

  • Swapnil Bhartiya · Chief Executive Officer @ TFiR | The Agentic Enterprise Show | Leading Next-Gen Media Initiatives

    Microsoft just moved #DocumentDB to The Linux Foundation — aiming to make document databases an open standard. I sat down with Kirill Gavrylyuk, VP of Azure Cosmos DB at Microsoft, to unpack the why and the what-next.

    His take was clear: “Document databases are critical for AI apps,” and the industry needs a “vendor-neutral, open source” path — hence DocumentDB at LF. He also revealed that Amazon Web Services (AWS) is joining as a co-maintainer, with a steering committee that already includes YugabyteDB, Rippling, and AB InBev.

    Two quotes that jumped out:
    — “We don’t believe one vendor should control the standard. The core goal is developer freedom.” — Kirill Gavrylyuk
    — “It will stay pure Postgres and MongoDB-compatible — those principles are in the charter.”

    Roadmap: richer MongoDB compatibility, a Kubernetes operator for easy deploys, scale-out and HA, and AI-driven capabilities (including broader vector indexing). Also notable: Microsoft uses DocumentDB internally for Cosmos DB vCore and will keep contributing upstream.

    https://lnkd.in/evTtZmzG

    #OpenSource #LinuxFoundation #Postgres #DocumentDB #MongoDB #AI #CloudNative #DeveloperExperience #Microsoft #CosmosDB
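Because the charter pins MongoDB compatibility, existing MongoDB drivers are the expected client path. A minimal sketch with pymongo; the endpoint is a placeholder for wherever a DocumentDB (or any MongoDB-compatible) server runs.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical endpoint
db = client["support"]

# Insert a schemaless document, exactly as you would against MongoDB.
db.tickets.insert_one({"customer": "acme", "status": "open",
                       "tags": ["billing", "urgent"]})

# Standard MongoDB query syntax, no vendor-specific API.
for doc in db.tickets.find({"tags": "billing"}):
    print(doc["customer"], doc["status"])
```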

  • 🐯 Michael Freedman · Tiger Data Cofounder & CTO | Princeton CS Professor

    Forkable infrastructure is emerging as the next primitive in data infrastructure. We’re seeing more and more customers ask for more than snapshots or backups. They want true branches of their database — forks they can spin up instantly, test safely, and (sometimes) merge back.

    Supabase and Neon are both pushing in this direction, but with different philosophies:
    ▪️ Supabase (native migrations): tightly integrated with GitHub and CLI workflows. Open a PR → spin up a branch → run SQL migrations → merge applies them to production.
    ▪️ Neon (ORM-powered migrations): instant copy-on-write forks, but migrations are left to your ORM or toolchain (Prisma, Drizzle, Flyway, etc.).

    The nuance: not all schema changes are equal.
    — Adding tables or columns touches application code: natural for an ORM to own.
    — Adding an index or constraint rarely requires code changes: you want to test these directly in the DB, often for performance reasons, before committing.

    AI coding agents may push this divide further. If your program already uses an ORM, an agent will almost certainly stay within that framework: models + migrations together. That works for schema tied to code. But it doesn’t address database-only optimizations, the kind of changes developers want to experiment with directly inside a fork. Forkable infra sits right at this intersection: a tool for both human creativity and agent-driven automation.

    I’ve been thinking a lot about developer workflows for data infrastructure. Should forkable infra be opinionated about migrations, or should it stay raw and flexible, letting humans and agents layer their own workflows on top? Or perhaps somehow both? Curious to hear your thoughts. What do you think?

    #ForkableInfra #DataInfrastructure #DatabaseBranching #Postgres #DeveloperExperience #AICodingAgents #DataEngineering
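A minimal sketch of the database-only workflow described above: test an index inside a fork before promoting it. Branch creation itself is vendor-specific (Supabase, Neon, etc.), so this assumes you already have the fork's connection string; the table and query are illustrative.

```python
import psycopg

FORK_DSN = "postgresql://app:secret@fork-host:5432/app"  # hypothetical fork

with psycopg.connect(FORK_DSN) as conn:
    q = "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42"

    # Measure the query plan before the change...
    before = conn.execute(q).fetchall()

    # ...apply a database-only change: no application code or ORM
    # migration involved...
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # ...and measure again on the same real data.
    after = conn.execute(q).fetchall()

# If the plan improves, promote the same DDL to production;
# if not, simply throw the fork away.
print(*before, "---", *after, sep="\n")
```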

  • Dipankar Mazumdar · Director, Data/AI @Cloudera | Apache Iceberg, Hudi Contributor | Author of “Engineering Lakehouses”

    Open source projects on my list in the data space!

    Open source projects have gained remarkable prominence in the data engineering/infra space over the past year — not just in user adoption, but also in vendor adoption. I consider myself very lucky to have had the opportunity to work dedicatedly with some of these projects in the last couple of years. Some of them are going to have a significant impact on the overall industry this year too! Here is my list for 2025 (in no particular order):

    ✅ Apache Arrow: I started working with Arrow ~3.5 years back. Part of my role was to talk to engineers about Arrow and what it does. Today, most people know the impact it has made. It started as an in-memory columnar format, and today it is a software development platform powering many critical tools.

    ✅ Apache DataFusion: DataFusion is an extensible query engine that uses Arrow as its in-memory format. It features a full query planner; a columnar, streaming, multi-threaded, vectorized execution engine; and partitioned data sources. It has also grown into sub-projects such as Ballista and Comet.

    ✅ SlateDB: SlateDB is an embedded storage engine that uses object storage to offer high durability and straightforward replication. Unlike traditional LSM-tree storage engines, it writes all data directly to object storage, eliminating the need for local disks and their caveats.

    ✅ LanceDB: LanceDB is a vector database designed to store, manage, query, and retrieve embeddings on large-scale multi-modal data. The unique point with Lance is that it supports storage of the actual data itself alongside the embeddings, which brings some advantages.

    ✅ Feldera: Another project that has my interest. Feldera is a high-performance SQL query engine designed for incremental computation. Specifically, it works well with unbounded streams of live and historical data.

    ✅ Daft: Daft has been my go-to Python compute engine for the past year, and I have only good things to say. Daft is built for multimodal workloads as well and offers both Python and SQL APIs. Start with something small on your local system and easily scale to petabyte-scale!

    ✅ Velox: Velox is a unified execution engine that provides reusable, extensible, high-performance data processing components for building execution engines. It brings runtime optimizations to the masses.

    ✅ Apache Hudi, Apache Iceberg: These lakehouse platforms/formats need no introduction. They have been central to the way we deal with data architectures today and will continue to innovate with critical features.

    ✅ Apache XTable (Incubating): Interoperability between lakehouse formats has been crucial throughout. Projects like XTable have shown us efficient ways to do it. Adoption by enterprise software such as Microsoft Fabric to interoperate with Snowflake proves its robust capability. A detailed read coming soon!

    #dataengineering #softwareengineering
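Two projects from the list compose naturally: Arrow as the columnar data layer and DataFusion as the SQL engine over it. A minimal sketch using the datafusion Python bindings; the data, file, and column names are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from datafusion import SessionContext

# Arrow: build a columnar table and persist it as Parquet.
table = pa.table({"city": ["Oslo", "Pune", "Lima"],
                  "temp_c": [3.5, 29.0, 18.2]})
pq.write_table(table, "weather.parquet")

# DataFusion: plan and run SQL over it; results come back as Arrow batches.
ctx = SessionContext()
ctx.register_parquet("weather", "weather.parquet")
df = ctx.sql(
    "SELECT city, temp_c FROM weather WHERE temp_c > 10 ORDER BY temp_c DESC"
)
for batch in df.collect():
    print(batch.to_pydict())
```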

  • Sanjeev Mohan · Principal, SanjMo & Former Gartner Research VP, Data & Analytics | Author | Podcast Host | Medium Blogger

    After a massive labor of love, I'm thrilled to share my new blog post on the vast and ever-evolving world of data stores! https://lnkd.in/g725V_d4

    The database landscape has undergone a "Cambrian explosion" over the last two decades. We've moved far beyond the traditional monolithic RDBMS, and today's choices include everything from in-memory data grids to specialized graph, time-series, and vector databases. This piece is a comprehensive guide to navigating this complex ecosystem. I cover:
    ✅ The core types of data (transactions, interactions, observations)
    ✅ A detailed taxonomy of data stores
    ✅ A look at how AI and modern workloads are driving the next wave of innovation

    Whether you're an architect, a developer, or a data leader, this blog will help you understand the nuances and make the best choice for your next project. Please let me know what you think!

    A quick note on the products mentioned: the examples in this guide are representative, not comprehensive. The data store landscape is vast and constantly changing, so always refer to the latest product documentation for your specific use case.

    #Data #Databases #AI #Technology #DataEngineering #Cloud #NoSQL #RDBMS #DBMS #GraphDB

  • Shivji kumar Jha · Staff Engineer (Distributed data system internals) at Nutanix

    Yesterday, I had the chance to present and learn across three sessions, each exploring a different layer of building analytics on modern data architectures.

    My Topic: Hacking Iceberg on Your Existing DBs

    Summary:
    1. Iceberg: the emerging common table format for #OLTP, #OLAP, and #streaming systems.
    2. How the PostgreSQL implementation is being designed.
    3. How the ClickHouse OSS implementation is being designed.

    Context: Iceberg is seeing growing adoption across traditional data engines. ClickHouse, traditionally fast for OLAP, is now integrating Iceberg to serve lakehouse-style workloads.

    Highlights of ClickHouse capabilities:
    1. Natively queries 70+ file formats, including Parquet and Arrow.
    2. Vectorized execution, fast columnar format, window functions, rich SQL.
    3. Foreign file querying with zero ingestion — just point and query.
    4. S3, GCS, Hive, Kafka, and REST catalog integrations.
    5. Ongoing support for Iceberg-specific features: schema evolution, partition pruning, time travel (experimental).

    Technical internals:
    1. The Parquet reader supports parallel row-group reads, prefetching, and concurrent file processing.
    2. Modular engine design (IStorage.h, IDatabase.h) supports custom data sources (like Iceberg via StorageObjectStorage).

    Future roadmap & tools:
    1. Project Antalya by Altinity: makes querying Iceberg storage from ClickHouse horizontally scalable with swarms.
    2. The PostgreSQL ecosystem is also catching up: pg_duckdb, pg_mooncake, pg_analytics, and Hydra — all targeting columnar/vectorized OLAP workloads on PG.
    3. Use of planner and executor hooks for query stealing, FDW integrations, and TAM-level extensions.

    #ClickHouse #ApacheIceberg #LakehouseArchitecture #RealTimeAnalytics #OLAP #OpenSourceDatabases #DataLakehouse #VectorizedExecution #Parquet #PostgreSQL #pg_duckdb #pg_mooncake #pg_analytics #DuckDB #DataEngineering #ProjectAntalya #OSS #CatalogIntegration #QueryEngine
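A hedged sketch of the "just point and query" model from Python, via the clickhouse-connect driver. The bucket URLs are placeholders, and the s3()/iceberg() table-function signatures vary across ClickHouse versions (Iceberg support is still maturing), so treat the SQL as illustrative.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Foreign file querying: no ingestion step, ClickHouse reads Parquet in place.
parquet_rows = client.query(
    "SELECT count() FROM s3('https://bucket.example.com/events/*.parquet', 'Parquet')"
).result_rows

# Same idea against an Iceberg table: the table function resolves metadata,
# prunes partitions, and scans only the needed data files.
iceberg_rows = client.query(
    "SELECT count() FROM iceberg('https://bucket.example.com/warehouse/events')"
).result_rows

print(parquet_rows, iceberg_rows)
```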

  • Gleb Mezhanskiy · CEO @ Datafold – AI automation for data engineering

    DuckDB continues to eat the data world, one iceberg at a time.

    Last year, Databricks acquired Tabular, the team behind Apache Iceberg, and Snowflake announced Polaris, its own open source Iceberg catalog. At that point, Iceberg seemed to have cemented itself as the industry-standard open table format for data lakes. It seemed robust, transactional, and fast enough — what else did you need?

    Fast forward a year: DuckDB announces DuckLake and boldly claims it's superior to Iceberg in several ways, most importantly:
    1. More robust support for transactions, including frequent updates to data and multi-table transactions.
    2. Speed (individual query latency as well as throughput).
    Both issues can become significant pains for Iceberg users at scale.

    The most interesting thing is that DuckLake's alleged breakthrough comes not from making the table format more complicated (as with the Hive Metastore > Delta > Iceberg evolution) but simpler! Unlike Delta and Iceberg, which rely on metadata files stored alongside actual data on object storage (and thus suffer from the limitations and complexities of dealing with files), DuckLake uses a relational database (PostgreSQL, DuckDB, etc.) as its primary metadata store. This is precisely what the good old Hive Metastore used to do, but unlike Hive, DuckLake is way smarter about metadata management and implements features and transaction guarantees that Hive users could only dream of.

    So, should you switch to DuckLake now? In the foreseeable future, the main limiting factor for adoption will be the interoperability of the DuckLake format with the broader data ecosystem: as of today, only DuckDB can work with DuckLake, which is great since you can run DuckDB anywhere! The flip side of that coin is that DuckDB (at least the OSS engine) is not yet widely supported by other data tools. You can (sort of) run DuckDB with dbt Core, but you wouldn't be able to use most BI tools.

    Two vectors would enable that:
    1. Other server data engines (e.g. Trino) add support for DuckLake.
    2. Data tools (orchestrators, BI, etc.) add support for DuckDB.
    I expect (1) to happen sooner than (2) because the latter requires vendors to manage compute (since DuckDB is in-process), whereas with all other engines they would connect to a server and send SQL to execute. MotherDuck helps bridge that gap by offering a hosted DuckDB.

    In any case, DuckLake is a fantastic addition to the data ecosystem. Huge kudos to Mark Raasveldt and Hannes Mühleisen!
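The design difference is visible in DuckLake's attach flow. A minimal sketch following the published DuckLake syntax at the time of writing; the extension is young, so details may shift, and the paths here are illustrative. Here a local DuckDB file plays the relational metadata store.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")

# The catalog lives in a relational database (here a DuckDB file),
# not in metadata files scattered across object storage.
con.execute(
    "ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')"
)

con.execute("CREATE TABLE lake.trips AS SELECT 1 AS id, 'oslo' AS city")
# Frequent small updates: the pattern the post says file-based
# metadata formats struggle with at scale.
con.execute("UPDATE lake.trips SET city = 'pune' WHERE id = 1")
print(con.execute("SELECT * FROM lake.trips").fetchall())
```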

  • Monica Sarbu · CEO & Founder at xata.io, ex Elastic, co-founder Packetbeat

    Today, we’re open-sourcing the core of Xata under the Apache 2.0 license. It’s built for agentic workloads, where databases need to be created, used, and discarded continuously:
    🤖 Give every agent its own database branch, with real data
    ⚡ Instant database branching (copy-on-write), regardless of whether you have 50GB or 5TB of data
    💸 Pay only for compute; storage is shared with the parent
    🌙 Branches hibernate when idle and wake up on the first connection (scale-to-zero)
    🐘 100% vanilla Postgres (no forks, full compatibility)

    You can adopt branching without moving your production database, and we’re excited to see teams run it in their own environments, build on it, and contribute back.

    ⭐ Star the repo: https://lnkd.in/dNXYUb94

    If you’re curious how this works under the hood: branching is handled at the storage layer using copy-on-write, where new databases initially share the same data and only diverge on writes, making creation instant even for large datasets. Because storage is decoupled from compute and exposed over the network, each Postgres instance can run independently. Kubernetes then orchestrates these instances, starting and scaling branches as needed.

    👉 https://lnkd.in/d_tKfK_m
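This is not Xata's implementation, just a toy sketch of the copy-on-write idea the post describes: a fork initially shares every page with its parent and diverges only on writes, which is why creation is instant regardless of data size.

```python
class Branch:
    def __init__(self, parent=None):
        self.parent = parent
        self.pages = {}  # only locally modified pages live here

    def read(self, page_id):
        # Walk up the chain: unmodified pages are served by an ancestor,
        # so creating a branch copies nothing, even for a huge parent.
        node = self
        while node is not None:
            if page_id in node.pages:
                return node.pages[page_id]
            node = node.parent
        return None

    def write(self, page_id, data):
        self.pages[page_id] = data  # diverge only on write

    def fork(self):
        return Branch(parent=self)  # instant: no data is copied


prod = Branch()
prod.write("users/1", {"name": "ada"})

dev = prod.fork()                               # instant branch, real data
assert dev.read("users/1") == {"name": "ada"}   # shared with parent
dev.write("users/1", {"name": "ada lovelace"})  # diverges only in the fork
assert prod.read("users/1") == {"name": "ada"}  # parent untouched
print("copy-on-write semantics hold")
```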
