The Role of Open Source in Big Data

Explore top LinkedIn content from expert professionals.

Summary

Open source plays a major role in big data by providing freely accessible tools and platforms that help organizations manage, store, and analyze massive datasets. Open source in this context means that the software's underlying code is available for anyone to view, modify, and use, making it easier for communities and businesses to innovate and adapt solutions to their specific needs.

  • Gain flexibility: Using open source tools lets you customize your data infrastructure according to your exact business needs, rather than being limited by closed, commercial platforms.
  • Control costs: Open source big data solutions help you avoid unexpected expenses by eliminating licensing fees and reducing reliance on third-party services.
  • Encourage collaboration: Engaging with open source communities enables you to benefit from shared innovations, support, and a faster pace of technology improvements.
Summarized by AI based on LinkedIn member posts
  • View profile for Michael Ryaboy

    AI Developer Advocate | Vector DBs | Full-Stack Development

    5,018 followers

    Closed-source embedding models like OpenAI and Cohere are 𝗸𝗶𝗹𝗹𝗶𝗻𝗴 𝘆𝗼𝘂𝗿 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 in production, and it’s time someone said it. Here’s the truth: they’re great for testing, but in 𝘳𝘦𝘢𝘭-𝘸𝘰𝘳𝘭𝘥 production systems, they create more problems than they solve. Here’s why:

    𝗦𝗹𝗼𝘄 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗧𝗶𝗺𝗲𝘀: We’re talking 200-500+ms just to generate an embedding. For real-time applications, that’s a killer. Every query your users make is bottlenecked by the embedding API before you even start searching your vector database. This is because these APIs are optimized for batch inference, not inference with a batch size of one. For context, your entire search pipeline, including reranking, embedding generation, and vector search, can often be optimized to well below 50ms.

    𝗛𝗶𝗴𝗵 𝗮𝗻𝗱 𝗨𝗻𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗮𝗯𝗹𝗲 𝗖𝗼𝘀𝘁𝘀: Scaling with these APIs can burn a hole in your budget. Embedding large datasets or making frequent requests leads to outrageous costs, and they only get worse as your data grows. You also have no control over pricing changes that can happen overnight.

    𝗗𝗼𝘄𝗻𝘁𝗶𝗺𝗲: When these services go down, so does your ability to serve embeddings. You’re stuck waiting, with no control over when they’ll come back online. It’s not just about downtime; it’s about unpredictable latency spikes that make performance even harder to guarantee.

    𝗟𝗮𝗰𝗸 𝗼𝗳 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆: You can’t optimize their models, can’t reduce latency, and can’t deploy them closer to your data for faster over-the-wire responses. If they’re too slow, you’re stuck, and re-embedding your entire dataset with a different provider is costly and time-consuming.

    So what’s the alternative? 🔑 𝗢𝗽𝗲𝗻-𝗦𝗼𝘂𝗿𝗰𝗲 𝗠𝗼𝗱𝗲𝗹𝘀:

    𝗦𝗽𝗲𝗲𝗱: By running open-source embedding models on your own infra, you can reduce embedding latency to under 5ms on a CPU (a GPU may be better for high-throughput workloads).

    𝗖𝗼𝘀𝘁 𝗖𝗼𝗻𝘁𝗿𝗼𝗹: No API costs. You can embed billions of vectors without worrying about the bill.

    𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Full control over uptime. You’re not relying on third-party services that may go down or change their API overnight.

    Here’s what I recommend:
    • Test with closed-source providers, but when you move into production, switch to open-source models (e.g., BGE, MiniLM, Jina). Embed your dataset with open-source models and deploy your inference pipeline near your vector search system.
    • Rerank intelligently: You don’t need a massive model for initial retrieval. Use lightweight models for fast approximate retrieval, oversample, then rerank the top results with a more powerful cross-encoder or LLM for accuracy (see the sketch below).
    • Use the Text Embeddings Inference library, which, depending on your embedding model, can reduce your inference time to around 1ms.

    𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆: Don’t let closed-source APIs hold you back in production. Open-source gives you speed, control, and cost efficiency—everything you need to scale without sacrificing performance.
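
    A minimal sketch of that retrieve-then-rerank pattern, using the open-source sentence-transformers library. The model names are illustrative picks from the families mentioned above (BGE, MiniLM), and the tiny in-memory corpus stands in for a real vector database:

    ```python
    from sentence_transformers import SentenceTransformer, CrossEncoder

    # Lightweight open-source bi-encoder for fast retrieval; runs locally, no API call.
    embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

    docs = [
        "Iceberg tables support schema evolution.",
        "DuckDB is an in-process analytical database.",
        "Airflow orchestrates batch data pipelines.",
    ]
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    query = "Which tool schedules data pipelines?"
    query_vec = embedder.encode(query, normalize_embeddings=True)

    # Stage 1: fast approximate retrieval. Oversample: keep more candidates
    # than you ultimately need (here all 3, since the corpus is tiny).
    scores = doc_vecs @ query_vec
    candidates = [doc for _, doc in sorted(zip(scores, docs), reverse=True)][:3]

    # Stage 2: rerank the oversampled candidates with a slower, more accurate
    # cross-encoder that scores each (query, document) pair jointly.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    rerank_scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(rerank_scores, candidates), reverse=True)
    print(reranked[0])  # best document after reranking
    ```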

  • View profile for Adrian Brudaru

    Open source pipelines - dlthub.com

    14,024 followers

    LinkedIn is transforming its data tech stack with Apache Iceberg and open data formats—here's why that matters for all of us grappling with big data. By adopting Iceberg, LinkedIn enhances data management at petabyte scale, enabling better versioning, schema evolution, and performance. If you're facing challenges in handling massive datasets, LinkedIn's approach to leveraging open data solutions like Iceberg could be the game-changer you need. Read about OpenHouse, the management plane they use for Iceberg tables: https://lnkd.in/exQV__Pq Read more about LI's data infra here: https://lnkd.in/emwkj9GZ
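
    For a concrete feel of those Iceberg capabilities, here is a minimal sketch using the open-source pyiceberg library; the catalog name and table identifier are hypothetical, and a configured Iceberg catalog (REST, Hive, etc.) is assumed:

    ```python
    from pyiceberg.catalog import load_catalog
    from pyiceberg.types import StringType

    # Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml).
    catalog = load_catalog("default")
    table = catalog.load_table("analytics.events")  # hypothetical table

    # Schema evolution: adding a column is a metadata-only commit,
    # with no rewriting of existing data files.
    with table.update_schema() as update:
        update.add_column("country", StringType())

    # Versioning: every commit produces a snapshot that can be inspected
    # (and time-traveled to from query engines).
    for snapshot in table.metadata.snapshots:
        print(snapshot.snapshot_id, snapshot.timestamp_ms)
    ```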

  • View profile for Prasanna Lohar

    Investor | Board Member | Independent Director | Banker | Digital Architect | Founder | Speaker | CEO | Regtech | Fintech | Blockchain Web3 | Innovator | Educator | Mentor + Coach | CBDC | Tokenization

    90,888 followers

    Open Source Data Engineering Landscape 2025

    The open source data engineering landscape continues to evolve rapidly, with significant developments across storage, processing, integration, and analytics.

    Current Momentum
    ➟ While this growth demonstrates continued innovation, the year also saw some concerning developments regarding licensing changes.
    ➟ Established projects including #Redis, #CockroachDB, #ElasticSearch, and #Kibana transitioned to more closed and proprietary licenses, though Elastic later announced a return to open source licensing.
    ➟ These shifts were balanced by significant contributions to the open source community from major industry players. Snowflake's contribution of #Polaris, Databricks' open sourcing of Unity Catalog, OneHouse's donation of Apache XTable, and Netflix's release of Maestro demonstrated ongoing commitment to open source development from industry leaders.

    💡 The #Apache Foundation maintained its position as a key steward of data technologies, actively incubating several promising projects.
    💡 The #Linux Foundation has also strengthened its position in the data space, continuing to host exceptional projects such as Delta Lake, Amundsen, Kedro, Milvus, and Marquez.

    The Data Engineering Landscape 2025 (https://lnkd.in/dPSpKkq3)
    ➟ Storage Systems: Databases and storage engines spanning OLTP, OLAP, and specialised storage solutions.
    ➟ Data Lake Platform: Tools and frameworks for building and managing data lakes and lakehouses.
    ➟ Data Processing & Integration: Frameworks for batch and stream processing, plus Python data processing tools.
    ➟ Workflow Orchestration & DataOps: Tools for orchestrating data pipelines and managing data operations.
    ➟ Data Integration: Solutions for data ingestion, CDC (Change Data Capture), and integration between systems.
    ➟ Data Infrastructure: Core infrastructure components including container orchestration and monitoring.
    ➟ ML/AI Platform: Tools focused on ML platforms, MLOps, and vector databases.
    ➟ Metadata Management: Solutions for data catalogs, governance, and metadata management.
    ➟ Analytics & Visualisation: BI tools, visualisation frameworks, and analytics engines.

    The open source data ecosystem is entering a phase of maturity in key areas such as the data lakehouse, characterised by consolidation around proven technologies and increased focus on operational efficiency. Source: https://lnkd.in/dpZP7WdD

  • View profile for Simon Späti

    Data Engineer, Author & Educator | ssp.sh, dedp.online

    20,153 followers

    Imagine building enterprise data infrastructure where you write 90% less code but deliver twice the value.

    The #opendatastack and #moderndatastack freed us from vendor lock-in, allowing teams to select best-of-breed tools for ingestion, ETL, and orchestration. But this freedom comes at a cost: fragmented governance, security gaps, and potential technical debt when stacking disconnected tools across your organization. On the flip side, closed-source platforms offer #unified experiences but trap you in their ecosystems, where you can't access the code or extend beyond their feature sets when you need to.

    ✦ What if we could have the best of both worlds? ✦

    Enter declarative data stacks: open-source solutions that seamlessly integrate powerful orchestration tools while covering your entire data lifecycle. These are complete data platforms with #batteriesincluded, where configuration replaces coding, dramatically reducing complexity while implementing best practices by default.

    ---

    Yesterday, I talked about how DevOps is the new Data Science in the field, as most data engineering projects morph into infrastructure-as-code projects. While this is a good thing per se, it also takes valuable time away from building what actually drives business value. A solution to this problem that does not lock us in, thanks to swappable compute engines, is an open-source #declarativedatastack.

    Today, I share «How Declarative Data Stacks Enable Enterprise Scale» (https://lnkd.in/eARCttqz), which goes into the mantra behind «Configure, Don't Code»: better insight velocity through developer productivity, writing less code and integrating with well-established open-source tools such as Airflow, Dagster, Snowflake Tasks, etc.

    You can learn more about how declarative data stacks manage #enterprise data complexity. You'll discover how starlake.ai, an open-source declarative data stack, enables data transformations through simple configurations, helping teams deliver production-ready platforms in days instead of months while automatically maintaining governance and lineage.

    At its core, Starlake serves as your data warehouse orchestrator: a unified control center that manages your entire data ecosystem through a single, coherent lineage. It is not just a tool but an end-to-end OSS data platform that makes "configure, don't code" a practical reality for enterprise teams (a generic sketch of that pattern follows below). As an enterprise, you get key components such as "Extract, Load, Transform, Orchestrate, and Lineage" as the skeleton of a declarative data stack, out of the box. The return on investment is almost instant and grows with the rising complexity of the data platform.

    What's your take on the declarative data ecosystem, and how do you manage complexity in big data engineering projects? #unbundlingvsbundling #opensource #dataengineering
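
    To make "configure, don't code" tangible, here is a deliberately generic sketch of the pattern: the pipeline lives in a declarative spec, and a small interpreter turns it into work. This is not Starlake's actual configuration format, just an illustration of the idea; all names and fields are made up:

    ```python
    import yaml  # PyYAML

    # A hypothetical declarative pipeline spec: it states what to do, not how.
    PIPELINE_SPEC = """
    pipeline: orders_daily
    tasks:
      - name: extract_orders
        kind: load
        source: s3://example-bucket/orders/
      - name: clean_orders
        kind: transform
        sql: SELECT * FROM extract_orders WHERE status IS NOT NULL
        depends_on: [extract_orders]
    """

    # A generic runner interprets the spec; swapping compute engines means
    # swapping handlers, not rewriting pipeline code.
    HANDLERS = {
        "load": lambda t: print(f"loading {t['source']} -> {t['name']}"),
        "transform": lambda t: print(f"running SQL for {t['name']}: {t['sql']}"),
    }

    spec = yaml.safe_load(PIPELINE_SPEC)
    for task in spec["tasks"]:
        # A real platform would resolve depends_on into a DAG and schedule it;
        # here tasks simply run in declaration order.
        HANDLERS[task["kind"]](task)
    ```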

  • View profile for Itamar Friedman

    Co-Founder & CEO @ Qodo | Intelligent Software Development | Code Integrity: Review, Testing, Quality

    16,936 followers

    While OpenAI opts for a closed approach, #Meta doubles down on open-source models. I assume you've heard about Meta's release of the Llama 3.1 model family by now; otherwise, see comments. Is Meta’s #Llama3 truly open, and can we deduce insights and lessons from the PyTorch (Meta) vs. TensorFlow (Google) battles?

    ---

    1. Community Engagement is Crucial
    PyTorch's rise to prominence was significantly fueled by its strong community support. #opensource large language models (LLMs) can harness this power by fostering active communities, encouraging contributions, and providing clear documentation. This community-driven innovation can propel open models forward.

    2. Ease of Use and Flexibility Matter
    PyTorch gained popularity due to its user-friendly design and flexibility, appealing to researchers and developers alike. Similarly, open-source LLMs should prioritize ease of use, making it simple for developers to implement, modify, and extend the models, thereby attracting a broader user base.

    3. Innovation and Collaboration Drive Progress
    The competition between PyTorch and TensorFlow led to rapid advancements. Open-source LLMs can benefit from a collaborative approach, enabling widespread contributions that drive innovation and enhance model robustness and versatility.

    4. Balancing Openness with Control
    While open-source models offer transparency and community input, maintaining some control over core development ensures quality and coherence. This balance is essential for open-source LLMs to maintain high standards while benefiting from widespread collaboration.

    5. The Reality of Meta's Openness
    Unlike PyTorch, Meta's open models don't fully share their training data, limiting true openness. For large language models, complete transparency involves sharing datasets and training processes. Without this, the analogy between PyTorch versus TensorFlow and Meta versus OpenAI falls short. Meta needs to offer more comprehensive access to their data and training methodologies to truly compete on openness.

    6. Advancing Fine-Tuning Techniques
    While the lack of full dataset transparency is a challenge, there are ways to mitigate it. Techniques like fine-tuning, including methods such as LoRA (Low-Rank Adaptation), allow developers to adapt and improve models without needing access to the original training data. Advancing these techniques can help bridge the gap, enabling more effective use of open-source models (a minimal example follows below).

    ---

    In conclusion, while Meta has the potential to leverage community, usability, innovation, and transparency to *win* over OpenAI (yes, you read this right), it must address the full scope of openness required for LLMs. These are the lessons from the PyTorch vs. TensorFlow battle. But the journey toward true open-source LLMs requires a deeper commitment to transparency and collaboration over data, or at least advancements in fine-tuning techniques.
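
    Point 6 is concrete enough to sketch: here is a minimal LoRA setup with Hugging Face's open-source peft library. The model name and hyperparameters are illustrative only, not recommendations:

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Gated repo: requires accepting Meta's license on Hugging Face first.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # LoRA: freeze the base weights and train small low-rank update matrices,
    # so no access to the original training data is needed.
    config = LoraConfig(
        r=8,                                  # rank of the update matrices
        lora_alpha=16,                        # scaling factor for the updates
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of all weights
    ```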

  • View profile for Pedram Navid

    Education @ Anthropic

    7,959 followers

    Open Source is Eating the Data Stack. What's Replacing Microsoft & Informatica Tools?

    I've been reading a great discussion about replacing traditional proprietary data tools with open-source alternatives. Companies are increasingly worried about vendor lock-in, rising costs, and scalability limitations with tools like SQL Server, SSIS, and Power BI. The consensus is clear: open source is winning in modern data engineering.

    💡 What's particularly interesting is the emerging standard stack that data teams are gravitating toward (a small taste of it follows below):
    • PostgreSQL or DuckDB for warehousing
    • dbt or SQLMesh for transformations
    • Dagster or Airflow for orchestration
    • Superset, Metabase, or Lightdash for visualization
    • Airbyte or dlt for ingestion

    As one data engineer noted, "Your best hedge against vendor lock-in is having a warehouse and a business-facing data model worked out. It's hard work but keeping that layer allows you to change tools, mix tools, lower maintenance by implementing business logic in a sharable way."

    I see this shift every day. Teams want the flexibility to choose best-of-breed tools while maintaining unified control and visibility across their entire data platform. That's exactly why you should be building your data platform on top of tooling that integrates with your favorite tools rather than trying to replace them. Vertical integration sounds great, if you enjoy vendor lock-in, slow velocity, and rising costs.

    Python-based, code-first approaches are replacing visual drag-and-drop ETL tools. We all know SSIS is horrible to debug, slow, and outdated. The modern data engineer wants software engineering practices like version control, testing, and modularity. The real value isn't just cost savings; it's improved developer experience, better reliability, and the freedom to adapt as technology evolves.

    For those considering this transition, start small. Replace one component at a time and build your skills. Remember that open source requires investment in engineering capabilities, but that investment pays dividends in flexibility and innovation.

    Where do you stand on the proprietary vs. open source debate? And if you've made the switch, what benefits have you seen?

    #DataEngineering #OpenSource #ModernDataStack #Dagster #dbt #DataOrchestration #DataMesh
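
    As a taste of how lightweight the open-source warehousing option above can be, here is a minimal DuckDB session; the file name and table are made up for illustration:

    ```python
    import duckdb

    # The entire "warehouse" is one local file; no server to install or license.
    con = duckdb.connect("warehouse.duckdb")
    con.execute("""
        CREATE TABLE IF NOT EXISTS orders AS
        SELECT * FROM (VALUES (1, 'shipped', 19.99), (2, 'returned', 5.00))
            AS t(order_id, status, amount)
    """)

    # Plain SQL, running in-process alongside your application code.
    print(con.execute(
        "SELECT status, SUM(amount) FROM orders GROUP BY status"
    ).fetchall())
    ```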

  • View profile for Dinko Eror

    VP Red Hat EMEA Central, North and Eastern Europe

    14,738 followers

    My 11th blog this year, very topical: “Free code, strong Europe? Why open source should be our digital compass!”

    Digital sovereignty is the name of the game. The debate about whether we can afford to be dependent on a few players has become more explosive in the face of political upheaval and economic uncertainty. Some even ask: Are we just a digital colony? My answer: No, but a little more sovereignty certainly wouldn't hurt. Which brings me to my favorite topic: open source!!

    If we build our digital foundation on open standards and platforms, Europe will retain sovereignty over its data and infrastructure. This is not an abstract ideal, but a practical necessity. Open source is the tool to regain control, transparency, and flexibility without reinventing the wheel. Here are 4 arguments that every executive needs to know.

    #1: Say goodbye to vendor lock-in. Proprietary platforms behave like smugglers at customs: they pack functionality into black-box modules that can only be replaced with great difficulty. Open source, on the other hand, follows the Unix principle of "do one thing and do it well". Companies combine freely available components to create a customized stack, and always own the source code.

    #2: Security is a team effort. In proprietary silos, a single backdoor can compromise the entire system. Open source, on the other hand, opens the source code to eyes from around the world. Vulnerabilities are often patched in hours instead of months. Those who rely on European privacy and compliance regulations need to know that no one is snooping on their data, and that is exactly what open source provides.

    #3: Innovate at the speed of the market. In the digital economy, the winners are those who test and scale quickly. Vendor approvals, countless change requests, and lengthy license reviews? That was yesterday. Open source lets you prototype at the speed of light, port to a variety of platforms, and get instant access to new releases.

    #4: One for all. Imagine a network where SAP applications coexist with OpenStack infrastructure, data flows across federated machine learning solutions, and local data centers become global hubs. An open ecosystem that adheres to strict privacy regulations while taking full advantage of cutting-edge platforms.

    The fact is: more and more companies are turning to open source when it comes to data protection. And, as is often the case, more and more countries are embracing open source software for public sector applications, with Estonia and Finland leading the way. According to the Linux Foundation Europe's Open Source Maturity in Europe 2024 study, 76% of enterprises now consider open source to be more secure than proprietary software, and 82% support the "public money, public code" principle for publicly funded software.

    In any case, open source is the fuel with which we can collectively ignite the engine of European digitalization. Link to the Linux Foundation report: https://lnkd.in/eJa_TGmd

  • View profile for Yoni Michael

    Building typedef.ai | Ex-Tecton & Salesforce Infra | Coolan Co-Founder (acq)

    7,092 followers

    When industry giant #Snowflake decided to provide external table management and support for Iceberg tables, it served as a tipping point: the dawn of truly open data architectures is no longer a future ideal; it’s today’s reality.

    Seeing a major cloud giant embrace open table formats like Iceberg is a game-changer. It signals a fundamental shift: we're moving away from the era where proprietary lock-in was standard, to a data landscape that’s truly interoperable and future-proof. It’s not just about adding “support” for Iceberg; it's a signal that the days of closed, siloed data models are numbered.

    Open architectures like Iceberg bring autonomy to data teams, allowing them to build pipelines that work seamlessly across clouds, with choice in engines and tools (a sketch of that engine choice follows below). This means faster innovation and greater flexibility, but it also empowers the entire ecosystem by reducing the risk of vendor lock-in.

    But the real question is, how far will they go? The next step would be to treat external Iceberg catalogs as first-class citizens, providing true write capabilities with the same performance users experience within the Snowflake catalog. If Snowflake takes this step, it will be transformative for organizations striving to operate seamlessly across data environments without sacrificing performance.

    And they’re not alone in this vision. Other major data platforms are following suit, building deep integrations with data lakes and open table formats like Iceberg, Delta Lake, and Hudi, further solidifying the industry’s shift toward open standards. This is a massive step toward a genuinely open data ecosystem, one where choice isn’t a trade-off but an advantage. Interoperability, flexibility, and performance can coexist without compromise. 💪

    #DataEngineering #Snowflake #OpenData #DataInfrastructure
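
    That engine choice is easy to sketch: two independent open-source engines reading the same Iceberg table, with no proprietary gateway in between. The catalog, table, and path are hypothetical, and the DuckDB iceberg extension plus a configured pyiceberg catalog are assumed:

    ```python
    import duckdb
    from pyiceberg.catalog import load_catalog

    # Engine 1: pyiceberg scans the table into Arrow for Python-side processing.
    table = load_catalog("default").load_table("analytics.events")
    arrow_batch = table.scan(limit=10).to_arrow()

    # Engine 2: DuckDB's iceberg extension reads the same table files directly.
    con = duckdb.connect()
    con.execute("INSTALL iceberg;")
    con.execute("LOAD iceberg;")
    rows = con.execute(
        "SELECT count(*) FROM iceberg_scan('warehouse/analytics/events')"
    ).fetchall()
    print(rows)
    ```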

  • View profile for Clem Delangue 🤗

    Co-founder & CEO at Hugging Face

    302,483 followers

    Important report "Stopping Big Tech from becoming Big AI" "Open source AI has an important role to play in countering a lack of interoperability and access, and fostering innovation, by lowering barriers to entry, particularly for smaller and less well-resourced actors. Building on open source platforms, developers can create customized AI models and applications without having to make massive investments in computing power, data and other inputs. Open source also supports critical public interest research on the safety and trustworthiness of AI – for example, ensuring that researchers have access to foundation models or their training data, in order to carry out assessments of harmful biases." https://lnkd.in/emzD6rUy

  • View profile for Gleb Mezhanskiy

    CEO @ Datafold – AI automation for data engineering

    15,403 followers

    What would a 100% open-source modern data stack look like? I've been tracking leading OSS data technologies for a few years and am excited to share the updated roundup (link below).

    Why talk about open source? Open source alone is not a compelling enough reason to choose a technology for the data stack. However, there are a few good reasons to evaluate open-source data stack elements:

    > It gives a good sense of the state of technology. OSS operates across vendors and ecosystems and often shapes or at least signals the direction of the industry.
    > Unlike closed-source SaaS, which often hides its full functionality behind demos and layers of overpromising marketing that can be hard to cut through, open-source SaaS is fully transparent in its capabilities and, therefore, is easier to research and compare.
    > Some parts of the data stack, such as orchestrators or in-memory data processing engines, greatly benefit from being open source as their users are likely to embed them in their internal applications and/or extend them (a tiny example of that embedding follows below).
    > Given regulatory compliance, cost, and location, open source may be the only feasible solution for some organizations.

    Did I miss anything significant? LMK!
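
    The "embed them in their internal applications" point deserves one tiny illustration. This sketch uses Polars as one example of such an in-memory engine; the data is made up:

    ```python
    import polars as pl

    # An open-source engine dropped straight into application code:
    # no server, no license, and the source is there if you need to extend it.
    events = pl.DataFrame({
        "user": ["a", "a", "b"],
        "latency_ms": [120, 340, 90],
    })
    summary = (
        events.group_by("user")
              .agg(pl.col("latency_ms").mean().alias("avg_ms"))
              .sort("avg_ms")
    )
    print(summary)
    ```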
