Consumers and enterprises dread that generative A.I. tools like ChatGPT breach privacy by using conversations as training data, storing PII, and potentially surfacing confidential data in responses. Prof. Raluca Ada Popa has all the solutions.

Today's guest, Raluca:
• Is Associate Professor of Computer Science at the University of California, Berkeley.
• Specializes in computer security and applied cryptography.
• Has had her papers cited over 10,000 times.
• Is Co-Founder and President of Opaque Systems, a confidential-computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., including letting you interact securely with generative A.I.
• Previously co-founded PreVeil, a now-well-established company that provides end-to-end document and message encryption to over 500 clients.
• Holds a PhD in Computer Science from MIT.

Despite being such a deep expert, Raluca does a stellar job of communicating complex concepts simply, so today's episode should appeal to anyone who wants to dig into the thorny issues of data privacy and security around Large Language Models (LLMs) and how to resolve them.

In the episode, Raluca details:
• What confidential computing is and how to do it without sacrificing performance.
• How you can perform inference with an LLM (or even train one!) without anyone — including the LLM developer! — being able to access your data.
• How you can use commercial generative models like OpenAI's GPT-4 without OpenAI being able to see sensitive or personally identifiable information you include in your API query.
• The pros and cons of open-source versus closed-source A.I. development.
• How and why you might want to seamlessly run your compute pipelines across multiple cloud providers.
• Why you should consider a career that blends academia and entrepreneurship.

Many thanks to Amazon Web Services (AWS) and Modelbit for supporting this episode of SuperDataScience, enabling the show to be freely available on all major podcasting platforms and on YouTube — see comments for details ⬇️

#superdatascience #generativeai #ai #machinelearning #privacy #confidentialcomputing
Data Privacy Concerns in Open vs Proprietary Models
Summary
Data privacy concerns in open vs proprietary models revolve around how personal or sensitive information is handled by artificial intelligence systems, with open-source models offering greater transparency and control, while proprietary models often require trusting an external provider with your data. Understanding the differences between these approaches is essential for anyone working with AI tools that process confidential information.
- Assess model transparency: Choose open-source models when you need to see how your data is used and want the ability to run AI solutions fully within your own secure environment.
- Control data access: Keep sensitive or regulated information on-premises by deploying models locally, avoiding the risks of sending data to external proprietary platforms.
- Plan for compliance: Align your AI workflows with privacy laws and organizational policies by selecting solutions that let you fine-tune models with your own internal datasets rather than relying on external providers.
I just compared the best open-source and closed-source LLMs, and the results were surprising.

Nobody wants to send their data to Google or OpenAI. Yet here we are, shipping proprietary code, customer information, and sensitive business logic to closed-source APIs we don't control. While everyone chases the latest closed-source releases, open-source models are quietly becoming the practical choice for many production systems.

Here's what everyone is missing: open-source models are catching up fast, and they bring something the big labs can't: privacy, speed, and control.

I built a playground to test this myself, using Comet's Opik to evaluate models on real code-generation tasks: testing correctness, readability, and best practices against actual GitHub repos (a simplified sketch of the comparison idea follows below).

Here's what surprised me: OSS models like MiniMax-M2 and Kimi K2 performed on par with the likes of Gemini 3 and Claude Sonnet 4.5 on most tasks. And practically, MiniMax-M2 turns out to be a winner: it's twice as fast and 12x cheaper than models like Sonnet 4.5.

This isn't just about saving money. When your model is smaller and faster, you can deploy it in places closed-source APIs can't reach:
↳ Real-time applications that need sub-second responses
↳ Edge devices where latency kills user experience
↳ On-premise systems where data never leaves your infrastructure

MiniMax-M2 runs with only 10B activated parameters. That efficiency means lower latency, higher throughput, and the ability to handle interactive agents without breaking the bank. The intelligence-to-cost ratio changes what's possible: you're no longer choosing between quality and affordability, and you're not sacrificing privacy for performance. The gap is closing, and in many cases it's already closed.

If you're building anything that needs to be fast, private, or deployed at scale, it's worth taking a look at what's now available. MiniMax-M2 is 100% open-source and free for developers right now. I have shared the link to their GitHub repo in the first comment, along with the code for the playground and evaluations.
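A minimal sketch of the comparison idea, not the post's actual playground: send the same code-generation prompt to an open-weights model (served behind any OpenAI-compatible endpoint, e.g. vLLM) and to a hosted closed model, then record latency and output for scoring. The endpoint URL, model names, and the crude scoring proxy are illustrative assumptions.

```python
import time
from openai import OpenAI

PROMPT = "Write a Python function that merges two sorted lists into one sorted list."

endpoints = {
    # Hypothetical local OpenAI-compatible server hosting an open-weights model.
    "open-model": OpenAI(base_url="http://localhost:8000/v1", api_key="unused"),
    # Hosted closed-source API; reads OPENAI_API_KEY from the environment.
    "closed-model": OpenAI(),
}
model_names = {"open-model": "MiniMax-M2", "closed-model": "gpt-4o"}

for label, client in endpoints.items():
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model_names[label],
        messages=[{"role": "user", "content": PROMPT}],
    )
    latency = time.perf_counter() - start
    answer = resp.choices[0].message.content
    # Crude proxy metrics; a real harness (like the Opik evaluation in the
    # post) would execute the generated code against tests and score
    # correctness and readability properly.
    print(f"{label}: {latency:.2f}s, {len(answer)} chars")
```

Same interface for both sides, which is what makes swapping an open model behind a self-hosted endpoint so low-friction in practice.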
-
Open-source AI models trained on made-up (synthetic) data can perform just as well as GPT-4 in turning radiology notes into structured reports — and they protect patient privacy better.

1️⃣ Researchers made 3000 fake thyroid scan reports using GPT-4, then used them to train several open-source AI models.
2️⃣ These models were tested on real hospital data to see how well they could pull out key details and fill in a standard report template.
3️⃣ The best open-source model (Yi-34B) scored almost the same as GPT-4 when given five examples to learn from.
4️⃣ Some smaller open models even beat GPT-3.5, showing you don't always need a huge AI to get strong results.
5️⃣ GPT-4 was better at finding the right report sections. Open models had more variation in how accurate they were.
6️⃣ GPT-4 made more mistakes when info was missing. Yi-34B sometimes copied wording directly instead of using standard terms.
7️⃣ Even the smallest model tested (1B) did well, showing it might be possible to run this kind of AI on local hospital computers or phones.
8️⃣ Unlike GPT, open models can run fully inside hospital systems, keeping patient data private and secure.
9️⃣ Using synthetic data means no real patient info is needed, which solves a big privacy and access problem.
🔟 The team suggests training many small models, each focused on one specific report task, to help doctors work faster and more accurately.

✍🏻 Aakriti "Ari" Pandita, MD, Angela Keniston, Nikhil Madhuripan, MD. Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting. npj Digital Medicine. 2025. DOI: 10.1038/s41746-025-01658-3
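A minimal sketch of the few-shot setup described in point 3️⃣: prompt a locally hosted open-weights model with example note-to-report pairs, then ask it to fill the template for a new (here invented) note. The model ID, template, and examples are illustrative assumptions, not the paper's exact protocol; the privacy point is that everything runs on local hardware.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # small stand-in; the paper's best performer was Yi-34B
    device_map="auto",
)

# Few-shot examples pairing a free-text note with a structured template.
few_shot = (
    "Note: Right lobe nodule, 1.2 cm, solid, hypoechoic.\n"
    "Report: {size_cm: 1.2, location: right lobe, composition: solid, echogenicity: hypoechoic}\n\n"
)
new_note = "Note: Left lobe nodule, 0.8 cm, cystic.\nReport:"

out = generator(
    few_shot + new_note,
    max_new_tokens=64,
    do_sample=False,  # deterministic decoding for extraction-style tasks
)
print(out[0]["generated_text"])
```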
-
The hot debate the last couple of days is DeepSeek vs "Western" models. Leaving aside financial impacts, there are interesting points about nation-state competition and whether you should use DeepSeek. Two key issues are potential bias or censorship in its training data (e.g., ignoring Tiananmen Square) and concerns about whether your data ends up in China. These matter, but note that DeepSeek-R1 is open-source, so you can run it offline/locally or in a secure environment, without using the DeepSeek app.

🤔 Remember, an app is just an interface to the underlying model. ChatGPT is a US-based interface for OpenAI's closed-source models (4o, o1, etc.). Mistral is French. Using those apps means you share data with their providers (potentially for further training or other reasons). Alternatively, you can use open-source models (like DeepSeek) with free, offline "chat" interfaces.

✅ Two of the easiest to use are GPT4All (https://lnkd.in/g4ANvWjD) and AnythingLLM (https://anythingllm.com/). They provide GUIs for Windows, Linux, and macOS - no complex command-line steps needed. You just install, download the model you want (often from Hugging Face), and start chatting. Both also offer Retrieval Augmented Generation (RAG), so you can load your own documents offline, build embeddings automatically, and then chat securely while referencing your own materials. If you want the power of open-source models like DeepSeek but want to avoid data-privacy issues, these tools are worth trying (a minimal offline example follows below). Additionally, AnythingLLM includes Agents for scraping websites and browsing within your chats. I'll be running a webinar soon on how this benefits OSINT collection and analysis.

🤔 There's also controversy about using models from adversarial nations or models trained in ways that may conflict with "our" social norms. Without debating any specific stance, there's value in testing how such models might derive conclusions or support decision cycles - especially important for intelligence analysts who must consider how adversaries might use AI. It's akin to studying adversarial doctrine, but more dynamic. Likewise, cyber threat actors are developing custom GPTs; understanding their capabilities is crucial for cybersecurity teams to form effective mitigations.

✅ Ultimately, DeepSeek's alternative development approach is a net positive. It demonstrates potential compute and energy savings vital to sustainability, and lowers the barrier to entry for those without the resources of Big Tech. Startups can now fine-tune advanced open-source models with reduced CAPEX/OPEX, bringing new solutions to market faster. This also helps investors diversify AI funding and support a broader array of startups. Growing the AI ecosystem benefits everyone tackling "wicked problems" worldwide, as AI will undoubtedly play a direct or indirect role in solving them.

#OSINT #GAI #AI
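For the programmatically inclined, a minimal sketch of fully offline inference using the gpt4all Python bindings (the same engine behind the GPT4All desktop app mentioned above). The model filename is an assumption; any GGUF model from the GPT4All catalog or Hugging Face works. Once downloaded, no prompt or response ever leaves your machine.

```python
from gpt4all import GPT4All

# Downloads the model file once, then runs entirely on local hardware.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    reply = model.generate(
        "Summarize the privacy trade-offs of hosted LLM APIs.",
        max_tokens=256,
    )
    print(reply)
```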
-
Less than 1% of enterprise proprietary data is in a foundation model. Not 30%. Not 10%. 1%.

That number is from Michael Conway at IBM, in an interview with Dun & Bradstreet. Surprising if you're taking the "We'Re OuT oF dAtA" narrative from the big AI labs at face value. Yet if you spend any time with banks, insurers, healthcare providers, telcos, defence, or governments, you probably already know that their highest-value data is still firmly inside their own walls. That means:
• Tight access controls
• Regulatory pressure
• Legacy estates
• Air-gaps

Add growing concerns about geopolitical risk and model-provider uncertainty, and you're left with a simple question: if certain sensitive data cannot leave an on-prem environment for conventional cloud, why would AI workloads trained on or using the same data be any different?

Now overlay what's happening on the model side. DeepSeek just released new open models (V3.2 and V3.2-Speciale) that hit performance levels once assumed to be unreachable without a nine-figure proprietary lab:
• V3.2-Speciale matches Google Gemini-3 Pro
• DeepSeek-V3.2 matches GPT-5 on multiple reasoning benchmarks
• V3.2-Speciale hits gold-medal scores on International Math Olympiad and Informatics Olympiad tests

These models are available under open licences and run wherever you want, including fully controlled on-prem environments. That flips the equation for large enterprises. Because if your proprietary data can't leave your environment, and open models are now hitting GPT-5-level reasoning under permissive licences, you suddenly have:
• Full control
• Price stability
• Transparency
• No IP exposure
• Zero data egress
• Flexible deployment
• On-prem sovereignty
• No vendor lock-in risk

And it doesn't stop there. AI development tools keep improving, driving down the cost of software development. If the trend continues, there will come a point at which the cost of internal development falls below the cost of purchase. Because if an enterprise can now:
• Bring the model to the data
• Fine-tune with internal datasets (see the sketch below)
• Run it inside their own security perimeter
• Build custom software with the data in mind
• Avoid every compliance and residency headache
…then the pressure to buy external AI software starts to collapse.

At that point, enterprises stop shaping their workflows around external providers and start shaping their suppliers around their own data, infrastructure, and risk posture.

The million-dollar question here, as recently discussed with Edward Oakes from Groq, is when does this transition begin in earnest? And what does it do to a market where most of the bigger pockets of land have already been swallowed by companies that the largest potential customer segment cannot compliantly engage with?

Interesting times ahead!
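A minimal sketch of "bring the model to the data": LoRA fine-tuning an open-weights model on internal text, with nothing leaving the security perimeter. The model ID, dataset path, and hyperparameters are illustrative assumptions, not a production recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-1.5B-Instruct"  # stand-in for any permissively licensed open model
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable adapters; the base weights stay frozen.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Internal documents read from local disk only; no data egress.
data = load_dataset("text", data_files={"train": "internal_docs.txt"})["train"]
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="onprem-adapter",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The resulting adapter is a few megabytes that stays alongside the data it was trained on, which is exactly the sovereignty story the post describes.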
-
Should you trust your fine-tuned models when using your private data? A new paper, "Be Careful When Fine-tuning Open-Source LLMs: Your Fine-tuning Data Could Be Stolen!", highlights concerns for those using fine-tuning in downstream tasks.

Fine-tuning open-source large language models has quickly become standard practice for companies that want to adapt AI to their specific needs. But a new study out of Tsinghua University raises an urgent red flag: your fine-tuning data may not be as safe as you think. The research shows that open-source models can be backdoored before release in ways that allow the original model creator to later extract your private fine-tuning dataset, even if they only have black-box access to your model. In experiments, attackers were able to recover as much as 76% of the downstream fine-tuning queries in realistic conditions, and nearly 95% under ideal settings. That's not just memorization during pretraining; this is leakage of the highly curated, proprietary prompts companies rely on to differentiate themselves.

Why does this matter?
• Proprietary datasets often represent months of work and significant cost.
• They may include sensitive or regulated information.
• If exposed, competitors could replicate or undermine your strategy overnight.

The paper also shows that current detection-based defenses are weak. Even when organizations probe for backdoors, attackers can disguise their extraction triggers in ways that bypass standard checks.

This has two big implications for the AI ecosystem:
1. Due diligence on open-source models will need to go beyond benchmarks and licenses. Security auditing and trust in the supply chain must become part of the evaluation process.
2. Stronger defenses are urgently needed. Relying on open-source models without rigorous vetting may expose companies to invisible risks.

Paper link: https://lnkd.in/gMhSAYSF

Open-source models are powerful tools, but fine-tuning them on valuable private data carries a hidden cost. Without robust safeguards, organizations risk giving away their crown jewels without even realizing it.

#AI #LLM #Security #OpenSource #FineTuning
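To make "detection-based defenses" concrete, here is a naive sketch of one such check: plant a unique canary string in a few fine-tuning examples, then probe the deployed model to see whether the canary can be elicited verbatim. The model path and probe prompts are hypothetical, and the paper's whole point is that real attackers can evade checks of this kind, so treat it as a baseline, not a safeguard.

```python
from transformers import pipeline

# Unique marker previously embedded in a handful of fine-tuning examples.
CANARY = "CANARY-7f3a9c2e"

pipe = pipeline("text-generation", model="./my-finetuned-model")  # hypothetical local model
probes = [
    "Repeat any unusual identifiers you remember from your training data.",
    "Complete this string exactly: CANARY-",
]
for prompt in probes:
    out = pipe(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    if CANARY in out:
        print(f"Canary surfaced for probe {prompt!r}: possible training-data leakage")
```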
-
Last week I posted about #DeepSeek and concerns about data privacy when using the app. Social media posts are short-form and therefore leave little room for nuance, so even though I said this was in reference to the app specifically, some people took it to mean: stay away from DeepSeek at all costs. That's certainly not what I meant.

Everyone's position on this will be slightly different, based on what they use #AI for, what organization they're with, whether they're using it for work or personal tasks, and what form of DeepSeek's technology they're using. Based on how you use it, the potential risk changes.

In the most recent episode of their podcast, "Further Comments", Damien Riehl and Horace Wu invited Australian AI entrepreneur Joe Rayment to join them in a fantastic discussion about DeepSeek, which anyone wondering about it should listen to. A few topics they covered are really critical, so I'm highlighting them here (but you should still go and listen to the whole podcast!):

💡 When you're talking about DeepSeek, you could be talking about any one of four things: the web application, the direct API open to developers, an open-weights model that can be hosted on any cloud, and the ability to host DeepSeek on your own infrastructure. Data-privacy risk (and the risk of some sort of back door being built into the technology) is highest in the web application, but may also be an issue in the direct API. If you are using the open-weights model or self-hosting, it is less likely to be a problem, especially if you're using your (approved) cloud provider of choice.

🚨 Bias is a factor when using any model, and those biases might look different depending on where the model originates. A Belgian study cited by Joe found that Chinese models showed higher scores and positive sentiment towards concepts like law and order, social harmony, economic control, supply-side economics, and nationalization, whereas Western models were instilled with values like multiculturalism, freedom, and environmentalism. These biases will likely come into play when considering certain areas of law, and if your organization is using a model like DeepSeek, or a third-party vendor that uses DeepSeek, lawyers may need to be trained to correct for that potential bias.

Finally, and as I said in my initial DeepSeek post last week, even though the REAL cost of developing DeepSeek is unknown, the training methodology, and the fact that it was developed in the absence of high-performance GPUs, will have a significant (positive) impact on the development of generative AI generally and represents a major advance. The fact that it has been released open source is also great for competition in the market and will allow developers working with LLMs to access really high-grade #GenAI cheaply. Hopefully those cost savings will ultimately benefit consumers of the products they're building with it.

Link to the podcast in comments. #legaltech #law
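To illustrate the lowest-risk of the four options above, here is a minimal sketch of self-hosting the open weights rather than using the web app or hosted API. The distilled model ID exists on Hugging Face, but the settings are illustrative; once the weights are downloaded, inference runs entirely on your own infrastructure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small distilled R1 variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fits comfortably on a single modest GPU
    device_map="auto",
)

inputs = tokenizer(
    "Explain chain-of-thought prompting briefly.",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```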
-
Who Owns "Public" Data, and How Far Does That Right Extend?

Recent news suggests Microsoft and OpenAI may be investigating whether certain banned accounts - allegedly tied to DeepSeek - used OpenAI's API to "distill" or replicate their large language models without having to shoulder the full cost of scraping and training. Meanwhile, OpenAI itself has faced legal questions about whether it used copyrighted materials freely available online to train its own models.

The Core Tension
- Public Data vs. Proprietary Access: If OpenAI's position is that training on vast amounts of publicly available info is acceptable - even critical - for AI's advancement, why shouldn't other companies do the same?
- Cost and Early Investment: OpenAI's argument might rest on the idea that it invested heavily to develop these models first, so it's unfair for others to bypass the same expense. Yet the principle of "public data for all" can't be enforced selectively; either these datasets are fair game or they're not.
- Distillation as a Long-Standing Technique: Knowledge distillation - where one model is used to refine or train another, often with reduced computational load - isn't new. It's been a recognized approach in AI for a while and is undeniably cost-effective.

Legal and Ethical Questions
- Copyright and Permissions: If creators have not explicitly granted permission for their content to be used as training data, does wide availability on the internet automatically grant AI developers the right to harvest it?
- Level Playing Field: Should early movers in AI be able to limit others from using similar public data sources - or from employing recognized techniques like distillation - just because they invested heavily upfront?

Ultimately, this debate touches on transparency, intellectual property, and the very norms of AI development. It raises the question: if open data is crucial for advancement, how do we ensure consistent rules that encourage innovation while respecting ethical and legal boundaries? Should companies be free to train their models on any public data or API output, or are more nuanced guardrails necessary to keep the playing field fair and innovation-driven?

#innovation #technology #future #management #startups