If you're following AI, data licensing, and the closing off of the web to web crawlers/scrapers, this is a fresh study with hard data on the topic.

Abstract: "General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research."

https://lnkd.in/ebTfnCgZ
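The robots.txt side of these signals is easy to inspect yourself. Below is a minimal sketch using Python's standard-library parser to check whether a domain disallows common AI crawler user agents; the user-agent list and example domain are illustrative assumptions, not taken from the study:

```python
# Minimal sketch: check a domain's robots.txt for AI-crawler restrictions.
# The user-agent list and example domain are illustrative assumptions.
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def ai_crawl_permissions(domain: str) -> dict:
    """Return {user_agent: allowed_to_fetch_root} for one domain."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # fetches and parses the file
    return {ua: rp.can_fetch(ua, f"https://{domain}/") for ua in AI_CRAWLERS}

if __name__ == "__main__":
    print(ai_crawl_permissions("example.com"))
```

Note that this only reads robots.txt; one of the paper's findings is that Terms of Service often express different, stricter intentions that this mechanism cannot capture.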
How Data Restrictions Affect AI Training
Summary
Data restrictions limit access to, and use of, the information required to train artificial intelligence (AI) models, which can affect model performance, diversity, and real-world applicability. As legal and regulatory barriers grow, AI developers face challenges in sourcing fresh, high-quality data, making it harder for AI systems to learn and adapt.
- Monitor changing rules: Stay updated on evolving data regulations and website policies to understand what training data is accessible and avoid legal risks.
- Consider synthetic data: Explore generating realistic, physics-grounded synthetic datasets to overcome shortages and train AI on scenarios that may be rare or impossible to capture.
- Tap alternative sources: Look into less-utilized datasets, such as telco behavioral data or domain-specific archives, to diversify training materials and reduce dependence on restricted web content.
-
You cannot train physical AI on reality alone. There is not enough of it. Jensen Huang explains why NVIDIA built Alpamayo, a robotics model that learns from synthetic data grounded in physics.

The problem is fundamental. Teaching physical AI like autonomous vehicles or robots requires vast amounts of diverse interaction data. Videos exist. Lots of videos. But hardly enough to capture the diversity and type of interactions needed.

So NVIDIA transformed compute into data. Using synthetic data generation grounded in and conditioned by the laws of physics, they can selectively generate training scenarios that reality cannot provide. The example Huang shows is remarkable: output from a basic traffic simulator is fed into the Cosmos AI world model, and what emerges is physically plausible surround video that AI can learn from.

This solves a constraint that has limited physical AI development. You cannot train autonomous systems on every possible scenario by recording reality. There are not enough cameras, time, or situations. But you can simulate physics accurately enough that AI trained on synthetic data generalizes to real environments.

Why this matters beyond autonomous vehicles: any AI learning physical interactions faces the same data scarcity problem. Manufacturing robots, warehouse automation, infrastructure inspection, medical robotics. All require training on scenarios that are rare, dangerous, or impossible to capture at scale. Synthetic data generation grounded in the laws of physics becomes essential infrastructure for physical AI deployment. The organizations building AI for physical systems will either master synthetic data generation or remain limited by whatever reality they can record.

Watch the full presentation to hear Huang explain how Alpamayo generates training data for autonomous vehicles that think like humans. What physical AI application needs synthetic data because reality cannot provide enough examples?
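The post describes the idea at NVIDIA scale; as a toy illustration only (not the Cosmos/Alpamayo pipeline), here is a minimal Python sketch of the core principle: sample scenario parameters, then roll out simple vehicle kinematics so every generated trajectory is consistent with the assumed physics.

```python
# Toy sketch of physics-grounded synthetic data: sample scenario parameters,
# then roll out simple kinematics so every trajectory obeys motion constraints.
# An illustration of the idea only, not NVIDIA's Cosmos/Alpamayo pipeline.
import math
import random

def rollout(speed_mps: float, steer_rad: float, wheelbase_m: float = 2.8,
            dt: float = 0.1, steps: int = 50) -> list[tuple[float, float]]:
    """Kinematic bicycle model: every position is consistent with the physics."""
    x, y, heading = 0.0, 0.0, 0.0
    path = []
    for _ in range(steps):
        x += speed_mps * math.cos(heading) * dt
        y += speed_mps * math.sin(heading) * dt
        heading += (speed_mps / wheelbase_m) * math.tan(steer_rad) * dt
        path.append((x, y))
    return path

# Selectively generate rare scenarios (e.g., aggressive low-speed turns)
# that would be hard to capture by recording real driving.
scenarios = [rollout(random.uniform(2, 8), random.uniform(-0.4, 0.4))
             for _ in range(1000)]
```

The design point the post makes is exactly this trade: compute buys you coverage of scenario space, while the physics model keeps the generated data plausible enough to transfer to reality.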
-
Telcos Hold the Most Underused Dataset for Real-World AI. But they are not allowed to use it.

While most foundation models today are trained on text scraped from the internet, telcos capture real-world behavioral signals at scale: data that reflects how people move, communicate, and interact with infrastructure and services in physical space. This is not language. It is timestamped, geospatial, structured behavioral data that can be used to model reality, not just simulate language.

A typical mobile operator with 10 to 20 million subscribers collects billions of data points daily. These include cell tower transitions every few seconds per active device, app session patterns by time of day, call initiation and duration, SIM swaps, device changes, recharge frequency for prepaid users, and signal quality metrics across geography. Unlike text scraped online, this data is structured, time-series based, and anchored to physical behavior.

What makes it unique is its ability to infer latent variables that language cannot see. In multiple research studies, airtime purchase history has outperformed credit bureau scores in predicting loan repayment. During COVID-19, aggregated mobility data from operators in Spain, France, and Italy was used to model lockdown effectiveness at higher resolution than official transportation metrics. In countries like Bangladesh and Indonesia, telco data has been used to track population displacement during floods and to measure recovery by analyzing the reappearance of device activity in disaster zones.

If telcos had regulatory parity with digital platforms, they could use this data to train behavior models at national scale. These models could predict urban demand, simulate epidemiological spread, forecast economic stress from collective movement patterns, and enable real-time adaptive systems for energy, transportation, and public services. Language models simulate what humans say. Telco-derived models can simulate what humans do.

The bottleneck is not technical. It is regulatory. While OTTs collect deep behavioral data through app SDKs and web tracking, telcos are prohibited from using even aggregated, anonymized data for secondary AI applications. This asymmetry prevents the development of AI systems that reflect the physical world. If we want foundation models that are grounded in reality, the telco dataset must be part of the equation.
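As a hedged illustration of what "aggregated, anonymized" secondary use could look like in practice, here is a minimal sketch that turns raw tower-transition events into origin-destination counts and suppresses small cells, a common k-anonymity-style safeguard. The column names and threshold K are assumptions for illustration, not from any operator's pipeline.

```python
# Hedged sketch: aggregate tower-transition events into origin-destination
# counts, suppressing small cells (a common k-anonymity-style safeguard).
# Column names and the threshold K are illustrative assumptions.
import pandas as pd

K = 20  # suppress any flow observed for fewer than K distinct devices

def mobility_matrix(events: pd.DataFrame) -> pd.DataFrame:
    """events columns: device_id, from_tower, to_tower, timestamp."""
    flows = (events.groupby(["from_tower", "to_tower"])["device_id"]
                   .nunique()
                   .reset_index(name="devices"))
    return flows[flows["devices"] >= K]  # drop re-identifiable small cells
```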
-
I once believed that licensing requirements for AI training data could stifle competition, leaving only large companies able to train foundation models. But our research shows that without robust preference and compensation mechanisms, websites are restricting access to their content, which makes it hard to responsibly train AI while also undermining access to information for everyone.

In our recent article, accepted to NeurIPS and covered by The New York Times, we analyze the preference signals from 14,000+ web domains commonly used in pretraining. We observe a growing trend of AI data restrictions in websites' Terms of Service and robots.txt files. In an already murky legal environment, more sites are now restricting access to their content. These restrictions don't just hamper AI innovation—they also weaken the open web, limiting information access for all. Our work highlights the urgent need for clear regulations on data use for AI training, showing that current preference signals are insufficient.

A huge thank you to the amazing team of 50+ collaborators, annotators, and advisors who made this research possible. Special thanks to Shayne Longpre, Ariel N. Lee, Campbell Lund, and advisors Alex 'Sandy' Pentland, Jad Kabbara, Sara Hooker, Daphne Ippolito, Hanlin Li, Stella Biderman, Luis Villa, and Caiming Xiong!
-
Are you risking your company’s IP and customer personal data for the convenience of meeting transcription? Convenience is great, but not at the cost of accidentally donating your crown-jewel knowledge and customer personal data to someone else’s AI lab.

AI-powered meeting transcription services are becoming increasingly popular - they offer so much convenience, sometimes even for free. I spent a few days combing through the actual Privacy Policies and Terms of Service for four popular AI notetakers—Otter.ai, Read.ai, Fireflies.ai, and tl;dv—to see whether they train their models on your conversations. I have no association with any of them, but what I found is worrying. Here’s the short version:

🔹 Otter.ai – On by default. Otter trains its speech-recognition models on 'de-identified' audio and text of your conversations. They claim that personal identifiers are stripped, but your confidential data still fuels their AI unless you negotiate a restriction.
🔹 Read.ai – Your choice. By default your data is not used. If you opt in to its Customer Experience Program, your transcripts can help improve the product.
🔹 Fireflies.ai – Aggregated-only. They forbid training on identifiable content, limiting themselves to anonymised usage statistics. No individual transcript feeds their AI.
🔹 tl;dv – Never. They explicitly prohibit using customer recordings for model training. Transcript snippets sent to their AI engine are anonymised, sharded, and not retained.

Why it matters: Even “de-identified” data can leak competitive IP or sensitive customer information if models are ever breached or repurposed. Business recordings can contain personal data, meaning you’re still on the hook for consent, minimisation, and transfer safeguards. Your management, board, and clients may assume you’ve locked this down; finding out later is awkward at best, non-compliant at worst.

By the way - true anonymisation of data is exceptionally difficult, especially in complex data like speech. Claims that only 'de-identified' data is used for training need to be scrutinised. Not one of the products reviewed provided any meaningful technical information about how they achieve this.

What to do next:
1. Read the legal docs—marketing pages are full of assurances, but they don’t tell the full story. Read the privacy policies and terms of service.
2. Decide your red line: zero training, aggregated-only, or opt-in?
3. Configure or negotiate: most vendors offer enterprise DPAs or private-cloud options if you ask.
4. Review the consent flows: it’s not just your rights—your guests’ data is in play too. Have you asked the meeting participants if they are happy to hand their personal data and IP to a third party?

I write about Doing AI Governance for real at ethos-ai.org. Subscribe for free analysis and guidance: https://ethos-ai.org #AIGovernance
-
European Data Protection Board issues long-awaited opinion on AI models: part 3 - anonymisation (see Part 1: https://shorturl.at/TYbq3 consequences, and Part 2: https://shorturl.at/ba5A1 legitimate interest legal basis).

🔹 AI models are not always anonymous; assess case by case.
🔹 AI models specifically designed to provide personal data regarding individuals whose personal data were used to train the model cannot be considered anonymous.
🔹 For an AI model to be considered anonymous, both (1) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to develop the model and (2) the likelihood of obtaining, intentionally or not, such personal data from queries should be insignificant, taking into account ‘all the means reasonably likely to be used’ by the controller or another person.
🔹 Pay special attention to the risk of singling out, which is substantial.
🔹 Consider all means reasonably likely to be used by the controller or another person to identify individuals, which may include: characteristics of the training data, AI model, and training procedure; context; additional information; costs and amount of time needed to obtain such information; available technology and technological developments.
🔹 Such means and levels of testing may differ between a publicly available model and a model to be used only internally by employees.
🔹 Consider the risk of identification by the controller and by different types of ‘other persons’, including unintended third parties accessing the AI model, and unintended reuse or disclosure of the model.

Be able to prove, through steps taken and documentation, that you have taken effective measures to anonymise the AI model. Otherwise, you may be in breach of your accountability obligations under Article 5(2) GDPR. Factors to consider:

🔹 Selection of sources: selection criteria; relevance and adequacy of chosen sources; exclusion of inappropriate sources.
🔹 Preparation of data for the training phase: could you use anonymous or pseudonymous data, and if not, why not; data minimisation strategies and techniques to restrict the volume of personal data included in the training process; data filtering processes to remove irrelevant personal data.
🔹 Methodological choices regarding training: improve model generalisation and reduce overfitting; privacy-preserving techniques (e.g. differential privacy).
🔹 Measures regarding outputs of the model (lower the likelihood of obtaining personal data related to training data from queries).
🔹 Conduct sufficient tests on the model that cover widely known, state-of-the-art attacks, e.g. attribute and membership inference; exfiltration; regurgitation of training data; model inversion; or reconstruction attacks (a minimal membership-inference sketch follows this post).
🔹 Document the process, including: DPIA; advice by the DPO; technical and organisational measures; the AI model’s theoretical resistance to re-identification techniques.

#dataprivacy #dataprotection #privacyFOMO #AIFOMO
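On the testing bullet: one widely known baseline in the membership-inference family is the loss-threshold attack, which flags records on which the model's loss is unusually low as likely training members. A minimal sketch, with stand-in loss distributions and an assumed threshold rule:

```python
# Minimal loss-threshold membership-inference test: records on which the
# model's loss is unusually low are flagged as likely training members.
# Model, loss distributions, and the threshold rule are stand-in assumptions.
import numpy as np

def loss_threshold_attack(per_example_losses: np.ndarray,
                          threshold: float) -> np.ndarray:
    """Return a boolean membership guess for each example."""
    return per_example_losses < threshold

# A strong attack separates known members from known non-members, which is
# evidence the model has memorized them (i.e., is not anonymous w.r.t. them).
members = np.random.gamma(1.0, 0.3, 1000)      # stand-in: losses on train set
non_members = np.random.gamma(2.0, 0.5, 1000)  # stand-in: losses on holdout
threshold = np.median(np.concatenate([members, non_members]))
tpr = loss_threshold_attack(members, threshold).mean()
fpr = loss_threshold_attack(non_members, threshold).mean()
print(f"attack TPR={tpr:.2f} vs FPR={fpr:.2f}")  # large gap signals memorization
```

In a real assessment you would run this on actual per-example losses from the model under review, and document the result as part of the evidence the opinion asks for.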
-
A health system deploys an AI coding tool. Accuracy improves measurably. The vendor asks to use operational data to refine the model for that health system's documentation patterns. The health system's counsel says no. Blanket prohibition, non-negotiable.

Do you know about the HIPAA provision that creates a blanket prohibition on using Protected Health Information (PHI) for AI model training? It doesn’t exist.

I’ve negotiated AI language in technology transactions from multiple vantage points over the last several years. I’ve requested “no training” language from vendors. I’ve represented healthcare organizations in vendor negotiations. And I’ve responded to this language as a health information technology company serving healthcare provider organizations, health plans, pharmaceutical companies, and more. A pattern recurs: a contractual position on “no model training with PHI” that organizations adopt reflexively but often struggle to ground in a consistent regulatory explanation.

Organizational policies and upstream contractual commitments can limit how PHI is used with AI models, and those limitations may be perfectly rational. But they are business constraints, not regulatory ones. The HIPAA Privacy Rule does not prohibit using PHI for AI model training. It provides a framework of data use purposes (e.g., treatment, healthcare operations, research, proper management and administration for business associates) that helps determine what permissions and safeguards apply. Also, where PHI is involved, the structural terms of the deal affect the regulatory analysis. And before the characterization analysis begins, there's a threshold question: if the training uses deidentified data under HIPAA, HIPAA's use restrictions don't apply. A "no training" clause that covers deidentified data is restricting something outside HIPAA's scope.

The healthcare industry would be better served if more organizations worked through HIPAA’s regulatory framework before defaulting to a blanket prohibition. A carefully crafted prohibition may be appropriate in some cases. But it may also foreclose activities that are permissible and beneficial. I’ve seen firsthand how model accuracy improves when models learn from the operational patterns of the healthcare organizations they serve.

I’d like to see sharper discourse about using PHI with AI (and separately, creating PHI with AI...). I'm working on a deeper analysis of the HIPAA characterization framework for AI model training using recent real-world examples. If this is a conversation you're having internally, or if you're negotiating these provisions, I'm interested to hear how your organization is approaching it.
-
Is AI governance actually a new discipline—or simply data governance pushed to its breaking point? As organizations race to deploy AI, many jump straight to model policies and ethics committees. But the evolution tells a more uncomfortable truth: every AI governance failure starts as a data governance shortcut. Here’s how it really happens:

🟣 Data Quality Management → Model Reliability in Production
In traditional data systems, quality meant accuracy and completeness. In AI systems, it determines hallucinations, instability, and silent failure. Poor data doesn’t just create bad dashboards—it creates unpredictable models.

🟣 Data Lineage → Bias Detection & Auditability
Lineage used to be about debugging pipelines. Now it’s about answering regulators, customers, and leadership: “Why did the model make this decision?” If you can’t trace training data, bias remains invisible.

🟣 Access Controls → Ethical Use Boundaries
Role-based access once protected sensitive tables. Today, it defines which datasets models are allowed to learn from—and prevents leakage of regulated or private information into training loops.

🟣 Data Catalogs → Model Registries & Discovery
Metadata discipline scales into model inventories: versions, intended use, risk level, and ownership. Without this, models sprawl faster than data ever did.

🟣 Compliance Frameworks → Regulatory Readiness
GDPR, HIPAA, and CCPA weren’t just legal checklists. They trained organizations for what’s coming next: the EU AI Act, algorithmic accountability, and explainability mandates.

🟣 Data Stewardship → Model Ownership & Accountability
Clear data owners once reduced pipeline chaos. Now they prevent the most dangerous failure mode in AI: “No one owns the model when it breaks.”

🟣 Schema Standards → Training Data Specifications
Schemas once enforced consistency. In AI, they define what valid learning even means—directly shaping model behavior and stability.

🟣 Versioning & Change Control → Drift Monitoring
Data versioning helped explain yesterday’s broken reports. Model drift monitoring explains why yesterday’s accurate model fails today—even when no code changed (see the sketch after this post).

🟣 Security & Encryption → Adversarial Defense
Security no longer stops at data at rest and in transit. It now includes protection against prompt injection, data poisoning, and model extraction attacks.

🟣 Data Quality Metrics → Explainability Requirements
Measurement has expanded: from completeness and freshness to fairness, confidence, and transparency. If it can’t be measured, it can’t be trusted.

The takeaway: AI governance is not a side initiative. It’s data governance under real-world pressure, with legal, ethical, and business consequences. If your data governance is immature, your AI governance will be performative—no matter how polished the policy deck looks.

Let’s discuss: which layer in this evolution is your organization struggling with the most right now?
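On the drift-monitoring point above: here is a minimal sketch of an input-drift check, comparing a feature's live distribution against its training-time distribution with a two-sample Kolmogorov-Smirnov test. The feature, distributions, and alert threshold are illustrative assumptions.

```python
# Minimal input-drift check: compare a feature's recent distribution against
# its training-time distribution with a two-sample Kolmogorov-Smirnov test.
# The alert threshold (p < 0.01) is an illustrative assumption.
import numpy as np
from scipy.stats import ks_2samp

def drifted(train_values: np.ndarray, live_values: np.ndarray,
            alpha: float = 0.01) -> bool:
    """True if live data is unlikely to come from the training distribution."""
    stat, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

# Yesterday's accurate model can fail today even when no code changed:
train = np.random.normal(0.0, 1.0, 5000)  # feature at training time
live = np.random.normal(0.4, 1.0, 5000)   # same feature in production, shifted
print(drifted(train, live))               # True: investigate before it bites
```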
-
The Oregon Department of Justice released new guidance on legal requirements when using AI. Here are the key privacy considerations, and four steps for companies to stay in line with Oregon privacy law. ⤵️

The guidance details the AG's views on how uses of personal data in connection with AI, or to train AI models, trigger obligations under the Oregon Consumer Privacy Act, including:

🔸 Privacy Notices. Companies must disclose in their privacy notices when personal data is used to train AI systems.
🔸 Consent. Updated privacy policies disclosing uses of personal data for AI training cannot justify the use of previously collected personal data for AI training; affirmative consent must be obtained.
🔸 Revoking Consent. Where consent is provided to use personal data for AI training, there must be a way to withdraw consent, and processing of that personal data must end within 15 days.
🔸 Sensitive Data. Explicit consent must be obtained before sensitive personal data is used to develop or train AI systems.
🔸 Training Datasets. Developers purchasing or using third-party personal data sets for model training may be personal data controllers, with all the required obligations that data controllers have under the law.
🔸 Opt-Out Rights. Consumers have the right to opt out of AI uses for certain decisions like housing, education, or lending.
🔸 Deletion. Consumer #PersonalData deletion rights need to be respected when using AI models.
🔸 Assessments. Using personal data in connection with AI models, or processing it in connection with AI models that involve profiling or other activities with heightened risk of harm, triggers data protection assessment requirements.

The guidance also highlights a number of scenarios where sales practices using AI, or misrepresentations due to AI use, can violate the Unlawful Trade Practices Act.

Here are a few steps to help stay on top of #privacy requirements under Oregon law and this guidance:

1️⃣ Confirm whether your organization or its vendors train #ArtificialIntelligence solutions on personal data.
2️⃣ Validate that your organization's privacy notice discloses AI training practices.
3️⃣ Make sure organizational individual-rights processes are scoped for personal data used in AI training (a minimal consent-gating sketch follows this post).
4️⃣ Set assessment protocols where required to conduct and document data protection assessments that address the requirements under Oregon and other states' laws, and that are maintained in a format that can be provided to regulators.
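For step 3️⃣, here is a hedged sketch of one way to gate training records on affirmative consent and surface withdrawal deadlines; the schema and field names are hypothetical, not from the guidance.

```python
# Hedged sketch: filter training records on affirmative consent and surface
# the 15-day withdrawal deadlines. Schema and field names are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Record:
    subject_id: str
    consent_to_ai_training: bool       # affirmative opt-in, not a policy update
    consent_withdrawn_at: datetime | None  # None if never withdrawn

def training_eligible(records: list[Record]) -> list[Record]:
    """Keep only records with affirmative, unwithdrawn consent.
    Re-evaluate before every training run, since consent can be revoked."""
    return [r for r in records
            if r.consent_to_ai_training and r.consent_withdrawn_at is None]

def withdrawal_deadlines(records: list[Record]) -> dict[str, datetime]:
    """Latest date by which processing of each withdrawn record must stop
    (the guidance requires processing to end within 15 days of withdrawal)."""
    return {r.subject_id: r.consent_withdrawn_at + timedelta(days=15)
            for r in records if r.consent_withdrawn_at is not None}
```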