Personal data is highly sensitive information we entrust to internet companies, and strong regulations require these companies to handle it safely and reliably to meet security, privacy, and compliance standards. In this tech blog, Airbnb’s data science team shares how they built a data classification workflow to establish a unified strategy for identifying and classifying data across all data stores.

The workflow is built on three pillars: Catalog, Detection, and Reconciliation.
- Catalog focuses on creating a dynamic and accurate system to identify where data resides and organize it into a comprehensive inventory.
- Detection addresses the question: what data might be considered personal? This step involves a detection engine structured as a pipeline to scan, validate, and control thresholds for surfacing detected results.
- Reconciliation ensures accurate classification by involving data owners in a human-in-the-loop process to confirm or refine detected classifications.

Given the complexity of the system, the team developed metrics to assess its quality. These metrics—recall, precision, and speed—evaluate how effectively, accurately, and efficiently the classification system operates, ensuring it safeguards personal data over the long term. Additionally, the team shares strategies for governing data classification early in the process, along with best practices for improving workflows. These insights provide a clear understanding of not only the metrics but also actionable ways to enhance classification systems. Highly recommended reading for anyone interested in data governance and security.

#datascience #personal #data #governance #classification #metrics

– – – Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gqxuQ29E
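The post highlights recall, precision, and speed as the quality metrics of the classification system. Below is a minimal sketch of how recall and precision might be computed from a human-reviewed audit sample; the DataFrame column names are illustrative assumptions, not Airbnb's actual schema.

```python
# Minimal sketch: evaluate a data classification system's precision and recall
# against an audit sample confirmed by data owners. Column names
# ("detected_personal", "confirmed_personal") are hypothetical.
import pandas as pd

def classification_quality(audit: pd.DataFrame) -> dict:
    """audit has one row per scanned column: detection flag plus confirmed ground truth."""
    tp = (audit["detected_personal"] & audit["confirmed_personal"]).sum()
    fp = (audit["detected_personal"] & ~audit["confirmed_personal"]).sum()
    fn = (~audit["detected_personal"] & audit["confirmed_personal"]).sum()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

sample = pd.DataFrame({
    "detected_personal":  [True, True, False, True, False],
    "confirmed_personal": [True, False, True, True, False],
})
print(classification_quality(sample))  # precision ~0.67, recall ~0.67
```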
Core Strategies for Privacy Detection Systems
Explore top LinkedIn content from expert professionals.
Summary
Core strategies for privacy detection systems are methods used to identify, classify, and protect sensitive personal information within digital environments, ensuring privacy is preserved while data is still usable. These approaches blend technical workflows, anonymization, and signal disruption to guard against privacy threats and data misuse.
- Build comprehensive inventories: Create a dynamic catalog that maps out where sensitive data resides across your systems, making it easier to monitor and manage privacy risks.
- Use anonymization pipelines: Implement detection engines and replace sensitive information with placeholders so real data stays hidden, especially when processing through cloud-based tools.
- Disrupt identity signals: Regularly audit and remove unnecessary personal data, and take steps to degrade persistent identity signals that could be used to re-identify individuals.
-
📱 Mobile GUI Agents are the next frontier in personal assistants, but privacy requirements are much higher in such environments. These agents capture and process entire screen contents, exposing phone numbers, addresses, messages, and financial data to cloud-based MLLMs. This is where Knowledgator's PII GLiNER models come in.

I want to highlight a new study that uses our GLiNER models as the core PII detection engine in a privacy protection framework for mobile agents:

🔒 Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible
🔗 https://lnkd.in/gQgM6TSR

🤝 Knowledgator's role
The authors used our gliner-pii-large-v1.0 model as the NER backbone for detecting sensitive entities across UI text, XML hierarchies, and OCR-extracted screenshot content:
🔗 https://lnkd.in/gf5yRvG5

The framework enforces an "available but invisible" principle: sensitive data is replaced with deterministic, type-preserving placeholders (e.g., PHONE_NUMBER#a1b2c) so cloud-based agents can still complete tasks — but never see real PII. GLiNER powers the first layer of this pipeline, detecting PII with 60+ entity categories, running locally, with no API calls required.

📊 What the study shows
Tested on AndroidLab (138 real mobile tasks) and PrivScreen (500+ screenshots with 1,000+ synthetic PII), the framework:
▪️ Achieved the lowest privacy leakage across all tested models
▪️ Maintained strong task success rates with only modest utility degradation
▪️ GLiNER inference: ~0.66s per image on a single GPU (~2,800 MB VRAM)
▪️ Total privacy layer overhead: only ~1.77s per image
▪️ Best privacy–utility trade-off among all compared methods

The system anonymises user prompts, XML trees, and screenshots simultaneously, and de-anonymises only during local execution, keeping real data entirely off the cloud.

💡 Takeaway: As mobile agents move toward real-world deployment, privacy can't be an afterthought. This work shows that lightweight, specialised models like GLiNER can serve as the privacy backbone for agentic systems, running locally, processing in under a second, and protecting sensitive data without breaking agent functionality. I expect this trend to go beyond PII reduction: SLMs will take on more tasks running locally, while larger cloud-based models handle orchestration and more complex tasks.
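To make the "available but invisible" idea concrete, here is a minimal sketch of deterministic, type-preserving placeholder substitution on top of the open-source gliner package. The model ID, label set, and placeholder format are illustrative assumptions; the paper's actual pipeline additionally handles XML trees and screenshots.

```python
# Minimal sketch of "available but invisible" anonymization: detect PII locally
# with a GLiNER model, replace each span with a deterministic, type-preserving
# placeholder, and keep a local mapping for de-anonymization at execution time.
# The model ID, labels, and placeholder format are illustrative assumptions.
import hashlib
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-pii-large-v1.0")  # assumed model ID
LABELS = ["phone number", "email", "street address", "credit card number"]

def anonymize(text: str, threshold: float = 0.5):
    entities = model.predict_entities(text, LABELS, threshold=threshold)
    mapping = {}
    # Replace right-to-left so character offsets stay valid while editing.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = ent["label"].upper().replace(" ", "_")
        suffix = hashlib.sha1(ent["text"].encode()).hexdigest()[:5]  # deterministic per value
        placeholder = f"{tag}#{suffix}"
        mapping[placeholder] = ent["text"]
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text, mapping

def deanonymize(text: str, mapping: dict) -> str:
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = anonymize("Call me at 415-555-0132 about the invoice.")
print(masked)  # e.g. "Call me at PHONE_NUMBER#3f9a1 about the invoice."
```

Because the placeholder is derived from a hash of the value, the same phone number always maps to the same token, which is what lets a cloud-side agent keep referring to it consistently while never seeing the real digits.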
-
Deep Identity vs. Privacy Settings

When I talk to clients about #DigitalExecutiveProtection, I have to educate them: privacy is a structural battle. Modern identity systems operate at a deep layer.

⚙️ 𝗧𝗵𝗲 𝗜𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗥𝗲𝗮𝗹𝗶𝘁𝘆
Digital identity today runs on structural rails:
• Payment networks
• Device telemetry
• Cross-domain identity graphs
These systems do not just store your data. They bind identity signals across sectors. Here are some examples:
📡 Device telemetry creates a silent identity based on hardware and browser characteristics. Even if you delete accounts, these signals can reconnect you across services.
📊 Identity graphs stitch together phone numbers, emails, devices, and behavior into probabilistic profiles.
🪧 Once identity signals converge, they become structurally durable. Deleting a single source rarely resets the system.

⚫ 𝗪𝗵𝘆 𝗢𝗽𝘁-𝗢𝘂𝘁 𝗜𝘀𝗻'𝘁 𝗘𝗻𝗼𝘂𝗴𝗵
Most privacy tools operate at the interface layer, the part users can see. But the real identity infrastructure sits underneath. So clicking “opt-out” often becomes privacy theater. It manages the interface. It does not change the underlying rails.

⚫ 𝗦𝗼 𝗪𝗵𝘆 𝗗𝗼 𝗗𝗲𝗹𝗲𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗣𝗿𝗼𝘁𝗲𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗶𝗹𝗹 𝗠𝗮𝘁𝘁𝗲𝗿?
Because even when identity rails exist, the system still needs signals to function. Those signals can be disrupted. That is where privacy strategy matters.

🔥 Breaking the Linkage
Identity becomes durable when signals converge, for example: phone + credit card + home address. Professional deletion can break those connections before they harden into structural anchors.

🔥 Signal Degradation
Identity signals decay over time. By removing data broker records and limiting digital exhaust, old signals become probabilistic guesses instead of deterministic facts.

🔥 Surface Area Reduction
Threat actors rarely start with the deep identity rails. They begin with OSINT and search systems. Once all of the visible surface data is removed, it becomes dramatically harder for them to hook into deeper identity infrastructure.

🔥 Early Intervention
Once identity is bound to financial or infrastructure rails, it becomes systemic. Strategic privacy work prevents new identity anchors from forming.

⚫ 𝗧𝗵𝗲 𝗢𝗜𝗤 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆
ObscureIQ goes beyond surface-level deletion. Our objective is identity disruption:
• DeepDelete across high-risk data brokers
• Reduce device telemetry exposure
• Stop new digital exhaust from forming
• Erode the signal accuracy inside identity graphs
Because modern privacy requires degrading the systems that try to track you, making those systems less certain and more probabilistic.

Full blog post: https://lnkd.in/eQJjadSR
Part of the Identity Infrastructure Series from ObscureIQ.

#privacy #dataprivacy #digitalidentity #identityinfrastructure #databrokers #cybersecurity #techpolicy #osint
-
Unveiling 𝗜𝗱𝗲𝗻𝘁𝗶𝗳𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Ever encounter the LINDDUN framework? It's privacy threat modeling's gold standard, with 'I' signifying Identifiability - a threat that can strip away the veil of anonymity, laying bare our private lives.

A real-life instance: Latanya Sweeney re-identified a state governor's 'anonymous' medical records using public data and de-identified health records. Here, the supposed privacy fortress crumbled.

Identifiability can compromise privacy, anonymity, and pseudonymity. A mere link between a name, face, or tag and data can divulge a trove of personal info. So, what can go wrong? Almost everything.

Designing a system or sharing a dataset? Embed privacy into the core. As a Data Privacy Engineer, consider these strategies:
1. Limit data collection.
2. Apply strong anonymization techniques.
3. Release pseudonymized datasets with legal protections.
4. Generate a synthetic dataset where applicable.
5. Audit regularly for re-identification vectors.
6. Educate stakeholders about risks and mitigation roles.

Striking a balance between data utility and privacy protection is tricky but crucial for maintaining trust in our digitized realm. Reflect on how you're handling 'Identifiability'. Are your strategies sufficient? Bolster your data privacy defenses now.
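The Sweeney example hinges on quasi-identifiers (ZIP code, birth date, sex) that single people out even after names are removed. Below is a minimal sketch of auditing a dataset for such re-identification vectors by measuring k-anonymity over a chosen set of quasi-identifier columns; the column names and data are illustrative assumptions.

```python
# Minimal sketch: audit a "de-identified" dataset for re-identification risk by
# computing k-anonymity over quasi-identifier columns. Column names and data
# are illustrative assumptions.
import pandas as pd

QUASI_IDENTIFIERS = ["zip_code", "birth_date", "sex"]

def k_anonymity_report(df: pd.DataFrame, quasi_ids=QUASI_IDENTIFIERS) -> dict:
    group_sizes = df.groupby(quasi_ids).size()
    k = int(group_sizes.min())                    # smallest equivalence class
    singled_out = int((group_sizes == 1).sum())   # classes of size 1 identify a person
    return {"k": k, "singled_out_groups": singled_out}

records = pd.DataFrame({
    "zip_code":   ["02138", "02138", "02139", "02139"],
    "birth_date": ["1945-07-31", "1945-07-31", "1950-01-02", "1960-03-04"],
    "sex":        ["M", "M", "F", "F"],
    "diagnosis":  ["A", "B", "C", "D"],
})
print(k_anonymity_report(records))  # {'k': 1, 'singled_out_groups': 2} -> high risk
```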
-
Encrypted Client Hello (ECH): A 𝐭𝐰𝐨-𝐞𝐝𝐠𝐞𝐝 𝐬𝐰𝐨𝐫𝐝 𝐟𝐨𝐫 𝐩𝐫𝐢𝐯𝐚𝐜𝐲 𝐚𝐧𝐝 𝐝𝐞𝐭𝐞𝐜𝐭𝐢𝐨𝐧 (JA3 / JA4 focus)

ECH improves client privacy by encrypting the ClientHelloInner (SNI + many extensions), but that same protection removes the raw inputs used by TLS fingerprinting (JA3/JA3S) and reduces the on-path efficacy of newer fingerprints (JA4) unless telemetry moves server-side or detection uses additional signals.

Why ECH breaks fingerprinting
🟠 JA3/JA3S depend on cleartext ClientHello fields (version, ciphers, extensions, curves) to build hashes. If those fields are moved into an encrypted ClientHelloInner, an on-path sensor sees only the outer wrapper, and the JA3 inputs vanish or change.
🟠 JA4 increases robustness (modular, multi-signal, human-readable strings) and handles extension reordering better, but it still cannot access fields that are cryptographically hidden on-path, so JA4 helps but does not fully solve ECH’s visibility loss.

Concrete actions for experts
🔵 Collect server-side TLS telemetry where possible. If you control the app or CDN, log ClientHelloInner / server JA3/JA4 at the trust boundary and ingest it into SIEM/NDR. This restores the missing fingerprint.
🔵 Enrich signal sets. Correlate outer-hello metadata, TLS record sizes/timing, QUIC/HTTP patterns, TCP/IP flow features, DNS/DoH logs, and host/process telemetry. Multi-vector correlation reduces single-signal failure.
🔵 Adopt JA4 (and JA4+) where feasible. JA4’s modular strings and multi-protocol signals are more resilient to randomization and provide better hunting primitives than opaque MD5 JA3s. Still expect gaps when ECH is present.
🔵 Behavioral and similarity detection. Move from exact-match hashes to similarity/clustering models on flow and behavior features; validate for false positives.
🔵 Monitor ECH adoption & risk. Build dashboards for % ECH handshakes, destinations using ECH, and correlations with failed detections — prioritize controls based on measured adoption.

Links:
- https://lnkd.in/dWwtH3D6
- https://lnkd.in/dwD3hUqf

#EncryptedClientHello #ECH #TLS #JA3 #JA4 #TLSFingerprinting #NetworkSecurity #ThreatHunting #NDR #SOC #IR #ServerTelemetry #PrivacyVsVisibility #TLS1_3 #QUIC #FlowAnalytics #BehavioralDetection #CDF #CDN #CyberSecurity
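For readers less familiar with the construction: JA3 concatenates the cleartext ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, EC point formats) into a comma-separated string with dash-separated values and MD5-hashes the result. The minimal sketch below shows that construction and, by implication, why ECH breaks it: once those fields sit in the encrypted ClientHelloInner, an on-path sensor simply no longer has the inputs. The numeric field values are illustrative.

```python
# Minimal sketch of the JA3 fingerprint construction: join ClientHello fields
# (version, ciphers, extensions, curves, point formats) with commas, values
# with dashes, then MD5 the string. The numeric values below are illustrative.
import hashlib

def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode()).hexdigest()

# With ECH, these fields live in the encrypted ClientHelloInner, so an on-path
# sensor cannot even assemble this string, let alone hash it.
s, h = ja3_fingerprint(771, [4865, 4866, 49195], [0, 11, 10, 35, 16], [29, 23, 24], [0])
print(s)  # "771,4865-4866-49195,0-11-10-35-16,29-23-24,0"
print(h)
```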
-
Privacy in 2025: Why Relying on People Will Lead to Failure

💡 In 2025, privacy programs that rely heavily on people and relationships are setting themselves up for failure. Here’s why:
1️⃣ People Forget. New projects, new tools, or even small tweaks to processes often go unreported. Not intentionally—just because it’s human nature.
2️⃣ Relationships Have Limits. Privacy teams can’t build personal connections with every developer, marketer, or product owner. And even when relationships exist, they aren’t a foolproof system for ensuring compliance.
3️⃣ The Pace Is Too Fast. In today’s tech-driven world, projects move faster than ever. Privacy programs relying on people to report data usage or risks will always be playing catch-up.

🚨 The result? Gaps in your privacy program, compliance risks, and a loss of trust.

2025 is the year to flip the script. Privacy programs must evolve from being relationship-dependent to being insight-driven.
👉 Shift to Proactive Privacy: Stop waiting for people to come to you. Use tools that monitor systems, code, and data flows automatically. Surface privacy risks directly to teams before they even realize there’s an issue.
👉 From People to Processes: Build workflows and systems that provide teams with the privacy insights they need, tailored to the work they’re doing.
👉 Assess Continuously, Not Periodically: Move beyond static assessments to real-time analysis of privacy risks as work progresses.
👉 Predict and Prevent: Automate privacy detection and give teams actionable recommendations—so they don’t have to ask for guidance.

Relationships are a good foundation, but in 2025, they’re no longer enough. True privacy maturity means being less reliant on people and more focused on embedding privacy into the tools and processes teams already use. Don’t let human dependency be the weak link in your privacy strategy. It’s time to evolve.

#Privacy2025 #GDPR #DataProtection #PrivacyTech
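As one small illustration of "monitor systems, code, and data flows automatically", the sketch below flags likely personal-data fields in new table definitions so risks surface without anyone having to remember to report them. The field-name heuristics and schema format are illustrative assumptions, not a reference to any particular privacy tool.

```python
# Minimal sketch: flag likely personal-data fields in table schemas so privacy
# risks surface automatically instead of relying on manual reporting.
# The keyword list and schema format are illustrative assumptions.
import re

PII_HINTS = re.compile(
    r"(email|phone|ssn|passport|birth|address|first_name|last_name|ip_address|geo)",
    re.IGNORECASE,
)

def scan_schema(tables: dict) -> list:
    """Return (table, column) pairs whose names suggest personal data."""
    return [
        (table, column)
        for table, columns in tables.items()
        for column in columns
        if PII_HINTS.search(column)
    ]

new_tables = {
    "orders": ["order_id", "total_cents", "shipping_address"],
    "newsletter_signups": ["signup_id", "email", "created_at"],
}
for table, column in scan_schema(new_tables):
    print(f"Review needed: {table}.{column} may contain personal data")
```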
-
Ever try to manage cookie and consent compliance at scale? We’ve automated monitoring across more than 100 enterprise websites and mobile apps, and here’s what we learned.

First, what you expect isn’t always what you’ll find. Even in mature organizations, we uncovered dozens of unapproved trackers, shadow tags, expired consent notices, and signals that were flat out ignored by third-party tools. Manual audits miss these. Every. Single. Time.

Automating this process surfaced a few hard truths:
- Sites and apps constantly change. Hardcoded scanning rules break fast.
- Marketing teams often add new tags without telling privacy, creating silent risks.
- Consent banners, even from top CMPs, don’t always behave the way you expect, especially after new releases.
- Mobile apps have their own unique consent gaps, especially with SDKs updating in the background.

But with real-time, automated monitoring, we spotted issues within hours, not months.

A few lessons that stuck with us:
1. Pair code and UI analysis. You need to see both what users and systems see.
2. Don’t rely on blocklists; they get outdated overnight. Use anomaly detection to spot new risks.
3. Build privacy checks into existing marketing and dev workflows from the start.

Bottom line: automation doesn’t just catch more issues, it forces alignment across teams and keeps privacy in step with the speed of business. If you’re still relying on periodic manual checks, you’re probably missing more than you know.
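As a minimal illustration of lesson 2 (anomaly detection rather than static blocklists), the sketch below compares the third-party hosts observed in a scan against an approved list plus hosts seen in previous scans, and flags anything new for review. The domains and data structures are illustrative assumptions, not a description of any particular scanning product.

```python
# Minimal sketch: flag newly observed third-party trackers instead of relying on
# a static blocklist. Domains and the approved list are illustrative assumptions.
APPROVED = {"googletagmanager.com", "cdn.cookielaw.org"}

def new_trackers(observed_hosts: set, seen_before: set) -> set:
    """Anything not approved and not seen in prior scans needs human review."""
    return observed_hosts - APPROVED - seen_before

previous_scans = {"googletagmanager.com", "analytics.example-vendor.com"}
latest_scan = {"googletagmanager.com", "analytics.example-vendor.com", "pixel.new-adtech.io"}

for host in sorted(new_trackers(latest_scan, previous_scans)):
    print(f"Unreviewed third-party tracker detected: {host}")
# -> Unreviewed third-party tracker detected: pixel.new-adtech.io
```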
-
Embedding privacy-by-design principles into new projects and systems ensures that privacy is considered throughout the entire lifecycle of the project, from initial design through development, deployment, and decommissioning. Here’s how to embed these principles, with examples:

Privacy-by-Design Principles

1. Proactive not Reactive; Preventative not Remedial:
• Embed privacy features proactively rather than as an afterthought.
• Example: When designing a new customer feedback app, include encryption and secure data storage from the outset to prevent data breaches.

2. Privacy as the Default Setting:
• Ensure personal data is automatically protected in any IT system or business practice.
• Example: A new online service automatically opts users out of data sharing by default. Users must explicitly opt in if they choose to share their data.

3. Privacy Embedded into Design:
• Integrate privacy into the architecture of IT systems and business practices.
• Example: When developing a mobile banking app, ensure that data minimization is a core feature, collecting only the data necessary for the service.

4. Full Functionality – Positive-Sum, not Zero-Sum:
• Avoid unnecessary trade-offs; ensure both privacy and functionality are achievable.
• Example: A health app provides personalized services without compromising user privacy by using anonymized data for analytics.

5. End-to-End Security – Lifecycle Protection:
• Secure personal data throughout its entire lifecycle, from collection to deletion.
• Example: In a new document management system, implement encryption, secure access controls, and regular data deletion policies.

6. Visibility and Transparency – Keep it Open:
• Ensure all stakeholders are aware of data practices, and the systems are open to scrutiny.
• Example: A cloud service platform includes clear privacy policies, regular privacy impact assessments (PIAs), and audit logs available to users and regulators.

7. Respect for User Privacy – Keep it User-Centric:
• Prioritize user privacy preferences and control.
• Example: A social media platform allows users to easily manage their privacy settings and provides tools for users to understand and control how their data is used.
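Principle 2 (privacy as the default setting) translates directly into code: every sharing-related preference starts in its most protective state and changes only through an explicit user action. A minimal sketch under that assumption follows; the field names are illustrative.

```python
# Minimal sketch of "privacy as the default setting": a new account starts with
# every sharing option disabled; users must explicitly opt in to change that.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class PrivacySettings:
    share_profile_publicly: bool = False   # most protective value is the default
    share_data_with_partners: bool = False
    personalized_ads: bool = False

@dataclass
class Account:
    email: str
    privacy: PrivacySettings = field(default_factory=PrivacySettings)

    def opt_in(self, setting: str) -> None:
        """Only an explicit user action flips a privacy setting on."""
        if not hasattr(self.privacy, setting):
            raise ValueError(f"Unknown privacy setting: {setting}")
        setattr(self.privacy, setting, True)

account = Account(email="user@example.com")
print(account.privacy)              # everything off until the user opts in
account.opt_in("personalized_ads")  # explicit opt-in, never a pre-ticked box
```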
-
You're deploying machine learning models in real-time. How do you ensure data privacy and security?

1. Data Encryption
Use Transport Layer Security (TLS) or Secure Sockets Layer (SSL) to encrypt data during transmission. This ensures that sensitive information, such as personal data, is protected from interception while it’s being transferred between clients and servers.

2. Data Anonymization and Pseudonymization
Anonymize data before processing it to remove personally identifiable information (PII).

3. Access Control and Authentication
Implement role-based access control (RBAC) to ensure that only authorized personnel or systems have access to sensitive data. Each user or system should only have access to the data they need to perform their job or task.

4. Differential Privacy
Differential privacy is a technique where noise is added to the data to ensure that individual records cannot be identified from aggregated outputs, even in models that provide real-time predictions.

5. Secure Machine Learning Algorithms
Implement homomorphic encryption, which allows computations to be performed on encrypted data without needing to decrypt it first. This ensures that sensitive data remains private while being processed. Federated learning is another option: instead of centralizing data in a server, it allows training models directly on decentralized data (e.g., data on users’ devices).

6. Data Minimization
Collect only the data needed for model training and inference. Avoid retaining excessive data or using data that is not essential for model performance. Establish and follow data retention policies to ensure that sensitive data is deleted when no longer necessary, minimizing the risk of exposure.

7. Data Auditing and Monitoring
Continuously monitor data flows, access logs, and model outputs to detect unusual patterns that could signal a potential data breach or security vulnerability. Regularly check for data integrity to ensure no unauthorized data modifications have occurred.

8. Regulatory Compliance
Ensure compliance with data privacy regulations such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA). This includes ensuring that data subject rights, like access, rectification, and deletion of personal data, are respected.

9. Secure Deployment Practices
Ensure the security of the deployed models by using techniques such as adversarial training (to make models robust against adversarial attacks) and regular vulnerability scanning. Protect the environments (e.g., cloud platforms, servers) where your models are deployed. Ensure these systems are secure by configuring firewalls and intrusion detection systems, and by patching them regularly.

10. Collaboration with Security Teams
Work closely with cybersecurity professionals to assess and mitigate potential risks associated with machine learning systems. This could include performing threat modeling, implementing secure development practices, and ensuring that appropriate security measures are in place.
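To make item 4 concrete, here is a minimal sketch of the Laplace mechanism, the textbook way of adding calibrated noise to an aggregate so that any single record has a bounded effect on the output. The epsilon value and the example query are illustrative; a production system would also track query sensitivity and privacy budgets carefully.

```python
# Minimal sketch of the Laplace mechanism for differential privacy: add noise
# scaled to (sensitivity / epsilon) to an aggregate query result. Epsilon and
# the example query are illustrative assumptions.
import numpy as np

def dp_count(values, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Differentially private count: one record changes the true count by at most 1."""
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

user_events = ["click"] * 1042  # stand-in for sensitive per-user records
print(f"True count: {len(user_events)}, DP count: {dp_count(user_events):.1f}")
```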
-
𝐃𝐞𝐞𝐩 𝐃𝐢𝐯𝐞: 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚 𝐏𝐫𝐢𝐯𝐚𝐜𝐲 𝐢𝐧 𝐌𝐋 𝐌𝐨𝐝𝐞𝐥𝐬 - 𝐏𝐚𝐫𝐭 𝟐

Following our previous discussion on MIA and DP, let's explore two more crucial concepts in ML privacy: Model Reconstruction Attacks (MRA) and their countermeasure, Federated Learning.

𝐌𝐨𝐝𝐞𝐥 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐀𝐭𝐭𝐚𝐜𝐤𝐬 (𝐌𝐑𝐀):
Model Reconstruction Attacks (MRA) represent a sophisticated class of privacy breaches where adversaries attempt to reconstruct the original training data or extract critical model parameters through systematic probing of the target model's behavior. Imagine a digital archaeologist who can reconstruct ancient artifacts from their fragments. MRAs work similarly, but with data:

1. Attack Mechanism:
• Systematically probes model behavior to reconstruct training data
• Exploits neural networks' inherent tendency to memorize training data
• Uses optimization techniques to "reverse-engineer" training samples

2. Three-Phase Attack Pipeline:
a) Attack Surface Analysis
• Analyzes learned parameters (weights, biases)
• Studies gradient responses and prediction distributions
• Maps decision boundaries and behavioral patterns
b) Reconstruction Process
• Generates synthetic inputs through random initialization
• Performs iterative refinement using gradient descent
• Matches gradients to align with observed behaviors
• Maps between model embeddings to discover latent features
c) Output Validation
• Assesses reconstruction quality
• Compares statistical similarities
• Implements feedback loops for refinement

Real-world impact: In computer vision, MRAs have successfully reconstructed recognizable facial images from facial recognition models – a serious privacy concern.

𝐓𝐡𝐞 𝐃𝐞𝐟𝐞𝐧𝐬𝐞: 𝐅𝐞𝐝𝐞𝐫𝐚𝐭𝐞𝐝 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 (𝐅𝐋)
Enter Federated Learning – a revolutionary approach to protecting against MRAs:

1. Core Concept:
• Instead of data coming to the model, the model goes to the data
• Training happens locally on devices/organizations
• Only model updates are shared, never raw data
• Even if an attacker compromises the global model, they can only see aggregated updates, not individual training samples

2. How it Works:
• Decentralized nodes train local models on private data
• Nodes share only parameter updates with the central server
• The server aggregates updates using Federated Averaging (FedAvg)
• The aggregation process acts as a natural defense against reconstruction
• The global model improves while data stays private

3. Privacy Mechanisms:
• Differential privacy during training
• Secure aggregation for communication
• Model distillation for deployment

4. Real-world Success Stories:
• Google: Mobile keyboard prediction
• Healthcare: Collaborative research on patient data
• Cross-organizational: Privacy-compliant machine learning

#MachineLearning #Privacy #Security #DataScience #FederatedLearning #ML #ArtificialIntelligence #Cybersecurity #DistributedComputing
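To illustrate the FedAvg step mentioned under "How it Works", here is a minimal sketch of a single federated round: each simulated client trains locally on its own data, and the server averages the resulting parameters weighted by local dataset size, so only updates (never raw data) leave the clients. It uses a toy linear model in plain NumPy for brevity and is purely illustrative, not the production FedAvg of any specific system.

```python
# Minimal sketch of one Federated Averaging (FedAvg) round with a toy linear
# model: clients train locally on private data and share only parameters; the
# server computes a data-size-weighted average. Purely illustrative.
import numpy as np

def local_training(weights, X, y, lr=0.1, epochs=20):
    """A few steps of local gradient descent on this client's private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # MSE gradient for a linear model
        w -= lr * grad
    return w

def fedavg(client_weights, client_sizes):
    """Server-side aggregation: weighted average of client parameters."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
global_w = np.zeros(3)
clients = []
for _ in range(4):  # four clients, each holding its own private dataset
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

updates = [local_training(global_w, X, y) for X, y in clients]
global_w = fedavg(updates, [len(y) for _, y in clients])
print("Global model after one round:", np.round(global_w, 2))
```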