This paper describes how a large pharmaceutical company adopted an ontology-based data management strategy to ensure scientific data is findable, accessible, interoperable, and reusable from the moment it is generated.
1️⃣ The approach emphasizes creating structured, high-quality data at the source to preserve context and reduce downstream processing time.
2️⃣ Standardized vocabularies and models (ontologies) are used to align data across systems and teams, supporting consistency and integration.
3️⃣ Public ontologies are adapted with organization-specific extensions while maintaining compatibility with external data standards.
4️⃣ Simplified term lists are derived from complex models to enable broader adoption across teams with varying technical backgrounds.
5️⃣ Data from different systems is integrated virtually rather than physically moved, enabling secure, real-time access without redundancy.
6️⃣ This framework enhances the performance of advanced analytics and machine learning by providing clear, semantically rich context.
7️⃣ Controlled vocabularies are delivered through interfaces like APIs and dropdowns, ensuring consistent metadata usage at scale.
8️⃣ The unified semantic structure improves enterprise search, allowing users to retrieve contextually relevant data from across domains.
9️⃣ Adoption metrics show growing usage across multiple phases of the pharmaceutical value chain, reflecting system scalability and value.
🔟 Organizational alignment, from executive support to operational implementation, has been critical, with recent advances in AI further enabling this transformation.
✍🏻 Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose. Digital evolution: Novo Nordisk's shift to ontology-based data management. Journal of Biomedical Semantics, 2025.
DOI: 10.1186/s13326-025-00327-4
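Point 4️⃣ above, deriving simplified term lists from a richer ontology, can be sketched in a few lines. The terms and parent-child table below are hypothetical illustrations, not the vocabulary the paper describes:

```python
# Minimal sketch: flatten a small parent-child ontology into a
# dropdown-ready term list. The terms here are made up for illustration.

ontology = {
    # term -> list of parent terms (subclass relationships)
    "assay": [],
    "binding assay": ["assay"],
    "cell-based assay": ["assay"],
    "reporter gene assay": ["cell-based assay"],
}

def descendants(term, onto):
    """All terms that are transitive children of `term`."""
    children = [t for t, parents in onto.items() if term in parents]
    found = set(children)
    for child in children:
        found |= descendants(child, onto)
    return found

# A simplified term list for a dropdown: every kind of assay, sorted.
assay_terms = sorted(descendants("assay", ontology))
```

In a real deployment, a list like `assay_terms` would be generated from the governed ontology and served to instruments and ELNs through an API, so non-specialist teams never touch the full model.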
Scientific Data Management
Summary
Scientific Data Management is the practice of organizing, storing, and connecting scientific information—like experiment results and measurements—so researchers can easily find, understand, and reuse it. Modern approaches focus on making scientific data structured and unified, breaking down barriers caused by scattered files and incompatible formats.
- Adopt unified systems: Use centralized data structures or platforms to keep all experimental and analytical data organized and accessible from a single place, minimizing manual tracking and errors.
- Standardize terminology: Create shared vocabularies and data models that work across different scientific teams and tools, so everyone speaks the same language and data can flow smoothly between systems.
- Model relationships: Structure your data to capture connections between experiments and data types, enabling scientists to discover new insights and reuse information for future research.
Data is the new lab bench in biotech, but most companies have a broken bench. Let me explain why this three-part approach is changing everything.

Most biotech data goes unanalyzed—trapped in siloed systems, proprietary formats, and disconnected workflows. The fundamental problem? Traditional architectures treat each experiment as isolated rather than as part of an interconnected knowledge web. This creates a massive cognitive burden for scientists, who spend more time wrangling data than making discoveries. The solution isn't just better databases—it's creating what I call a "memory layer" for scientific knowledge. This layer has three critical components:

1) Structure first, analysis second. Most labs try to analyze raw data directly without proper structure. Effective systems build semantic models that define relationships between experimental components before analysis begins. This seemingly simple shift helps our customers dramatically reduce analysis time and enables previously impossible cross-experimental insights.

2) Graphs, not tables. Biological systems are interconnected networks, yet we force data into rigid tables. Modern graph databases mirror how science actually works—through relationships, connections, and patterns. This approach lets scientists discover "hidden bridges" between seemingly unrelated experiments.

3) Compound intelligence. The true power emerges when these structured, graph-based systems learn over time. Each experiment enriches the model rather than sitting as a static data point. This creates compounding value: the 100th experiment is far more valuable than the first because it connects to everything before it.
One genomics startup we worked with implemented this approach and saw remarkable acceleration:
• They identified targets in weeks rather than months
• Their experimental iterations became significantly faster
• Scientists uncovered novel insights from existing data

What's fascinating is that this approach makes scientists more effective while creating defensible IP in the data model itself. The biotech companies gaining the most investor traction aren't just producing molecules—they're building knowledge systems that get more valuable with every experiment. This is why forward-thinking VCs now evaluate data architecture as thoroughly as the science. As we enter this new era, companies that build proper memory layers will outperform those still treating data as an afterthought.

Wet lab scientists: want to see how this memory layer approach could transform your research? DM me for a demo or subscribe to my newsletter: https://lnkd.in/gsyuTb_5
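The "hidden bridges" idea from point 2 can be sketched with a plain adjacency map and breadth-first search, no graph database required. The experiment and entity names below are hypothetical:

```python
from collections import deque

# Hypothetical experiment graph: nodes are experiments or biological
# entities; an edge means an experiment measured or used that entity.
edges = [
    ("exp_001", "gene_TP53"),
    ("exp_002", "gene_TP53"),
    ("exp_002", "compound_A"),
    ("exp_003", "compound_A"),
]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def bridge(graph, start, goal):
    """Shortest chain of shared entities linking two experiments (BFS)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

bridge(graph, "exp_001", "exp_003")
# -> ['exp_001', 'gene_TP53', 'exp_002', 'compound_A', 'exp_003']
```

Here exp_001 and exp_003 never shared a reagent or target directly, yet the traversal surfaces the chain connecting them through exp_002 — exactly the kind of link a table join would miss.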
-
Still wrangling endless CSVs in your lab workflow? There's a smarter way: unify all your data with xarray. Curious how a single data structure can simplify everything? Read on.

After years of managing experimental and machine learning data across scattered files and formats, I realized the cognitive load of keeping everything aligned was overwhelming, so I started exploring unified data structures to reduce that friction. For example, I once spent days writing index-matching code just to keep my training data, features, and model outputs in sync across multiple files. It was exhausting and error-prone—one small misalignment could break the whole pipeline.

Traditional lab data management means scattered files, mismatched indices, and constant manual bookkeeping. Inspired by a recent talk at SciPy, I built a synthetic microRNA study example to show how xarray can unify raw measurements, computed features, and model outputs in a single, coordinate-aligned Dataset—no more index-matching headaches. With xarray, you can store all your experimental measurements, computed features, statistical estimates, and even train/test splits in one dataset. Every piece of data knows exactly where it belongs.

In my latest blog post, I walk through this synthetic example step by step. The result? Cleaner workflows, bulletproof data consistency, and cloud-native scalability. If you're ready to reduce friction in your experimental data lifecycle, check out the post for a practical guide: https://lnkd.in/eXqGJB57

Would love to hear your thoughts or experiences! How are you currently managing complex experimental or ML data? Have you tried a unified approach like xarray? #datascience #laboratoryinformatics #machinelearning #xarray #bioinformatics
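A minimal sketch of the coordinate-aligned Dataset idea the post describes, with made-up miRNA names and values (the full worked example is in the linked blog post):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)

# Hypothetical miRNA study: one Dataset holds raw measurements, a
# computed per-miRNA feature, and the train/test split, all aligned
# on shared "sample" and "mirna" coordinates.
ds = xr.Dataset(
    data_vars={
        "expression": (("sample", "mirna"), rng.normal(size=(6, 3))),
        "split": (("sample",), ["train"] * 4 + ["test"] * 2),
    },
    coords={
        "sample": [f"s{i}" for i in range(6)],
        "mirna": ["miR-21", "miR-155", "miR-200a"],
    },
)

# Computed features stay aligned automatically -- no index matching.
ds["mean_expression"] = ds["expression"].mean(dim="sample")

# Select the training subset by label, not by positional bookkeeping.
train = ds["expression"].where(ds["split"] == "train", drop=True)
```

Because the split lives in the same Dataset as the measurements, there is no separate index file that can drift out of sync with the data it describes.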
-
Why is development of a holistic data strategy so hard in pharmaceutical R&D? The complexity lies in how we approach the challenge. Technology-driven data lakes often fail to deliver value, while use-case-driven solutions provide quick wins but don't scale effectively across the organization.

The reality of pharmaceutical R&D adds multiple layers of complexity:
- Scientific processes span multiple domains: molecular biology, analytical chemistry, process development, and clinical research
- Each domain generates unique data types from diverse instruments, often in proprietary formats
- Critical context exists in unstructured lab notebooks and regulatory documentation
- Data needs to flow seamlessly while maintaining compliance and scientific rigor
- Domain-specific vocabularies evolve independently, creating semantic gaps between systems

A domain-focused approach provides a better foundation:
- Start with understanding scientific workflows and processes before jumping to technology implementation
- Develop standardized ontologies that bridge molecular, cellular, and process-level concepts
- Create unified vocabularies that work across LIMS, ELN, and analytical systems while preserving domain-specific precision
- Establish data governance frameworks that maintain terminology consistency from instrument data capture through analysis
- Build data models that connect structured experimental data with its scientific context through standardized terms

The challenge of vocabulary standardization is particularly critical. When analytical chemists, molecular biologists, and process engineers all use different terms for related concepts, data integration becomes nearly impossible. We need unified taxonomies that preserve scientific meaning while enabling cross-domain analysis. This approach creates both immediate tactical value through targeted solutions and long-term strategic infrastructure that can effectively support AI/ML initiatives.
The key is understanding that data strategy must follow scientific processes, not force-fit them into generic IT frameworks. I've found that looking at the data landscape through scientific domain lenses rather than IT systems often reveals hidden integration opportunities that traditional approaches miss. #datastrategy #lifesciences #pharma #biotechnology
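The unified-vocabulary idea above can be sketched as a synonym-to-canonical mapping. The terms below are illustrative, and a production system would serve such a table from a governed terminology service rather than a hard-coded dict:

```python
# Minimal sketch of vocabulary normalization across LIMS/ELN systems.
# The synonym table is hypothetical; in practice it would be curated
# under data governance and exposed to instruments via an API.

CANONICAL = {
    "api": "active pharmaceutical ingredient",
    "drug substance": "active pharmaceutical ingredient",
    "active ingredient": "active pharmaceutical ingredient",
    "hplc": "high-performance liquid chromatography",
}

def normalize(term: str) -> str:
    """Map a system-specific term onto the shared canonical vocabulary."""
    key = term.strip().lower()
    return CANONICAL.get(key, key)

normalize("Drug Substance")  # -> "active pharmaceutical ingredient"
```

Applying normalization at the point of data capture, rather than during downstream cleanup, is what keeps terminology consistent "from instrument data capture through analysis."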
-
🚨 Here's the uncomfortable truth about "unstructured" data in life sciences: we've been approaching it all wrong.

When we label complex scientific data that comes in bespoke formats, such as genomic variants, single-cell data, and biomedical imaging, as "unstructured," we automatically give up on properly organizing it. The result? Data silos, inefficient storage, and missed discoveries that could change lives.

But here's what I've learned after years of building data systems: no data is truly unstructured. An image isn't random pixels; it's a precise 2-D matrix. Genomic variants have clear positional relationships. The real challenge isn't that biological data lacks structure. The issue is that we've lacked a data model flexible enough to capture its complexity efficiently.

At TileDB, we've solved this with multi-dimensional arrays that shape-shift to handle any data type — from tables to genomics to imaging — in a unified system with database-level performance. The implications? Researchers can finally ask cross-modal questions that were previously impossible, and drug discovery accelerates when data engineering stops being the bottleneck.

What's your experience with complex scientific data? Are you still fighting with multiple formats and tools? Read more about multi-dimensional arrays for multimodal multiomics data: https://lnkd.in/ebfihBUg #DataScience #LifeSciences #DrugDiscovery #MultimodalData #Multiomics
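The "no data is truly unstructured" point can be illustrated with plain NumPy arrays. This is only a sketch of the array mindset with invented shapes and values, not TileDB's actual API:

```python
import numpy as np

# A grayscale image is not "unstructured": it's a 2-D matrix
# indexed by (row, col).
image = np.zeros((512, 512), dtype=np.uint8)
image[100:110, 200:210] = 255          # a bright 10x10 patch

# Genomic variants likewise have positional structure: parallel
# arrays indexed by genomic coordinate.
positions = np.array([101, 2050, 7777])
genotypes = np.array([1, 0, 2])        # e.g. alt-allele counts

# The same slicing idiom answers queries over both modalities.
patch = image[95:115, 195:215]
in_region = genotypes[(positions >= 1000) & (positions <= 8000)]
```

Once both modalities live in one array model, a cross-modal question ("variants in this region, for samples whose images show this feature") becomes two slices instead of two export/import round trips between incompatible tools.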