Research from Harvard & MIT used AI to unlock molecular insights in cancer pathology.

Foundation models are revolutionizing computational pathology, but most struggle to analyze entire whole-slide images (WSIs) or incorporate molecular data. THREADS introduces a multimodal foundation model that learns from both histopathology slides and molecular profiles.

• Pretrained on 47,171 H&E-stained WSIs with matched genomic and transcriptomic profiles, the largest dataset of its kind.
• Enabled state-of-the-art survival prediction, identifying high-risk patients with up to 8.9% higher accuracy than previous models.
• Excelled in low-data scenarios, achieving near-clinical accuracy with just 4 training samples per class.
• Introduced "molecular prompting", allowing the model to classify cancer types and mutations without task-specific training.

I like that the architecture of THREADS is notably modular. It begins with an ROI encoder based on CONCHv1.5 (a ViT-L model fine-tuned with vision-language data) to extract patch features. The patch features are then aggregated into a slide-level embedding by an attention-based multiple instance learning (ABMIL) slide encoder. In parallel, distinct encoders for transcriptomic data (a modified scGPT) and genomic data (a multilayer perceptron) create molecular embeddings. This design not only enables integration of heterogeneous data types but also achieves remarkable parameter efficiency: THREADS is reported to be 4x smaller than PRISM and 7.5x smaller than GigaPath, yet it outperforms them on 54 oncology tasks.

Here's the awesome work: https://lnkd.in/g5y5HFuV Congrats to Faisal Mahmood, Anurag Vaidya, Andrew Zhang, Guillaume Jaume, and co! I post my takes on the latest developments in health AI; connect with me to stay updated! Also, check out my health AI blog here: https://lnkd.in/g3nrQFxW
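The ABMIL aggregation step can be sketched in a few lines of plain Python. This is a schematic of softmax attention pooling over patch embeddings using random stand-in weights, not the THREADS implementation (which learns the scoring parameters end to end):

```python
import math
import random

random.seed(0)

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def abmil_pool(patch_embeddings, w):
    """Attention-based MIL pooling: score each patch, softmax the scores,
    and return the attention-weighted mean as the slide-level embedding."""
    # One scalar attention score per patch (dot product with a weight
    # vector; here w is a random stand-in for learned parameters).
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in patch_embeddings]
    attn = softmax(scores)
    dim = len(patch_embeddings[0])
    slide = [sum(a * x[d] for a, x in zip(attn, patch_embeddings))
             for d in range(dim)]
    return slide, attn

# Toy example: 4 patches with 3-dimensional embeddings.
patches = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
w = [random.gauss(0, 1) for _ in range(3)]
slide_embedding, attention = abmil_pool(patches, w)
```

In the real model the embeddings come from the CONCH ROI encoder and the scoring function is a small learned network, but the aggregation follows this pattern of softmax-weighted averaging.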
Developments in Genomic Data Mining
Summary
Genomic data mining refers to using advanced computational methods to analyze large sets of genetic information, helping scientists uncover patterns and relationships that drive disease, development, and treatment. Recent developments in this field are making it easier to find hidden genetic signals, link gene expression changes to health conditions, and make genetic data more accessible for researchers everywhere.
- Embrace open data: Tap into newly standardized and accessible genomic datasets to accelerate your research and answer big questions about genetic influences on disease.
- Explore new analysis methods: Try innovative techniques, like multimodal AI models or nucleotide dependency maps, to reveal complex genetic interactions and pinpoint functional elements within DNA.
- Integrate diverse data: Combine genomic information with other biological or medical data, such as neuroimaging or pathology slides, to gain a more complete picture of how genes impact health and development.
BioSkryb Genomics customers just unveiled a mind-blowing hypothesis: that we can find new therapies by studying how our own body's cells mutate in response to their environment.

What if your liver quietly figured out how to fight fatty liver disease, and left a genetic record of exactly how it did it? That's the frontier opened by a landmark review just published in Cell by researchers from the Wellcome Sanger Institute, UT Southwestern Medical Center, and Quotient Therapeutics, a company co-founded by some of the world's leading somatic genomics scientists.

The insight is elegant and profound: our bodies are running continuous evolutionary experiments. Every organ accumulates somatic mutations across a lifetime. Most are silent. But some, selected by disease, inflammation, diet, or toxin exposure, expand into clones because they confer a survival advantage. The liver of a patient with metabolic fatty liver disease, for example, may harbor thousands of hepatocytes carrying loss-of-function mutations in CIDEB or GPAM, genes that, when disrupted, reduce lipid accumulation. The disease created the selective pressure. The genome responded. Nature revealed the drug target.

The authors propose a four-step framework to systematically mine this signal: select cells of interest, sequence them with high-accuracy methods, decipher which genes are under selection using dN/dS analysis, and validate phenotypic impact. It's germline genetics, but inverted: instead of asking what variants people are born with, you ask what variants their tissues chose under pressure.

Crucially, this framework requires single-cell resolution sequencing with minimal amplification error. Technologies like BioSkryb Genomics' Primary Template-directed Amplification (PTA), cited in the paper as an enabling single-cell sequencing approach, are what make this level of discovery possible.
The therapeutic implications span liver disease, epilepsy, autoimmunity, cardiovascular disease, and cancer immunotherapy. Your body has been running the clinical trial. We're just learning how to read the results. Citation: Brunner SF, Martincorena I, Mannino G, Fox CS, Stratton MR, Rubens JR, Campbell PJ, and Zhu H. "Somatic genomics as a discovery engine for biomedicine." Cell 189(5): 1269–1286. March 5, 2026. DOI: 10.1016/j.cell.2026.01.032.
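The dN/dS step of that framework has a simple core idea that a toy calculation makes concrete: compare the per-site rate of protein-altering (nonsynonymous) mutations to the per-site rate of silent (synonymous) ones. The gene counts and site totals below are invented for illustration; production tools such as dNdScv additionally model trinucleotide mutation context and report confidence intervals:

```python
def dn_ds(nonsyn_obs, syn_obs, nonsyn_sites, syn_sites):
    """Crude dN/dS: nonsynonymous substitutions per nonsynonymous site,
    divided by synonymous substitutions per synonymous site.
    Values > 1 suggest positive selection on protein-altering mutations."""
    dn = nonsyn_obs / nonsyn_sites
    ds = syn_obs / syn_sites
    return dn / ds

# Hypothetical gene under positive selection in diseased tissue:
# 30 nonsynonymous mutations over 900 nonsynonymous sites,
# 5 synonymous mutations over 300 synonymous sites.
ratio = dn_ds(30, 5, 900, 300)   # (30/900) / (5/300) = 2.0
```

A ratio of 2.0 here means protein-altering mutations are accumulating at twice the rate expected under neutrality, which is the signal the framework uses to nominate genes like CIDEB or GPAM as selection targets.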
-
NotebookLM: "...[described is] a novel technique called nucleotide dependency analysis to enhance the interpretability of genomic language models (gLMs) and detect functional elements within DNA sequences. By quantifying how a single nucleotide substitution affects the predicted probability of another nucleotide, this method effectively uncovers functional relationships that existing gLM reconstruction methods often miss. The researchers demonstrate that these dependencies are superior at indicating the deleteriousness of genetic variants and can accurately map diverse genomic features, including regulatory motifs, interactions between distal elements like splice sites, and complex RNA secondary and tertiary structures, including pseudoknots, all in an alignment-free manner. Ultimately, dependency maps serve as a powerful new tool for dissecting the regulatory code and diagnosing the limitations of different gLM architectures and training data choices." From the source: "...we introduced nucleotide dependencies that quantify how nucleotide substitutions at one genomic position affect the likelihood of nucleotides at another position. This new metric appears as a general and effective approach to identifying functionally related nucleotides using gLMs. Nucleotide dependency maps reveal functional elements across various biological processes, including transcriptional, post-transcriptional regulatory elements, their interactions and RNA folding. Therefore, this new metric has implications across multiple areas of computational and genome biology." https://lnkd.in/ebVkQHp8
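The metric itself can be sketched with a toy stand-in for a gLM. Here `predict_probs` is an invented deterministic function (not a real genomic language model) that couples position j to position 0, so the dependency score is large between coupled positions and zero elsewhere, mirroring how dependency maps highlight functionally related nucleotides:

```python
import math

BASES = "ACGT"

def predict_probs(seq, j):
    """Toy stand-in for a genomic language model: a probability
    distribution over bases at position j given the sequence. This toy
    makes position j prefer the complement of position 0, so positions
    0 and j are 'functionally' coupled."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}[seq[0]]
    return {b: (0.7 if b == comp else 0.1) for b in BASES}

def dependency(seq, i, j):
    """Max absolute change in log-probability at position j over all
    single-base substitutions at position i (schematic dependency score)."""
    ref = predict_probs(seq, j)
    best = 0.0
    for b in BASES:
        if b == seq[i]:
            continue
        mutated = seq[:i] + b + seq[i + 1:]
        alt = predict_probs(mutated, j)
        for base in BASES:
            best = max(best, abs(math.log(alt[base]) - math.log(ref[base])))
    return best

seq = "ACGTACGT"
coupled = dependency(seq, 0, 5)    # substitutions at 0 shift the model's view of 5
uncoupled = dependency(seq, 3, 5)  # position 3 has no influence in this toy model
```

Computing this score for all position pairs yields the dependency map; in the paper this is done with real transformer gLMs rather than a hand-coded rule.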
-
Genomics just did what radiology has failed to do in forty years: make its data open, standardized, and accessible in one line of code.

OpenMed released 1.14 billion rows of psychiatric genetics data on Hugging Face. Every GWAS ever published by the Psychiatric Genomics Consortium: 52 studies, 12 conditions, including ADHD, depression, schizophrenia, bipolar disorder, PTSD, OCD, autism, anxiety, Tourette syndrome, and eating disorders. Standardized. Parquet-formatted. CC BY 4.0. One line of Python.

Previously, this meant hunting FTP servers, parsing inconsistent formats, and spending more time on data engineering than science. Now a PhD student with a laptop can query the genetic architecture of psychiatric comorbidity in minutes.

Each row is a single variant-phenotype association test: variant ID, genomic location, effect size, p-value, allele frequency, sample sizes. A typical GWAS tests 7-15 million variants. Fifty-two studies with multiple ancestry groups. 1.14 billion rows.

Why does this matter beyond psychiatry? I work in diagnostic radiology. We still debate DICOM interoperability, a standard from the 1980s. Our imaging data lives in proprietary PACS silos. AI tools cannot share training data across institutions. Meanwhile, genomics just put its multi-decade evidence base on a platform where any researcher can access it with a single API call.

The gap is cultural. Radiology generates millions of studies daily. But we treat data as a liability to lock down, not an asset to open up. What OpenMed shows is that open data infrastructure does not require new science. It requires the decision to standardize what already exists and make it accessible. The PGC generated this data over years. OpenMed made it usable in days.

Shared genetic architecture across conditions, such as depression with anxiety or ADHD with autism, is the genomic version of the multi-system diagnostic problem I wrote about last week.
And ancestry-stratified data enabling research beyond European populations mirrors what radiology needs: AI models that work across demographics. This is what happens when a field decides open data is a feature, not a risk. When will diagnostic imaging make the same decision? #OpenData #Genomics #Psychiatry #MedicalAI #Radiology #DataScience #DiagnosticMedicine
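To make the "each row is one association test" point concrete, here is a minimal sketch of the kind of first query a researcher might run: filtering summary-statistic rows for genome-wide significance. The rows and column names below are invented stand-ins, not OpenMed's actual schema:

```python
# Synthetic GWAS summary-statistic rows mimicking the described layout:
# variant ID, genomic location, effect size (beta), p-value.
rows = [
    {"variant": "rs0001", "chrom": "6",  "pos": 28_000_000, "beta": 0.11,  "p": 3e-12},
    {"variant": "rs0002", "chrom": "11", "pos": 66_000_000, "beta": 0.04,  "p": 2e-3},
    {"variant": "rs0003", "chrom": "3",  "pos": 52_000_000, "beta": -0.09, "p": 4e-9},
]

GWS = 5e-8  # conventional genome-wide significance threshold

# Keep genome-wide significant hits, strongest association first.
hits = sorted((r for r in rows if r["p"] < GWS), key=lambda r: r["p"])
```

Against the real release, the same filter would run over the Parquet files after loading them with pandas or the Hugging Face `datasets` library; the point of the standardization is that this step is now trivial.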
-
A new article [in the comments] leverages computational methods to integrate high-dimensional genomic and neuroimaging data to uncover the developmental role of regional gene expression differences in the human cortex and their association with neurodevelopmental disorders like autism spectrum disorder (ASD) and schizophrenia (SCZ). The study explores how cortical gene expression dynamics during different developmental stages correlate with the structural and functional organization of the human brain, and how these patterns might deviate in neurodevelopmental disorders. Using a computational framework, the study analyzes gene expression data from the Allen Human Brain Atlas in conjunction with neuroimaging data and other genomic datasets like PsychENCODE. Data analytics and dimension reduction methods (e.g., PCA, DME) are employed to identify robust patterns. Findings: [1] The analysis highlights three major transcriptional components (C1, C2, C3) that correspond to different aspects of cerebral function and linkage to disorders. C1 is associated with neuron-specific patterns, C2 with metabolic processes, and C3 with synaptic planning and immune responses. [2] These components show distinct temporal patterns across fetal to adolescent brain development, with implications for understanding the evolution of cortical functions. [3] C1 and C2 show a strong correlation with ASD across multiple data modalities, whereas C3 is more closely associated with SCZ. This highlights how different developmental trajectories and gene expression disruptions can relate to specific clinical outcomes. Implications for Computational Psychiatry: [1] The research demonstrates the utility of integrating genomic, transcriptomic, and neuroimaging data in a computational framework to study complex brain disorders, providing a more comprehensive understanding of the underpinnings of these conditions. 
[2] The identified gene expression components could further be utilized to develop predictive models for identifying individuals at high risk for these disorders based on their cortical gene expression patterns. [3] Understanding specific gene-environment interactions that lead to disorder-specific deviations from normal cortical development might open up new avenues for targeted therapeutic interventions. Conclusion: The study effectively uses computational tools to link high-dimensional biological data with brain organization and disorder phenotypes and makes a significant contribution by providing insights into the molecular mechanisms contributing to neurodevelopmental disorders. This computational approach not only uncovers the intricate gene expression dynamics that shape the human cortex but also illustrates how deviations from these normative patterns are associated with clinical conditions, thus offering new pathways for diagnosis and treatment.
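The dimension-reduction step (e.g., PCA) boils down to finding the dominant axes of covariation in a regions-by-genes expression matrix; each transcriptional component (C1, C2, C3) is such an axis. A stdlib-only sketch using power iteration on the covariance matrix, with an invented toy matrix standing in for Allen Human Brain Atlas data:

```python
import math

def first_pc(X, iters=200):
    """First principal component of a samples-by-features matrix X via
    power iteration on the covariance matrix (schematic PCA, stdlib only)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[row[j] - means[j] for j in range(d)] for row in X]  # center columns
    # Covariance matrix (d x d).
    cov = [[sum(C[k][i] * C[k][j] for k in range(n)) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy "expression matrix": 4 cortical regions x 3 genes,
# where genes 0 and 1 covary strongly and gene 2 barely varies.
X = [[1.0, 1.1, 0.0],
     [2.0, 2.1, 0.1],
     [3.0, 2.9, 0.0],
     [4.0, 4.2, 0.1]]
pc1 = first_pc(X)  # loads heavily on genes 0 and 1, weakly on gene 2
```

In the study the same idea operates on thousands of genes across cortical regions, and the resulting component loadings are what get correlated with developmental stage and disorder phenotypes.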
-
Massive Genomic Study Reshapes Our View of Breast Cancer Progression and Treatment

Groundbreaking new research from Samsung Medical Center, Sungkyunkwan University School of Medicine (Seoul, Republic of Korea), analyzing whole-genome sequences (WGS) from 1,364 clinically annotated breast cancers, significantly advances our understanding of the disease's genomic landscape and its profound link to patient outcomes. This large-scale study, which included deep transcriptome data, delivered several high-impact findings:

• Expanded Driver Atlas: The team identified a broader range of oncogenic alterations, including novel driver genes, recurrent gene fusions, and structural variants, expanding the known repertoire of breast cancer mechanisms.
• Decades-Long Timeline: Timing analysis of copy number alterations revealed that genomic instability emerges decades before a tumor is clinically diagnosed. This offers crucial new insights into the extremely early initiation of tumorigenesis, opening doors for prevention strategies.
• Predictive Biomarker Power: Pattern-driven genomic features, including mutational signatures, homologous recombination deficiency (HRD), and tumor heterogeneity scores, were strongly associated with clinical outcomes.

The Clinical Takeaway: These findings highlight the potential for using comprehensive WGS data to develop predictive biomarkers that can better guide therapeutic decisions for individual patients, particularly concerning the use of:
• CDK4/6 inhibitors
• HER2 inhibitors
• Adjuvant and neoadjuvant chemotherapy

This research underscores the power of large-scale, clinically integrated whole-genome sequencing to translate complex genomic data into actionable insights, ultimately driving personalized and improved patient care. #BreastCancer #Genomics #Oncology #PrecisionMedicine #WGS #Biomarkers Figure Courtesy: Nature
-
Illumina has cracked a code that's been frustrating rare disease researchers for decades. Their new PromoterAI algorithm can finally interpret the 98% of the human genome that everyone's been ignoring: the noncoding regions where regulatory variants hide.

Here's why this matters: only 30% of rare disease patients get an accurate diagnosis from exome sequencing. The other 70%? Their answers might be buried in promoter regions that control gene expression but have been impossible to decode at scale. Until now.

Published in Science, PromoterAI discovered regulatory variants that contribute up to 6% of rare disease causes. When combined with Illumina's other AI tools (SpliceAI and PrimateAI-3D), they're doubling diagnostic yield compared to traditional approaches.

This is more than a technical breakthrough. It's potentially life-changing for families who've spent years searching for answers. The bigger story? Illumina keeps building an AI ecosystem that turns genomic data into actionable insights. They're not just selling sequencers anymore; they're becoming the intelligence layer for precision medicine. Every rare disease diagnosis that was previously impossible just became possible.
-
Last week I stood in front of 50 bioinformaticians and ran a pharmacogenomics analysis in under one second. No cloud. No data leaving the room. Within 24 hours, a researcher I'd never met submitted a pull request adding a nutrigenomics skill I hadn't planned. That's how ClawBio started. The problem: general-purpose AI is powerful but blind to biology. It hallucinates star allele calls. It uses outdated CPIC guidelines. And you can't send patient genomes to a cloud API. ClawBio fixes this. It's a skill library that gives AI agents real bioinformatics expertise — pharmacogenomics, equity scoring, metagenomics, nutrigenomics — all running locally on your machine. What we shipped in one week: - 7 production skills (PharmGx, Equity Scorer, NutriGx, Metagenomics, and more) - 57 automated tests, CI on 3 Python versions - 1 community contribution merged in 24 hours - Published on ClawHub registry What I learned: 1. Methodology before code — a detailed spec is itself useful 2. Local-first isn't a limitation, it's the moat 3. One unsolicited PR proves architecture more than any benchmark 4. Tests are trust signals — it's why I merged fast 8 more skills are waiting for contributors: VCF annotation, scRNA-seq, protein structure, lit synthesis. If you work with genomic data and want to build: github.com/ClawBio/ClawBio MIT licensed. Every analysis ships with a reproducibility bundle. #Bioinformatics #AI #Genomics #OpenSource #Pharmacogenomics #HealthEquity
-
We’ve stopped asking how AI can make us more efficient and started asking what becomes possible at a scale that was previously unachievable. With data, models, and compute accelerating, the new constraint is no longer access, but our ability to reason across it all. That’s why, at the AstraZeneca Centre for Genomics Research, we’re pioneering agentic systems that navigate massive datasets, interact with tools, and surface mechanistic hypotheses where literature is sparse, giving our teams a clearer, faster path to high‑value targets. It’s early, and we’re learning. In the article below, I explain why this shift from high-throughput discovery to scalable biological reasoning feels anything but incremental.
-
Hello everyone, it’s been a while since I’ve been active here and on my YouTube channel. Over the past few months, I’ve embarked on a steep learning curve while transitioning to a new team at Arcus, the group building institution-wide omics resources at the Children’s Hospital of Philadelphia (CHOP). The research-lab skills I brought with me needed a significant upgrade to tackle projects at this scale. Here’s what I’ve learned in the last six months:

1. Cloud-Native Data Platforms
Working at an institutional level means handling massive datasets that require standardized processing. This has given me hands-on experience with several AWS services, including Amazon S3 (storage), AWS HealthOmics (genomic data processing), Athena (a serverless SQL query engine, which also sharpened my SQL skills), AWS Lambda (event-driven serverless computing), EventBridge (to build event-triggered pipelines), and ECR (to host and version custom Docker container images).

2. Advanced Workflow Orchestration
I shifted from traditional HPC schedulers to Kubernetes-based systems:
- Learning Argo Workflows to run container-native pipelines
- Writing portable, reproducible pipelines in Nextflow and WDL
This change has improved resource utilization, portability, and collaboration across teams.

3. New Cutting-Edge Genomics Platforms
I am currently exploring Illumina Connected Analytics (ICA), a secure cloud-based bioinformatics platform, and have leveraged DRAGEN secondary analysis pipelines to accelerate data processing by up to 10x, ensure consistency, and maintain compliance at scale.

4. Project Management with Agile & Scrum
- Transitioning to an Agile framework with Scrum methodology has transformed how I approach projects.
- Sprint planning, backlog refinement, and retrospectives have provided structured ways to evaluate progress and identify improvement opportunities.
- This systematic approach has enhanced both individual and team productivity by creating clear timelines and accountability.

5. Leadership & Collaboration
Serving principal investigators across CHOP, I’ve led stakeholder meetings to understand project needs and delivered harmonized, standardized data products. This exposure to diverse projects has broadened my understanding of different research areas and strengthened my ability to translate technical capabilities into meaningful scientific contributions.

The importance of continuously updating skills, knowledge, and experience across different platforms, technologies, and methods cannot be overstated in today's rapidly evolving technical landscape. I still have a long way to go in mastering these skills, but as they say, getting comfortable with being uncomfortable is the best way to grow. Honestly, the chance to learn new things really excites me! #Bioinformatics #Genomics #CloudComputing #Agile #Leadership #AWS #Kubernetes
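The event-driven pattern described above (an S3 upload fires an EventBridge rule, which invokes a Lambda that kicks off processing) can be sketched as a minimal handler. The bucket name, key filter, and "launch the pipeline" decision here are hypothetical stand-ins; a real handler would start a HealthOmics or Argo run via boto3 at the marked point:

```python
import json

def handler(event, context):
    """Minimal AWS Lambda handler for an EventBridge rule matching
    S3 'Object Created' events: extract the uploaded object's location
    and decide whether it should trigger a genomics pipeline."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    # Only FASTQ uploads should launch the (hypothetical) pipeline;
    # a real handler would start it here, e.g. via boto3/HealthOmics.
    should_run = key.endswith((".fastq.gz", ".fq.gz"))
    return {
        "statusCode": 200,
        "body": json.dumps({"bucket": bucket, "key": key, "run": should_run}),
    }

# Synthetic EventBridge S3 event for local testing
# (field layout follows the S3-to-EventBridge event structure).
event = {"detail": {"bucket": {"name": "omics-landing"},
                    "object": {"key": "samples/s1.fastq.gz"}}}
result = handler(event, None)
```

Keeping the handler this thin (parse event, decide, delegate) is what makes the pipeline testable locally with a synthetic event, as shown, before wiring it to the real EventBridge rule.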