Topic Modeling in Genomic Data Analysis

Puzhou Wang

Published Nov 30, 2019

A topic model is a type of statistical model for discovering the abstract topics that occur in a collection of documents. Topic modeling traces its origin to latent semantic indexing (LSI). However, LSI is not strictly a topic model, because it is not probabilistic. Probabilistic latent semantic analysis (PLSA) was developed from LSI as a genuine topic model, and latent Dirichlet allocation (LDA) was later published as a generalization of PLSA.

In data science, topic modeling has been applied to many different tasks. It can be used for clustering and classification of text data. Moreover, when associated non-text data are available, a suitably modified topic model can become very powerful. For instance, contextual probabilistic latent semantic analysis (CPLSA) incorporates context information into the traditional PLSA algorithm, which enables it to answer some very interesting questions, such as how opinions about electric cars changed after the announcement of Tesla's Cybertruck.

In recent years, extracting hidden knowledge and relations from biological data has become a great challenge, especially given the exponential growth in the amount of such data, driven largely by microarray and next-generation sequencing (NGS) assays. As an effective method for discovering useful structure in collections, topic modeling is being used by more and more researchers to meet this challenge.

Genomic data clustering is probably the most direct application of topic modeling. For example, correspondence LDA (Corr-LDA) was modified to identify functional microRNA regulatory modules (FMRMs) in expression microarray profiles of microRNAs and mRNAs. With some adaptations, topic modeling can also be used for genomic data classification, even though traditional topic models are unsupervised algorithms. For instance, when true taxonomic labels were included in the training set, LDA could be used to classify 16S DNA sequences from the Ribosomal Database Project (RDP) repository. Topic modeling can also be used for feature extraction in genomic data: the features produced by a topic model can serve as inputs to other algorithms for further analysis. This strategy has been used to assign metagenomic reads to different species. In that case, the hidden "topics" learned from the original sequencing data were fed into SKWIC, a variant of the classical K-means algorithm, for clustering.
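The feature-extraction strategy described above can be illustrated with a minimal Python sketch. It is not the pipeline from the cited work: it assumes each read is treated as a "document" whose "words" are overlapping k-mers, it uses scikit-learn's LatentDirichletAllocation for the topic model, and it substitutes plain KMeans for SKWIC, which scikit-learn does not provide. The helper names (kmer_counts, topic_features), the toy reads, and all parameter values are hypothetical choices made for illustration.

```python
import random
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation


def kmer_counts(seqs, k=3):
    """Build a read-by-k-mer count matrix; each k-mer acts as a 'word'."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros((len(seqs), len(vocab)), dtype=int)
    for row, seq in enumerate(seqs):
        for start in range(len(seq) - k + 1):
            col = vocab.get(seq[start:start + k])
            if col is not None:  # skip k-mers with ambiguous bases such as N
                counts[row, col] += 1
    return counts


def topic_features(seqs, n_topics=2, k=3, seed=0):
    """Return one topic-proportion vector per read (hypothetical helper)."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(kmer_counts(seqs, k))


# Toy data (made up): five AT-only reads and five GC-only reads.
gen = random.Random(0)
reads = ["".join(gen.choices("AT", k=60)) for _ in range(5)]
reads += ["".join(gen.choices("GC", k=60)) for _ in range(5)]

# Topic proportions become the features for a downstream clusterer.
topics = topic_features(reads)
# Plain K-means is a stand-in here; the work described above used SKWIC.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topics)
print(labels)
```

Real reads would call for a larger k, many more topics, and some handling of reverse complements, none of which this sketch attempts.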

As discussed above, topic modeling can accomplish many tasks in genomic data analysis. I believe that the application of topic models to genomics is just beginning and will soon be embraced by more and more scientists.
