Topic Modeling in Genomic Data Analysis

A topic model is a statistical model for discovering the abstract topics that occur in a collection of documents. Topic modeling traces back to latent semantic indexing (LSI). Strictly speaking, however, LSI is not a topic model, because it is not probabilistic. Probabilistic latent semantic analysis (PLSA) was built on LSI as a genuine topic model, and latent Dirichlet allocation (LDA) was later published as a generalization of PLSA.

In data science, topic modeling has been applied to many different tasks, including the clustering and classification of text data. With some modifications, it becomes even more powerful when the text comes with associated non-text data. For instance, contextual probabilistic latent semantic analysis (CPLSA) incorporates context information into the traditional PLSA algorithm, which lets it answer some very interesting questions, such as: how did our opinions of electric cars change after the big announcement of Tesla's Cybertruck?

In recent years, extracting hidden knowledge and relations from biological data has become a major challenge, especially given the exponential growth in the volume of these data, driven mostly by microarray and next-generation sequencing (NGS) assays. As an effective method for discovering useful structure in large collections, topic modeling has been adopted by a growing number of researchers to address this challenge.

Clustering genomic data is probably the most direct application of topic modeling. For example, correspondence LDA (Corr-LDA) was modified to identify functional microRNA regulatory modules (FMRMs) in expression microarray profiles of microRNAs and mRNAs. With some adaptations, topic modeling can also be used for genomic data classification, even though traditional topic models are unsupervised algorithms. For instance, when true taxonomic labels were included in the training set, LDA could be used to classify 16S rRNA gene sequences from the Ribosomal Database Project (RDP) repository. Topic modeling can also be used for feature extraction: the features produced by a topic model serve as inputs to other algorithms for further analysis. This strategy has been used to assign metagenomic reads to different species. In that case, the hidden "topics" learned from the original sequencing data were fed into SKWIC, a variant of the classical K-means algorithm, for clustering.
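To make the feature-extraction idea more concrete, here is a minimal sketch in Python with scikit-learn. It rests on several assumptions of mine rather than on the studies mentioned above: each sequence is treated as a "document" whose "words" are overlapping k-mers, the toy sequences are made up for illustration, and plain K-means stands in for SKWIC, which scikit-learn does not provide. The function names are hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def kmer_document(sequence, k=3):
    """Turn a DNA sequence into a space-separated 'document' of overlapping k-mers."""
    sequence = sequence.upper()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def topic_features(sequences, k=3, n_topics=2, seed=0):
    """Return a (n_sequences, n_topics) matrix of per-sequence topic proportions."""
    # Fixed vocabulary of all 4**k k-mers; k-mers with other letters (e.g. N) are ignored.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    vectorizer = CountVectorizer(
        vocabulary=vocabulary, lowercase=False, token_pattern=r"\S+"
    )
    counts = vectorizer.fit_transform([kmer_document(s, k) for s in sequences])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # fit_transform returns the document-topic distribution, one row per sequence.
    return lda.fit_transform(counts)


# Made-up toy reads (assumption): three AT-rich and three GC-rich sequences.
reads = [
    "ATATTAATTATAATATTTAA",
    "TTAATATATAATTTATATAT",
    "AATTATATTTAATATAATTA",
    "GCGGCCGCGCCGGCGCGGCC",
    "CCGGCGCGGCCGCGCCGGCG",
    "GGCCGCGCGGCGCCGGCGCC",
]

features = topic_features(reads, k=3, n_topics=2)

# Plain K-means is used here as a stand-in for SKWIC (assumption).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for read, topic_mix, label in zip(reads, features, labels):
    print(read, np.round(topic_mix, 2), label)
```

The point of the sketch is the shape of the pipeline: the topic proportions replace the raw k-mer counts as a much lower-dimensional representation, and any downstream clustering or classification algorithm can consume them. Real metagenomic data would need a larger k, many more reads, and a tuned number of topics.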

As discussed above, topic modeling can support clustering, classification, and feature extraction in genomic data analysis. I believe that the application of topic models to genomics is just beginning, and that it will soon be embraced by more and more scientists.

More articles by Puzhou Wang
