Integrative Clustering Analysis with Application in Multi-Source Gene Expression Data
Volume 20, Issue 1 (2022), pp. 14–33
Pub. online: 9 November 2021
Type: Statistical Data Science
Open Access
Received
6 April 2021
Accepted
12 October 2021
Published
9 November 2021
Abstract
In omics studies, different sources of information about the same set of genes are often available. When the group structure within the genes (e.g., gene pathways) is of interest, we combine the normal hierarchical model with the stochastic block model, through an integrative clustering framework, to model gene expression and gene networks jointly. In extensive simulation studies, the integrative framework provides higher accuracy when one or both of the data sources are noisy or when different data sources provide complementary information. An empirical guideline for choosing between integrative and separate clustering models is proposed. The integrative clustering method is illustrated on mouse embryo single-cell RNA-seq and bulk-cell microarray data, where it identifies not only the gene sets shared by both data sources but also the gene sets unique to one data source.
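To make the joint-modelling idea concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a single shared cluster label per gene drives both data sources: expression values are drawn around cluster-level means (a simple normal hierarchical structure), and network edges are drawn with a higher probability within clusters than between them (a simple stochastic block model). The function name `simulate_joint` and all parameter names and default values are hypothetical choices for illustration only.

```python
import numpy as np


def simulate_joint(n_genes=60, n_clusters=3, n_samples=10,
                   p_in=0.6, p_out=0.05, seed=0):
    """Simulate expression and network data sharing one latent clustering.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Shared latent cluster label for every gene.
    z = rng.integers(0, n_clusters, size=n_genes)

    # Normal hierarchical part: cluster means come from a normal prior,
    # and each gene's expression is normal around its cluster mean.
    mu = rng.normal(0.0, 2.0, size=(n_clusters, n_samples))
    expr = mu[z] + rng.normal(0.0, 1.0, size=(n_genes, n_samples))

    # Stochastic block model part: edge probability depends only on
    # whether the two genes share a cluster label.
    same = z[:, None] == z[None, :]
    prob = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n_genes, n_genes)) < prob, k=1)
    adj = (upper | upper.T).astype(int)  # symmetric, no self-loops

    return z, expr, adj


z, expr, adj = simulate_joint()
```

Because both `expr` and `adj` are generated from the same labels `z`, a clustering method that uses the two sources together can in principle borrow strength across them, which is the motivation the abstract describes.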
Supplementary material
Code for the integrative analysis and the data used in the real data analysis are available at https://github.com/yangliuqing1992/Integrative_clustering.