
Journal of Data Science


Integrative Clustering Analysis with Application in Multi-Source Gene Expression Data
Volume 20, Issue 1 (2022), pp. 14–33
Liuqing Yang, Qing Pan, Yunpeng Zhao

https://doi.org/10.6339/21-JDS1028
Pub. online: 9 November 2021      Type: Statistical Data Science      Open Access

Received: 6 April 2021
Accepted: 12 October 2021
Published: 9 November 2021

Abstract

In omics studies, different sources of information about the same set of genes are often available. When the group structure within the genes (e.g., gene pathways) is of interest, we combine the normal hierarchical model with the stochastic block model, through an integrative clustering framework, to model gene expression and gene networks jointly. In extensive simulation studies, the integrative framework provides higher accuracy when one or both of the data sources contain noise or when different data sources provide complementary information. An empirical guideline for choosing between integrative and separate clustering models is proposed. The integrative clustering method is illustrated on mouse embryo single cell RNAseq and bulk cell microarray data, where it identifies not only the gene sets shared by both data sources but also the gene sets unique to one data source.
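To make the idea of modeling expression and network data jointly more concrete, the following is a minimal toy sketch, not the authors' implementation (their code is in the repository listed under Supplementary material). It assumes, purely for illustration, that each gene has a single scalar expression value drawn from a group-specific normal distribution, that edges follow a stochastic block model with the same latent group labels, and that all parameters are known. The function names `simulate` and `joint_loglik` and every parameter value are hypothetical.

```python
import numpy as np


def simulate(n=60, k=2, p_in=0.6, p_out=0.05, sep=3.0, seed=0):
    """Draw shared labels, Gaussian expression values, and an SBM adjacency matrix."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=n)            # shared latent group label per gene
    mu = sep * np.arange(k)                   # hypothetical group means
    x = rng.normal(mu[z], 1.0)                # one expression value per gene (assumption)
    B = np.full((k, k), p_out)
    np.fill_diagonal(B, p_in)                 # block connection probabilities
    upper = np.triu(rng.random((n, n)) < B[z][:, z], 1)
    A = (upper | upper.T).astype(int)         # symmetric adjacency, no self-loops
    return z, x, A, mu, B


def joint_loglik(z, x, A, mu, B, sigma=1.0):
    """Complete-data log-likelihood: Gaussian expression term plus Bernoulli edge term."""
    gauss = (-0.5 * np.sum(((x - mu[z]) / sigma) ** 2)
             - x.size * np.log(sigma * np.sqrt(2.0 * np.pi)))
    P = B[z][:, z]                            # P[i, j] = B[z_i, z_j]
    iu = np.triu_indices(len(z), 1)           # count each unordered pair once
    edges = np.sum(A[iu] * np.log(P[iu]) + (1 - A[iu]) * np.log1p(-P[iu]))
    return gauss + edges


if __name__ == "__main__":
    z, x, A, mu, B = simulate()
    z_shuffled = np.random.default_rng(1).permutation(z)
    print("true labels:    ", joint_loglik(z, x, A, mu, B))
    print("shuffled labels:", joint_loglik(z_shuffled, x, A, mu, B))
```

Because both terms depend on the same label vector, a labeling that fits only one data source is penalized by the other. In this simplified setting, that is how two sources can complement each other. An EM-type procedure, as indicated by the article's keywords, would additionally estimate the labels and parameters rather than take them as given.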

Supplementary material

Code for the integrative analysis and the data used in the real data analysis are available at https://github.com/yangliuqing1992/Integrative_clustering.



Copyright
2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
EM algorithm; empirical guidelines; microarray data; normal hierarchical model; single cell RNAseq; stochastic block model

Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
