Printed from https://ideas.repec.org/a/bla/biomet/v79y2023i2p866-877.html

Multisource single‐cell data integration by MAW barycenter for Gaussian mixture models

Author

Listed:
  • Lin Lin
  • Wei Shi
  • Jianbo Ye
  • Jia Li

Abstract

One key challenge encountered in single‐cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not be a GMM containing a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single‐cell RNA‐seq data sets than some other popular methods.
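
The abstract does not spell out the MAW distance, so the following is only a minimal sketch under an assumed definition: the squared MAW distance between two GMMs is taken to be the optimal-transport cost between their component weights, where the cost of matching two components is the closed-form squared 2-Wasserstein distance between the corresponding Gaussians. All function names (`psd_sqrt`, `gaussian_w2_squared`, `maw_squared`) are hypothetical and are not taken from the authors' software; the barycenter algorithm itself is not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog


def psd_sqrt(mat):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1) and N(m2, s2)."""
    root1 = psd_sqrt(s1)
    cross = psd_sqrt(root1 @ s2 @ root1)
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))


def maw_squared(w1, means1, covs1, w2, means2, covs2):
    """Assumed MAW form: transport cost between component weights.

    Returns the minimized aggregated cost and the optimal coupling matrix,
    whose rows sum to w1 and whose columns sum to w2.
    """
    k1, k2 = len(w1), len(w2)
    # Pairwise squared Wasserstein distances between Gaussian components.
    cost = np.array([
        [gaussian_w2_squared(means1[i], covs1[i], means2[j], covs2[j])
         for j in range(k2)]
        for i in range(k1)
    ])
    # Marginal constraints on the flattened (row-major) coupling.
    a_eq = np.zeros((k1 + k2, k1 * k2))
    for i in range(k1):
        a_eq[i, i * k2:(i + 1) * k2] = 1.0  # row i sums to w1[i]
    for j in range(k2):
        a_eq[k1 + j, j::k2] = 1.0  # column j sums to w2[j]
    b_eq = np.concatenate([w1, w2])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(k1, k2)


# Toy example: the same two clusters, listed in a different order per source.
eye = np.eye(2)
weights_a = np.array([0.5, 0.5])
means_a = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
weights_b = np.array([0.5, 0.5])
means_b = [np.array([4.0, 0.0]), np.array([0.0, 0.0])]
maw2, coupling = maw_squared(weights_a, means_a, [eye, eye],
                             weights_b, means_b, [eye, eye])
```

In this toy case the coupling matches each component to its counterpart in the other mixture, so the label ordering of clusters from different sources does not affect the distance. The linear program has one variable per pair of components, which is consistent with the abstract's remark that the cost depends on the number of mixture components rather than on the number of cells.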

Suggested Citation

  • Lin Lin & Wei Shi & Jianbo Ye & Jia Li, 2023. "Multisource single‐cell data integration by MAW barycenter for Gaussian mixture models," Biometrics, The International Biometric Society, vol. 79(2), pages 866-877, June.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:2:p:866-877
    DOI: 10.1111/biom.13630

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13630
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Meiling Zheng & Zhi Hu & Xiaole Mei & Lianlian Ouyang & Yang Song & Wenhui Zhou & Yi Kong & Ruifang Wu & Shijia Rao & Hai Long & Wei Shi & Hui Jing & Shuang Lu & Haijing Wu & Sujie Jia & Qianjin Lu & , 2022. "Single-cell sequencing shows cellular heterogeneity of cutaneous lesions in lupus erythematosus," Nature Communications, Nature, vol. 13(1), pages 1-17, December.
    2. Espen Bernton & Pierre E. Jacob & Mathieu Gerber & Christian P. Robert, 2019. "Approximate Bayesian computation with the Wasserstein distance," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 81(2), pages 235-269, April.
    3. Ajita Shree & Musale Krushna Pavan & Hamim Zafar, 2023. "scDREAMER for atlas-level integration of single-cell datasets using deep generative model paired with adversarial classifier," Nature Communications, Nature, vol. 14(1), pages 1-19, December.
    4. Mario Ghossoub & Jesse Hall & David Saunders, 2020. "Maximum Spectral Measures of Risk with given Risk Factor Marginal Distributions," Papers 2010.14673, arXiv.org.
    5. Tengyuan Liang, 2020. "Estimating Certain Integral Probability Metrics (IPMs) Is as Hard as Estimating under the IPMs," Working Papers 2020-153, Becker Friedman Institute for Research In Economics.
    6. del Barrio, Eustasio & Gordaliza, Paula & Lescornel, Hélène & Loubes, Jean-Michel, 2019. "Central limit theorem and bootstrap procedure for Wasserstein’s variations with an application to structural relationships between distributions," Journal of Multivariate Analysis, Elsevier, vol. 169(C), pages 341-362.
    7. Snehalika Lall & Sumanta Ray & Sanghamitra Bandyopadhyay, 2022. "A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data," PLOS Computational Biology, Public Library of Science, vol. 18(3), pages 1-16, March.
    8. Niu, Baozhuang & Chen, Yuyang & Zeng, Fanzhuo, 2023. "One step further for procurement cooperation: Will the industry leader benefit from its competitive manufacturer's joint determination of consumption quality?," European Journal of Operational Research, Elsevier, vol. 311(3), pages 989-1008.
    9. Valentin Hartmann & Dominic Schuhmacher, 2020. "Semi-discrete optimal transport: a solution procedure for the unsquared Euclidean distance case," Mathematical Methods of Operations Research, Springer;Gesellschaft für Operations Research (GOR);Nederlands Genootschap voor Besliskunde (NGB), vol. 92(1), pages 133-163, August.
    10. Murray Pollock & Paul Fearnhead & Adam M. Johansen & Gareth O. Roberts, 2020. "Quasi‐stationary Monte Carlo and the ScaLE algorithm," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(5), pages 1167-1221, December.
    11. Marcel Klatt & Axel Munk & Yoav Zemel, 2022. "Limit laws for empirical optimal solutions in random linear programs," Annals of Operations Research, Springer, vol. 315(1), pages 251-278, August.
    12. Qunlun Shen & Shihua Zhang, 2021. "Approximate distance correlation for selecting highly interrelated genes across datasets," PLOS Computational Biology, Public Library of Science, vol. 17(11), pages 1-18, November.
    13. Jinzhou Li & Marloes H. Maathuis, 2021. "GGM knockoff filter: False discovery rate control for Gaussian graphical models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(3), pages 534-558, July.
