
Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

Authors

  • Paul J McMurdie
  • Susan Holmes

Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model, for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package phyloseq.

Author Summary

The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology have recently provided scientists with affordable and timely access to the genes and genomes of microbiomes that inhabit our planet and even our own bodies. In these investigations many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but often result in total numbers of sequences per sample that are vastly different. The common procedure for addressing this difference in sequencing effort across samples (that is, different library sizes) is to either (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy, that is, throw away sequences from the larger libraries so that all libraries have the same, smallest size. We show that both of these normalization methods can work when comparing obviously different whole microbiomes, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform much better and can be easily adapted from a separate biological sub-discipline, called RNA-Seq analysis.
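
The two normalization procedures described above can be illustrated with a minimal Python sketch. It is not taken from the paper or from phyloseq: the function names, taxon labels, and counts are hypothetical, and rarefying is assumed here to mean random subsampling without replacement down to the smallest library size. The toy data contain two libraries with the same composition but a 100-fold difference in sequencing depth.

```python
import random
from collections import Counter


def proportions(counts):
    """Convert raw counts to proportional abundances (library size is lost)."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one library."""
    if depth > sum(counts.values()):
        raise ValueError("depth exceeds library size")
    # Expand the count table into one entry per read, then draw reads at random.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    kept = Counter(rng.sample(pool, depth))
    return {taxon: kept.get(taxon, 0) for taxon in counts}


# Hypothetical toy data: same composition, very different library sizes.
samples = {
    "small": {"taxon_A": 60, "taxon_B": 30, "taxon_C": 10},
    "large": {"taxon_A": 6000, "taxon_B": 3000, "taxon_C": 1000},
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
min_depth = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, min_depth, rng) for name, c in samples.items()}

total_reads = sum(sum(c.values()) for c in samples.values())
discarded = total_reads - min_depth * len(samples)

print("proportions:", {name: proportions(c) for name, c in samples.items()})
print("rarefied:", rarefied)
print("reads discarded by rarefying:", discarded)
```

In this toy case the two libraries yield identical proportions, so the proportion table no longer records that one estimate rests on 100 reads and the other on 10,000. Rarefying equalizes the library sizes by discarding 9,900 of the 10,100 reads, which is the loss of information that the abstract argues against.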

Suggested Citation

  • Paul J McMurdie & Susan Holmes, 2014. "Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible," PLOS Computational Biology, Public Library of Science, vol. 10(4), pages 1-12, April.
  • Handle: RePEc:plo:pcbi00:1003531
    DOI: 10.1371/journal.pcbi.1003531

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003531&type=printable
    Download Restriction: no



