Printed from https://ideas.repec.org/a/plo/pcbi00/1006794.html

A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory

Author

Listed:
  • Rodoniki Athanasiadou
  • Benjamin Neymotin
  • Nathan Brandt
  • Wei Wang
  • Lionel Christiaen
  • David Gresham
  • Daniel Tranchina

Abstract

A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. As the number of analyzed experimental systems increases, however, independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to an absolute abundance by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets, one from the model eukaryote Saccharomyces cerevisiae proliferating at different growth rates, and one from differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study and in a k-fold cross-validation study.

Author summary: We present a complete statistical model for the analysis of RNA-seq data from a population of cells using external RNA spike-ins and a maximum-likelihood method for genome-wide estimation of transcripts per cell. The model includes biological variability of cellular transcript number and sampling noise. We derive an unbiased estimator of transcripts per cell for every transcript, given by simply multiplying the count by a library-dependent, but transcript-independent, scale factor. This is a nominal estimate that can be converted to an absolute estimate by dividing by the transcript’s relative yield coefficient, measured in a separate experiment. A negative binomial probability mass function with novel normalization (size) factors allows for parametric testing of hypotheses concerning dependence of the absolute abundance of each transcript on experimental condition. Our method integrates information from every RNA-seq experiment across all replicates and experimental conditions to determine the calibration constants. We test the method with a dilution study and a k-fold cross-validation study. We illustrate our method with applications to two independent data sets from yeast and the sea squirt that were derived by different library preparation protocols. We show that our methods detect genome-wide amplification of expression, and we compare our method to others.
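
The two-step conversion described in the summary (count to nominal transcripts per cell, then nominal to absolute) can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes, for illustration only, that each spike-in count is Poisson with mean proportional to the number of spike-in molecules added, which gives a simple closed-form scale factor, whereas the paper uses a fuller model with a negative binomial probability mass function. The function names and the numbers in the usage example are hypothetical.

```python
def scale_factor(spike_counts, spike_molecules, n_cells):
    """Library-dependent, transcript-independent factor (count -> nominal per cell).

    ASSUMPTION (not from the paper): count_j ~ Poisson(nu * molecules_j) for each
    spike-in j, so the maximum-likelihood estimate is nu = sum(counts) / sum(molecules).
    The factor returned is 1 / (nu * n_cells).
    """
    total_counts = sum(spike_counts)
    if total_counts <= 0 or n_cells <= 0:
        raise ValueError("need positive spike-in counts and a positive cell number")
    return sum(spike_molecules) / (total_counts * n_cells)


def nominal_abundance(count, factor):
    """Nominal transcripts per cell: the read count times the library scale factor."""
    return count * factor


def absolute_abundance(nominal, relative_yield):
    """Absolute transcripts per cell: nominal estimate divided by the transcript's
    yield coefficient relative to the reference spike-in."""
    if relative_yield <= 0:
        raise ValueError("relative yield coefficient must be positive")
    return nominal / relative_yield


# Hypothetical library: two spike-ins added to RNA pooled from one million cells.
factor = scale_factor([200, 800], [2_000_000, 8_000_000], 1_000_000)
nominal = nominal_abundance(500, factor)      # transcript observed with 500 reads
absolute = absolute_abundance(nominal, 0.5)   # assumed relative yield of 0.5
print(factor, nominal, absolute)
```

Because the factor depends only on the library, it is computed once per library and reused for every transcript; only the relative yield coefficient is transcript-specific.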

Suggested Citation

  • Rodoniki Athanasiadou & Benjamin Neymotin & Nathan Brandt & Wei Wang & Lionel Christiaen & David Gresham & Daniel Tranchina, 2019. "A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory," PLOS Computational Biology, Public Library of Science, vol. 15(3), pages 1-26, March.
  • Handle: RePEc:plo:pcbi00:1006794
    DOI: 10.1371/journal.pcbi.1006794

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006794
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006794&type=printable
    Download Restriction: no


    Citations


    Cited by:

    1. Hyeon-Jin Kim & Greg Booth & Lauren Saunders & Sanjay Srivatsan & José L. McFaline-Figueroa & Cole Trapnell, 2022. "Nuclear oligo hashing improves differential analysis of single-cell RNA-seq," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
