IDEAS home Printed from https://ideas.repec.org/a/plo/pone00/0068141.html

Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge

Author

Listed:
  • Sara Mostafavi
  • Alexis Battle
  • Xiaowei Zhu
  • Alexander E Urban
  • Douglas Levinson
  • Stephen B Montgomery
  • Daphne Koller

Abstract

Transcriptomic assays that measure expression levels are widely used to study the manifestation of environmental or genetic variations in cellular processes. RNA-sequencing in particular has the potential to considerably improve such understanding because of its capacity to assay the entire transcriptome, including novel transcriptional events. However, as with earlier expression assays, analysis of RNA-sequencing data requires carefully accounting for factors that may introduce systematic, confounding variability in the expression measurements, resulting in spurious correlations. Here, we consider the problem of modeling and removing the effects of known and hidden confounding factors from RNA-sequencing data. We describe a unified residual framework that encapsulates existing approaches, and using this framework, present a novel method, HCP (Hidden Covariates with Prior). HCP uses a more informed assumption about the confounding factors, and performs as well as or better than existing approaches while having a much lower computational cost. Our experiments demonstrate that accounting for known and hidden factors with appropriate models improves the quality of RNA-sequencing data in two very different tasks: detecting genetic variations that are associated with nearby expression variations (cis-eQTLs), and constructing accurate co-expression networks.
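The general idea of a residual approach can be illustrated with a minimal sketch. This is not the HCP algorithm, whose details the abstract does not give; it is a generic stand-in that regresses out known covariates by least squares and then removes the leading principal components of what remains as a proxy for hidden factors. The function name `residualize`, the parameter `n_hidden`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def residualize(expr, known, n_hidden=1):
    """Return expression residuals after removing known and hidden factors.

    expr  : (samples x genes) expression matrix
    known : (samples x covariates) matrix of known confounders
    n_hidden : number of principal components treated as hidden factors
               (a hypothetical tuning parameter, not taken from the paper)
    """
    expr = np.asarray(expr, dtype=float)
    known = np.asarray(known, dtype=float)

    # Center each gene so the intercept absorbs nothing unexpected.
    y = expr - expr.mean(axis=0)

    # Step 1: regress every gene on the known covariates (plus intercept).
    design = np.column_stack([np.ones(len(known)), known])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta

    # Step 2: treat the top principal components of the residuals as
    # hidden factors and subtract their contribution.
    if n_hidden > 0:
        u, s, vt = np.linalg.svd(resid, full_matrices=False)
        hidden = u[:, :n_hidden] * s[:n_hidden]
        resid = resid - hidden @ vt[:n_hidden]

    return resid


# Simulated example: one known batch covariate and one hidden factor.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 200
batch = rng.integers(0, 2, size=(n_samples, 1)).astype(float)
latent = rng.normal(size=(n_samples, 1))
signal = rng.normal(size=(n_samples, n_genes))
expr = (
    signal
    + 3.0 * batch @ rng.normal(size=(1, n_genes))
    + 3.0 * latent @ rng.normal(size=(1, n_genes))
)

clean = residualize(expr, batch, n_hidden=1)
```

In this sketch the residuals `clean` would then be passed to a downstream analysis such as eQTL testing or co-expression network construction. Removing principal components can also remove genuine biological signal, so the number of hidden factors is a judgment call.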

Suggested Citation

  • Sara Mostafavi & Alexis Battle & Xiaowei Zhu & Alexander E Urban & Douglas Levinson & Stephen B Montgomery & Daphne Koller, 2013. "Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-10, July.
  • Handle: RePEc:plo:pone00:0068141
    DOI: 10.1371/journal.pone.0068141

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068141
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0068141&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Nicoló Fusi & Oliver Stegle & Neil D Lawrence, 2012. "Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies," PLOS Computational Biology, Public Library of Science, vol. 8(1), pages 1-9, January.
    2. Oliver Stegle & Leopold Parts & Richard Durbin & John Winn, 2010. "A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies," PLOS Computational Biology, Public Library of Science, vol. 6(5), pages 1-11, May.
    3. Edward M. Marcotte & Matteo Pellegrini & Michael J. Thompson & Todd O. Yeates & David Eisenberg, 1999. "A combined algorithm for genome-wide prediction of protein function," Nature, Nature, vol. 402(6757), pages 83-86, November.
    4. Stephen B. Montgomery & Micha Sammeth & Maria Gutierrez-Arcelus & Radoslaw P. Lach & Catherine Ingle & James Nisbett & Roderic Guigo & Emmanouil T. Dermitzakis, 2010. "Transcriptome genetics using second generation sequencing in a Caucasian population," Nature, Nature, vol. 464(7289), pages 773-777, April.
    5. Barbara E Stranger & Stephen B Montgomery & Antigone S Dimas & Leopold Parts & Oliver Stegle & Catherine E Ingle & Magda Sekowska & George Davey Smith & David Evans & Maria Gutierrez-Arcelus & Alkes P, 2012. "Patterns of Cis Regulatory Variation in Diverse Human Populations," PLOS Genetics, Public Library of Science, vol. 8(4), pages 1-13, April.
    6. Jeffrey T Leek & John D Storey, 2007. "Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis," PLOS Genetics, Public Library of Science, vol. 3(9), pages 1-12, September.
    7. Leopold Parts & Oliver Stegle & John Winn & Richard Durbin, 2011. "Joint Genetic Analysis of Gene Expression Data with Inferred Cellular Phenotypes," PLOS Genetics, Public Library of Science, vol. 7(1), pages 1-10, January.
    8. Barbara E Engelhardt & Matthew Stephens, 2010. "Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis," PLOS Genetics, Public Library of Science, vol. 6(9), pages 1-12, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Federico Innocenti & Gregory M Cooper & Ian B Stanaway & Eric R Gamazon & Joshua D Smith & Snezana Mirkov & Jacqueline Ramirez & Wanqing Liu & Yvonne S Lin & Cliona Moloney & Shelly Force Aldred & Nat, 2011. "Identification, Replication, and Functional Fine-Mapping of Expression Quantitative Trait Loci in Primary Human Liver Tissue," PLOS Genetics, Public Library of Science, vol. 7(5), pages 1-16, May.
    2. Jin Hyun Ju & Sushila A Shenoy & Ronald G Crystal & Jason G Mezey, 2017. "An independent component analysis confounding factor correction framework for identifying broad impact expression quantitative trait loci," PLOS Computational Biology, Public Library of Science, vol. 13(5), pages 1-26, May.
    3. Chuan Gao & Ian C McDowell & Shiwen Zhao & Christopher D Brown & Barbara E Engelhardt, 2016. "Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering," PLOS Computational Biology, Public Library of Science, vol. 12(7), pages 1-39, July.
    4. Seong Kyu Han & Michelle T. McNulty & Christopher J. Benway & Pei Wen & Anya Greenberg & Ana C. Onuchic-Whitford & Dongkeun Jang & Jason Flannick & Noël P. Burtt & Parker C. Wilson & Benjamin D. Humph, 2023. "Mapping genomic regulation of kidney disease and traits through high-resolution and interpretable eQTLs," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
    5. Barbara E Stranger & Stephen B Montgomery & Antigone S Dimas & Leopold Parts & Oliver Stegle & Catherine E Ingle & Magda Sekowska & George Davey Smith & David Evans & Maria Gutierrez-Arcelus & Alkes P, 2012. "Patterns of Cis Regulatory Variation in Diverse Human Populations," PLOS Genetics, Public Library of Science, vol. 8(4), pages 1-13, April.
    6. Nicoló Fusi & Oliver Stegle & Neil D Lawrence, 2012. "Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies," PLOS Computational Biology, Public Library of Science, vol. 8(1), pages 1-9, January.
    7. Kaido Lepik & Tarmo Annilo & Viktorija Kukuškina & eQTLGen Consortium & Kai Kisand & Zoltán Kutalik & Pärt Peterson & Hedi Peterson, 2017. "C-reactive protein upregulates the whole blood expression of CD59 - an integrative analysis," PLOS Computational Biology, Public Library of Science, vol. 13(9), pages 1-20, September.
    8. Farnoosh Abbas-Aghababazadeh & Qian Li & Brooke L Fridley, 2018. "Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing," PLOS ONE, Public Library of Science, vol. 13(10), pages 1-21, October.
    9. Estavoyer, Maxime & François, Olivier, 2022. "Theoretical analysis of principal components in an umbrella model of intraspecific evolution," Theoretical Population Biology, Elsevier, vol. 148(C), pages 11-21.
    10. Marttinen Pekka & Gillberg Jussi & Havulinna Aki & Corander Jukka & Kaski Samuel, 2013. "Genome-wide association studies with high-dimensional phenotypes," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 12(4), pages 413-431, August.
    11. Arjun Bhattacharya & Anastasia N. Freedman & Vennela Avula & Rebeca Harris & Weifang Liu & Calvin Pan & Aldons J. Lusis & Robert M. Joseph & Lisa Smeester & Hadley J. Hartwell & Karl C. K. Kuban & Car, 2022. "Placental genomics mediates genetic associations with complex health traits and disease," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    12. repec:jss:jstsof:40:i14 is not listed on IDEAS
    13. Boca, Simina M. & Rosenberg, Noah A., 2011. "Mathematical properties of Fst between admixed populations and their parental source populations," Theoretical Population Biology, Elsevier, vol. 80(3), pages 208-216.
    14. Won Jun Lee & Sang Cheol Kim & Jung-Ho Yoon & Sang Jun Yoon & Johan Lim & You-Sun Kim & Sung Won Kwon & Jeong Hill Park, 2016. "Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis," PLOS ONE, Public Library of Science, vol. 11(2), pages 1-20, February.
    15. Jeanne C Latourelle & Alexandra Dumitriu & Tiffany C Hadzi & Thomas G Beach & Richard H Myers, 2012. "Evaluation of Parkinson Disease Risk Variants as Expression-QTLs," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-7, October.
    16. Emanuele Aliverti & Kristian Lum & James E. Johndrow & David B. Dunson, 2021. "Removing the influence of group variables in high‐dimensional predictive modelling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(3), pages 791-811, July.
    17. Marron, J.S., 2017. "Big Data in context and robustness against heterogeneity," Econometrics and Statistics, Elsevier, vol. 2(C), pages 73-80.
    18. Seungchul Baek & Yen‐Yi Ho & Yanyuan Ma, 2020. "Using sufficient direction factor model to analyze latent activities associated with breast cancer survival," Biometrics, The International Biometric Society, vol. 76(4), pages 1340-1350, December.
    19. Saedis Saevarsdottir & Kristbjörg Bjarnadottir & Thorsteinn Markusson & Jonas Berglund & Thorunn A. Olafsdottir & Gisli H. Halldorsson & Gudrun Rutsdottir & Kristbjorg Gunnarsdottir & Asgeir Orn Arnth, 2024. "Start codon variant in LAG3 is associated with decreased LAG-3 expression and increased risk of autoimmune thyroid disease," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    20. Griffin, Maryclare & Hoff, Peter D., 2019. "Lasso ANOVA decompositions for matrix and tensor data," Computational Statistics & Data Analysis, Elsevier, vol. 137(C), pages 181-194.
    21. Nuno A Fonseca & John Marioni & Alvis Brazma, 2014. "RNA-Seq Gene Profiling - A Systematic Empirical Comparison," PLOS ONE, Public Library of Science, vol. 9(9), pages 1-10, September.
