
ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest

Author

Listed:
  • Jiajin Li
  • Brandon Jew
  • Lingyu Zhan
  • Sungoo Hwang
  • Giovanni Coppola
  • Nelson B Freimer
  • Jae Hoon Sul

Abstract

Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses the information on sequencing quality, such as sequencing depth, genotyping quality, and GC contents, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.

Author summary: Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions, and deletions. Nowadays, next-generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole-genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate it. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
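
The combination of filtering and machine learning described in the abstract can be illustrated with a small sketch. This is not the ForestQC implementation: the `missing_rate` filter statistic, its thresholds, the toy data, and all function names are assumptions made for illustration, and a bagged ensemble of one-split decision stumps stands in for a real random forest so that the example needs only the Python standard library. The idea shown is that hard filters label the clear-cut variants, a model is trained on those labels using sequencing-quality features (depth, genotyping quality, GC content), and the model then decides the variants the filters left undetermined.

```python
import random

FEATURES = ("depth", "gq", "gc")  # sequencing-quality features named in the abstract


def hard_filter(variant, good_max=0.01, bad_min=0.10):
    """Label a variant from a hypothetical missing-genotype rate (illustrative thresholds)."""
    rate = variant["missing_rate"]
    if rate <= good_max:
        return "good"
    if rate >= bad_min:
        return "bad"
    return "undetermined"


def fit_stump(rows, labels, feature):
    """Pick the threshold and direction on one feature with the fewest training errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        for good_if_above in (True, False):
            errors = sum(
                ((r[feature] >= threshold) == good_if_above) != (y == "good")
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, good_if_above)
    return best[1:]


def fit_ensemble(rows, labels, n_stumps=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_stumps):
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.choice(FEATURES)
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return stumps


def predict(stumps, variant):
    """Majority vote over all stumps."""
    votes = sum((variant[f] >= t) == above for f, t, above in stumps)
    return "good" if 2 * votes > len(stumps) else "bad"


def forest_style_qc(variants):
    """Filter first, then let the trained ensemble decide the undetermined variants."""
    labels = [hard_filter(v) for v in variants]
    train = [(v, y) for v, y in zip(variants, labels) if y != "undetermined"]
    stumps = fit_ensemble([v for v, _ in train], [y for _, y in train])
    return [y if y != "undetermined" else predict(stumps, v) for v, y in zip(variants, labels)]


def make_variant(missing_rate, depth, gq, gc):
    return {"missing_rate": missing_rate, "depth": depth, "gq": gq, "gc": gc}


# Toy data (fabricated for illustration only): 4 clearly good, 4 clearly bad, 2 undetermined.
variants = [
    make_variant(0.00, 30, 80, 0.40), make_variant(0.00, 32, 85, 0.42),
    make_variant(0.00, 35, 90, 0.45), make_variant(0.00, 38, 95, 0.48),
    make_variant(0.20, 8, 20, 0.70), make_variant(0.20, 10, 25, 0.72),
    make_variant(0.20, 12, 30, 0.75), make_variant(0.20, 14, 40, 0.78),
    make_variant(0.05, 40, 99, 0.45),  # undetermined, with good-looking quality features
    make_variant(0.05, 5, 10, 0.80),   # undetermined, with bad-looking quality features
]
calls = forest_style_qc(variants)
print(calls)
```

Keeping the filter statistic separate from the model features is a deliberate choice in this sketch: the filter supplies training labels, while the model learns only from sequencing-quality information, so it can generalise to variants on which the filter is inconclusive.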

Suggested Citation

  • Jiajin Li & Brandon Jew & Lingyu Zhan & Sungoo Hwang & Giovanni Coppola & Nelson B Freimer & Jae Hoon Sul, 2019. "ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest," PLOS Computational Biology, Public Library of Science, vol. 15(12), pages 1-30, December.
  • Handle: RePEc:plo:pcbi00:1007556
    DOI: 10.1371/journal.pcbi.1007556

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007556
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1007556&type=printable
    Download Restriction: no



