Genotype imputation using the Positional Burrows Wheeler Transform

Authors

  • Simone Rubinacci
  • Olivier Delaneau
  • Jonathan Marchini

Abstract

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100-fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.

Author summary: Genome-wide association studies (GWAS) typically use microarray technology to measure genotypes at several hundred thousand positions in the genome. However, reference panels of genetic variation consist of haplotype data at more than 100-fold more positions in the genome. Genotype imputation makes genotype predictions at all the reference panel sites using the GWAS data. Reference panels are continuing to grow in size and this improves accuracy of the predictions, but methods need to be able to scale to this increased size. We have developed a new version of the popular IMPUTE software that can handle reference panels with millions of haplotypes, and has better performance than other published approaches. A notable property of the new method is that it scales sub-linearly with reference panel size. Keeping the number of imputed markers constant, a 100-fold increase in reference panel size requires less than twice the computation time.
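
The haplotype-selection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the IMPUTE5 implementation: the function name `select_states`, the `n_neighbours` parameter, and the shortcut of appending the target haplotype to the panel before sorting are all assumptions made for illustration. The sketch keeps only the PBWT prefix ordering (haplotypes sorted by their reversed prefixes, updated site by site with a stable partition) and omits the divergence arrays that a full PBWT would use to measure the length of identical-by-state segments.

```python
from typing import List, Set


def select_states(reference: List[List[int]],
                  target: List[int],
                  n_neighbours: int = 1) -> Set[int]:
    """Collect reference haplotypes that are PBWT neighbours of the target.

    Hypothetical helper, for illustration only. Haplotypes are lists of
    0/1 alleles at the genotyped markers, all of the same length.
    """
    # Assumption: the target is sorted together with the reference panel.
    panel = reference + [target]
    target_id = len(reference)
    order = list(range(len(panel)))  # prefix ordering before any site is read
    selected: Set[int] = set()

    for site in range(len(target)):
        # Stable partition by the allele at this site: the result is the
        # ordering of haplotypes by reversed prefix ending at this site.
        zeros = [h for h in order if panel[h][site] == 0]
        ones = [h for h in order if panel[h][site] == 1]
        order = zeros + ones

        # Haplotypes adjacent to the target share the longest matches
        # ending at this site among all haplotypes in the panel.
        pos = order.index(target_id)
        lo = max(0, pos - n_neighbours)
        hi = min(len(order), pos + n_neighbours + 1)
        for h in order[lo:hi]:
            # Keep only neighbours that actually match at the current site.
            if h != target_id and panel[h][site] == target[site]:
                selected.add(h)

    return selected


reference_panel = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]
target_haplotype = [0, 1, 0, 1]
states = select_states(reference_panel, target_haplotype, n_neighbours=1)
print(sorted(states))
```

In this toy panel the identical haplotype (index 2) is selected at every site, the all-ones haplotype (index 1) is picked up where it matches the target locally, and the all-zeros haplotype (index 0) is never adjacent to the target, so the union of selected states is smaller than the panel. A production method would avoid the repeated list scans used here, which are kept only for readability.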

Suggested Citation

  • Simone Rubinacci & Olivier Delaneau & Jonathan Marchini, 2020. "Genotype imputation using the Positional Burrows Wheeler Transform," PLOS Genetics, Public Library of Science, vol. 16(11), pages 1-19, November.
  • Handle: RePEc:plo:pgen00:1009049
    DOI: 10.1371/journal.pgen.1009049

    Download full text from publisher

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009049
    Download Restriction: no

    File URL: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1009049&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Robin J. Hofmeister & Simone Rubinacci & Diogo M. Ribeiro & Alfonso Buil & Zoltán Kutalik & Olivier Delaneau, 2022. "Parent-of-Origin inference for biobanks," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    2. Xinkai Tong & Dong Chen & Jianchao Hu & Shiyao Lin & Ziqi Ling & Huashui Ai & Zhiyan Zhang & Lusheng Huang, 2023. "Accurate haplotype construction and detection of selection signatures enabled by high quality pig genome sequences," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    3. Seppe Goovaerts & Hanne Hoskens & Ryan J. Eller & Noah Herrick & Anthony M. Musolf & Cristina M. Justice & Meng Yuan & Sahin Naqvi & Myoung Keun Lee & Dirk Vandermeulen & Heather L. Szabo-Rogers & Pau, 2023. "Joint multi-ancestry and admixed GWAS reveals the complex genetics behind human cranial vault shape," Nature Communications, Nature, vol. 14(1), pages 1-21, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sergio F. Nigenda-Morales & Meixi Lin & Paulina G. Nuñez-Valencia & Christopher C. Kyriazis & Annabel C. Beichman & Jacqueline A. Robinson & Aaron P. Ragsdale & Jorge Urbán R. & Frederick I. Archer & , 2023. "The genomic footprint of whaling and isolation in fin whale populations," Nature Communications, Nature, vol. 14(1), pages 1-18, December.
    2. Ralph, Peter L., 2019. "An empirical approach to demographic inference with genomic data," Theoretical Population Biology, Elsevier, vol. 127(C), pages 91-101.
    3. Zihao Wang & Wenxi Wang & Xiaoming Xie & Yongfa Wang & Zhengzhao Yang & Huiru Peng & Mingming Xin & Yingyin Yao & Zhaorong Hu & Jie Liu & Zhenqi Su & Chaojie Xie & Baoyun Li & Zhongfu Ni & Qixin Sun &, 2022. "Dispersed emergence and protracted domestication of polyploid wheat uncovered by mosaic ancestral haploblock inference," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    4. Vasili Pankratov & Milyausha Yunusbaeva & Sergei Ryakhovsky & Maksym Zarodniuk & Bayazit Yunusbayev, 2022. "Prioritizing autoimmunity risk variants for functional analyses by fine-mapping mutations under natural selection," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    5. Michael DeGiorgio & Zachary A Szpiech, 2022. "A spatially aware likelihood test to detect sweeps from haplotype distributions," PLOS Genetics, Public Library of Science, vol. 18(4), pages 1-37, April.
    6. Ali Mahmoudi & Jere Koskela & Jerome Kelleher & Yao-ban Chan & David Balding, 2022. "Bayesian inference of ancestral recombination graphs," PLOS Computational Biology, Public Library of Science, vol. 18(3), pages 1-15, March.
    7. Kerdoncuff, Elise & Lambert, Amaury & Achaz, Guillaume, 2020. "Testing for population decline using maximal linkage disequilibrium blocks," Theoretical Population Biology, Elsevier, vol. 134(C), pages 171-181.
    8. Parul Johri & Wolfgang Stephan & Jeffrey D Jensen, 2022. "Soft selective sweeps: Addressing new definitions, evaluating competing models, and interpreting empirical outliers," PLOS Genetics, Public Library of Science, vol. 18(2), pages 1-12, February.
    9. Andrea Fulgione & Célia Neto & Ahmed F. Elfarargi & Emmanuel Tergemina & Shifa Ansari & Mehmet Göktay & Herculano Dinis & Nina Döring & Pádraic J. Flood & Sofia Rodriguez-Pacheco & Nora Walden & Marcu, 2022. "Parallel reduction in flowering time from de novo mutations enable evolutionary rescue in colonizing lineages," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    10. Miró Pina, Verónica & Joly, Émilien & Siri-Jégousse, Arno, 2023. "Estimating the Lambda measure in multiple-merger coalescents," Theoretical Population Biology, Elsevier, vol. 154(C), pages 94-101.
    11. Sam Tallman & Maria das Dores Sungo & Sílvio Saranga & Sandra Beleza, 2023. "Whole genomes from Angola and Mozambique inform about the origins and dispersals of major African migrations," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    12. Victoria L. Sork & Shawn J. Cokus & Sorel T. Fitz-Gibbon & Aleksey V. Zimin & Daniela Puiu & Jesse A. Garcia & Paul F. Gugger & Claudia L. Henriquez & Ying Zhen & Kirk E. Lohmueller & Matteo Pellegrin, 2022. "High-quality genome and methylomes illustrate features underlying evolutionary success of oaks," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    13. Max Lundberg & Alexander Mackintosh & Anna Petri & Staffan Bensch, 2023. "Inversions maintain differences between migratory phenotypes of a songbird," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    14. Jerome Kelleher & Kevin R Thornton & Jaime Ashander & Peter L Ralph, 2018. "Efficient pedigree recording for fast population genetics simulation," PLOS Computational Biology, Public Library of Science, vol. 14(11), pages 1-21, November.
    15. Deng, Yun & Song, Yun S. & Nielsen, Rasmus, 2021. "The distribution of waiting distances in ancestral recombination graphs," Theoretical Population Biology, Elsevier, vol. 141(C), pages 34-43.
    16. Brieuc Lehmann & Maxine Mackintosh & Gil McVean & Chris Holmes, 2023. "Optimal strategies for learning multi-ancestry polygenic scores vary across traits," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
