rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

Author

Listed:
  • Lars Hahn
  • Chris-André Leimeister
  • Rachid Ounit
  • Stefano Lonardi
  • Burkhard Morgenstern

Abstract

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/

Author Summary: We propose a fast algorithm to generate spaced seeds for database searching, read mapping and alignment-free sequence comparison. Spaced seeds (i.e. patterns of match and don’t-care positions) are used by many algorithms for sequence analysis; designing optimal seeds is therefore an active field of research. In sequence-database searching, one wants to optimize sensitivity, i.e. the probability of finding a region of homology; this can be done by minimizing the so-called overlap complexity of pattern sets. In alignment-free DNA sequence comparison, the number N of pattern-based matches is used to estimate phylogenetic distances. Here, one wants to minimize the variance of N in order to obtain stable phylogenies. We show that for spaced seeds, the overlap complexity (and therefore the sensitivity in database searching) is closely related to the variance of N. Our algorithm can optimize the sensitivity, overlap complexity or the variance of N, depending on the application at hand.
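
The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the rasbhari implementation: all function names are hypothetical, the overlap complexity is written in the form commonly attributed to Ilie and Ilie (summing 2 to the power of the number of shared match positions over all pattern pairs and all shifts), which is an assumption here, and the search is a plain hill climb rather than the modified variant the authors describe. Patterns are strings in which '1' marks a match position and '0' a don't-care position.

```python
import random
from itertools import combinations_with_replacement

def spaced_word(seq, pos, pattern):
    """Characters of seq at the match ('1') positions of pattern, starting at pos."""
    return "".join(seq[pos + k] for k, c in enumerate(pattern) if c == "1")

def count_spaced_word_matches(s1, s2, pattern):
    """Number N of position pairs (i, j) whose spaced words agree under pattern."""
    span = len(pattern)
    counts = {}
    for i in range(len(s1) - span + 1):
        word = spaced_word(s1, i, pattern)
        counts[word] = counts.get(word, 0) + 1
    return sum(counts.get(spaced_word(s2, j, pattern), 0)
               for j in range(len(s2) - span + 1))

def shared_matches(p, q, shift):
    """Positions where p and q (q moved by `shift`) both have a '1'."""
    return sum(1 for k, c in enumerate(q)
               if c == "1" and 0 <= k + shift < len(p) and p[k + shift] == "1")

def overlap_complexity(patterns):
    """Assumed definition: sum of 2**shared_matches over all pairs
    (including each pattern with itself) and all overlapping shifts."""
    total = 0
    for p, q in combinations_with_replacement(patterns, 2):
        for shift in range(1 - len(q), len(p)):
            total += 2 ** shared_matches(p, q, shift)
    return total

def random_pattern(length, weight, rng):
    """Random pattern with `weight` match positions; both ends are '1'.
    Requires 2 <= weight <= length."""
    interior = ["1"] * (weight - 2) + ["0"] * (length - weight)
    rng.shuffle(interior)
    return "1" + "".join(interior) + "1"

def hill_climb(num_patterns, length, weight, steps=2000, seed=0):
    """Plain hill climbing: swap one '1' and one '0' inside a random pattern
    and keep the change only if the overlap complexity decreases."""
    rng = random.Random(seed)
    patterns = [random_pattern(length, weight, rng) for _ in range(num_patterns)]
    best = overlap_complexity(patterns)
    for _ in range(steps):
        idx = rng.randrange(num_patterns)
        p = list(patterns[idx])
        ones = [k for k in range(1, length - 1) if p[k] == "1"]
        zeros = [k for k in range(1, length - 1) if p[k] == "0"]
        if not ones or not zeros:
            break  # no interior swap is possible for this weight and length
        a, b = rng.choice(ones), rng.choice(zeros)
        p[a], p[b] = "0", "1"
        candidate = patterns[:idx] + ["".join(p)] + patterns[idx + 1:]
        score = overlap_complexity(candidate)
        if score < best:
            patterns, best = candidate, score
    return patterns, best
```

For example, `hill_climb(3, 12, 8)` returns three patterns of length 12 with 8 match positions each, together with their overlap complexity. Swapping `overlap_complexity` for an estimate of sensitivity or of the variance of N would give the other two optimization targets mentioned above; those objective functions are not sketched here.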

Suggested Citation

  • Lars Hahn & Chris-André Leimeister & Rachid Ounit & Stefano Lonardi & Burkhard Morgenstern, 2016. "rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison," PLOS Computational Biology, Public Library of Science, vol. 12(10), pages 1-18, October.
  • Handle: RePEc:plo:pcbi00:1005107
    DOI: 10.1371/journal.pcbi.1005107

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005107
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005107&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Stephen M Rumble & Phil Lacroute & Adrian V Dalca & Marc Fiume & Arend Sidow & Michael Brudno, 2009. "SHRiMP: Accurate Mapping of Short Color-space Reads," PLOS Computational Biology, Public Library of Science, vol. 5(5), pages 1-11, May.
    2. Nils Homer & Barry Merriman & Stanley F Nelson, 2009. "BFAST: An Alignment Tool for Large Scale Genome Resequencing," PLOS ONE, Public Library of Science, vol. 4(11), pages 1-12, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Meznah Almutairy & Eric Torng, 2017. "The effects of sampling on the efficiency and accuracy of k−mer indexes: Theoretical and empirical comparisons using the human genome," PLOS ONE, Public Library of Science, vol. 12(7), pages 1-23, July.
    2. Joshua C Bis & Anita DeStefano & Xiaoming Liu & Jennifer A Brody & Seung Hoan Choi & Benjamin F J Verhaaren & Stéphanie Debette & M Arfan Ikram & Eyal Shahar & Kenneth R Butler Jr & Rebecca F Gottesma, 2014. "Associations of NINJ2 Sequence Variants with Incident Ischemic Stroke in the Cohorts for Heart and Aging in Genomic Epidemiology (CHARGE) Consortium," PLOS ONE, Public Library of Science, vol. 9(6), pages 1-7, June.
    3. Afonso R. M. Almeida & João L. Neto & Ana Cachucho & Mayara Euzébio & Xiangyu Meng & Rathana Kim & Marta B. Fernandes & Beatriz Raposo & Mariana L. Oliveira & Daniel Ribeiro & Rita Fragoso & Priscila , 2021. "Interleukin-7 receptor α mutational activation can initiate precursor B-cell acute lymphoblastic leukemia," Nature Communications, Nature, vol. 12(1), pages 1-16, December.
    4. Zheng Sun & Weidong Tian, 2012. "SAP—A Sequence Mapping and Analyzing Program for Long Sequence Reads Alignment and Accurate Variants Discovery," PLOS ONE, Public Library of Science, vol. 7(8), pages 1-6, August.
    5. Swetansu Pattnaik & Srividya Vaidyanathan & Durgad G Pooja & Sa Deepak & Binay Panda, 2012. "Customisation of the Exome Data Analysis Pipeline Using a Combinatorial Approach," PLOS ONE, Public Library of Science, vol. 7(1), pages 1-9, January.
    6. Le’an Qu & Zhenjie Chen & Manchun Li, 2019. "CART-RF Classification with Multifilter for Monitoring Land Use Changes Based on MODIS Time-Series Data: A Case Study from Jiangsu Province, China," Sustainability, MDPI, vol. 11(20), pages 1-23, October.
    7. Francesca Cordero & Marco Beccuti & Maddalena Arigoni & Susanna Donatelli & Raffaele A Calogero, 2012. "Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis," PLOS ONE, Public Library of Science, vol. 7(2), pages 1-10, February.
