Printed from https://ideas.repec.org/a/plo/pcbi00/1003737.html

A Scalable and Accurate Targeted Gene Assembly Tool (SAT-Assembler) for Next-Generation Sequencing Data

Author

Listed:
  • Yuan Zhang
  • Yanni Sun
  • James R Cole

Abstract

Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have a high memory footprint. In this work, we introduce a targeted gene assembly program, SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and a lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.

Author Summary: Next-generation sequencing (NGS) provides an efficient and affordable way to sequence the genomes or transcriptomes of a large number of organisms. With the fast accumulation of sequencing data from various NGS projects, the bottleneck is to efficiently mine useful knowledge from the data. As NGS platforms usually generate short and fragmented sequences (reads), one key step in annotating NGS data is to assemble short reads into longer contigs, which are then used to recover functional elements such as protein-coding genes. Short read assembly remains one of the most difficult computational problems in genomics. In particular, the performance of existing assembly tools is not satisfactory on complicated NGS data sets. They cannot reliably separate genes of high similarity or recover under-represented genes, and they incur high computational time and memory usage. Hence, we propose a targeted gene assembly tool, SAT-Assembler, to assemble genes of interest directly from NGS data with low memory usage and high accuracy. Our experimental results on a transcriptomic data set and two microbial community data sets showed that SAT-Assembler used less memory and recovered more target genes with better accuracy than existing tools.
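
To make the idea of homology-guided overlap graph construction and graph traversal more concrete, the following is a minimal Python sketch. It is not SAT-Assembler's actual algorithm or code. It assumes, purely for illustration, that a homology search has already assigned each read a start position on the gene family model (the hypothetical `start` value), that edges are kept only between reads whose suffix-prefix overlap meets a hypothetical `min_overlap` threshold and whose positions are in increasing order, and that contigs are built by a simple greedy walk. All function names and the toy reads are invented for this example.

```python
from collections import defaultdict


def overlap_len(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of `a` that equals a prefix of `b`; 0 if shorter than min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Build a directed overlap graph over (sequence, start) pairs.

    `start` is an assumed alignment position on the family model.
    An edge i -> j is added only if read j starts after read i on the
    model and the two reads overlap by at least `min_overlap` bases.
    """
    graph = defaultdict(dict)
    for i, (seq_a, start_a) in enumerate(reads):
        for j, (seq_b, start_b) in enumerate(reads):
            if i == j or start_b <= start_a:
                continue  # position constraint from the homology search
            k = overlap_len(seq_a, seq_b, min_overlap)
            if k:
                graph[i][j] = k
    return graph


def greedy_contig(reads, graph, start_node):
    """Walk the graph from `start_node`, always following the largest overlap."""
    contig = reads[start_node][0]
    visited = {start_node}
    node = start_node
    while True:
        candidates = {j: k for j, k in graph.get(node, {}).items() if j not in visited}
        if not candidates:
            return contig
        nxt = max(candidates, key=candidates.get)
        contig += reads[nxt][0][candidates[nxt]:]  # append the non-overlapping tail
        visited.add(nxt)
        node = nxt


# Toy reads with made-up alignment start positions.
reads = [("ATGGCC", 0), ("GCCTTA", 3), ("TTAGGA", 6)]
graph = build_overlap_graph(reads, min_overlap=3)
print(greedy_contig(reads, graph, 0))  # prints ATGGCCTTAGGA
```

The position constraint is the part that stands in for "homology guidance" here: it discards overlaps between reads that the family model places in the wrong order, which is one plausible way such guidance could reduce spurious edges. A real tool would need a more careful traversal than this greedy walk.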

Suggested Citation

  • Yuan Zhang & Yanni Sun & James R Cole, 2014. "A Scalable and Accurate Targeted Gene Assembly Tool (SAT-Assembler) for Next-Generation Sequencing Data," PLOS Computational Biology, Public Library of Science, vol. 10(8), pages 1-16, August.
  • Handle: RePEc:plo:pcbi00:1003737
    DOI: 10.1371/journal.pcbi.1003737

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003737
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003737&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1003737?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Daniel H Huson & Sina Beier & Isabell Flade & Anna Górska & Mohamed El-Hadidi & Suparna Mitra & Hans-Joachim Ruscheweyh & Rewati Tappu, 2016. "MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data," PLOS Computational Biology, Public Library of Science, vol. 12(6), pages 1-12, June.
    2. M. E. Marushchak & J. Kerttula & K. Diáková & A. Faguet & J. Gil & G. Grosse & C. Knoblauch & N. Lashchinskiy & P. J. Martikainen & A. Morgenstern & M. Nykamb & J. G. Ronkainen & H. M. P. Siljanen & L, 2021. "Thawing Yedoma permafrost is a neglected nitrous oxide source," Nature Communications, Nature, vol. 12(1), pages 1-10, December.
    3. Dazhi Jiao & Yuzhen Ye & Haixu Tang, 2013. "Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences," PLOS Computational Biology, Public Library of Science, vol. 9(3), pages 1-11, March.
    4. Xie, Xian-Hua & Yu, Zu-Guo & Ma, Yuan-Lin & Han, Guo-Sheng & Anh, Vo, 2017. "A novel genome signature based on inter-nucleotide distances profiles for visualization of metagenomic data," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 482(C), pages 87-94.
    5. Sofia Rigou & Sébastien Santini & Chantal Abergel & Jean-Michel Claverie & Matthieu Legendre, 2022. "Past and present giant viruses diversity explored through permafrost metagenomics," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    6. Xiaoqian Li & Jianwei Xing & Shouji Pang & Youhai Zhu & Shuai Zhang & Rui Xiao & Cheng Lu, 2022. "Carbon Isotopic Evidence for Gas Hydrate Release and Its Significance on Seasonal Wetland Methane Emission in the Muli Permafrost of the Qinghai-Tibet Plateau," IJERPH, MDPI, vol. 19(4), pages 1-14, February.
    7. Anzhou Ma & Jiejie Zhang & Guohua Liu & Xuliang Zhuang & Guoqiang Zhuang, 2022. "Cryosphere Microbiome Biobanks for Mountain Glaciers in China," Sustainability, MDPI, vol. 14(5), pages 1-18, March.

