NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents

Author

Listed:
  • Sophia S Liu
  • Adam J Hockenberry
  • Andrea Lancichinetti
  • Michael C Jewett
  • Luís A N Amaral

Abstract

The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to generate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have implemented as a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.

Author Summary: The generation of random sequences is instrumental to the accurate identification of non-random motifs within genomes, yet there are currently no tools available that allow users to simultaneously specify amino acid and GC composition to create random coding sequences. Here, we develop an algorithm based on maximum entropy that consistently generates fully random nucleotide sequences with the desired amino acid composition and GC content.
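
The maximum-entropy idea in the abstract can be illustrated with a minimal sketch. It is not the NullSeq package's API or the authors' implementation; every name below (`codon_probs`, `solve_lambda`, `random_cds`, and so on) is hypothetical. The sketch assumes the standard genetic code, a fixed protein sequence, and a single constraint on the expected GC fraction. Under those assumptions the maximum-entropy distribution over synonymous codons takes the exponential form p(codon) ∝ exp(λ · GC(codon)), and λ is found by bisection so that the expected GC fraction matches the target. Individual sampled sequences will scatter around the target rather than hit it exactly.

```python
import math
import random
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def gc_count(seq):
    """Number of G or C bases in a nucleotide string."""
    return sum(base in "GC" for base in seq)


def codon_probs(aa, lam):
    """Maximum-entropy codon probabilities for one amino acid.

    Each synonymous codon gets weight exp(lam * GC count); lam = 0 is uniform.
    """
    codons = SYNONYMS[aa]
    weights = [math.exp(lam * gc_count(c)) for c in codons]
    total = sum(weights)
    return codons, [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence for `protein` at a given lam."""
    gc = 0.0
    for aa in protein:
        codons, probs = codon_probs(aa, lam)
        gc += sum(p * gc_count(c) for c, p in zip(codons, probs))
    return gc / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, steps=100):
    """Bisection on lam; expected_gc is non-decreasing in lam."""
    if not expected_gc(protein, lo) <= target_gc <= expected_gc(protein, hi):
        raise ValueError("target GC content is not reachable for this protein")
    for _ in range(steps):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_cds(protein, lam, rng=random):
    """Sample one coding sequence, drawing each codon independently."""
    picks = []
    for aa in protein:
        codons, probs = codon_probs(aa, lam)
        picks.append(rng.choices(codons, weights=probs, k=1)[0])
    return "".join(picks)


def translate(seq):
    """Translate a coding sequence back to protein (sanity check)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))
```

A typical use would be to call `solve_lambda(protein, 0.5)` once for a given protein and then draw as many sequences as needed with `random_cds(protein, lam)`. Solving for λ separately from sampling keeps repeated draws cheap. Constraints on individual nucleotides or di-nucleotides, which the abstract mentions as possible extensions, would need additional multipliers and are not covered by this sketch.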

Suggested Citation

  • Sophia S Liu & Adam J Hockenberry & Andrea Lancichinetti & Michael C Jewett & Luís A N Amaral, 2016. "NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents," PLOS Computational Biology, Public Library of Science, vol. 12(11), pages 1-12, November.
  • Handle: RePEc:plo:pcbi00:1005184
    DOI: 10.1371/journal.pcbi.1005184

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005184
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005184&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Erik Aurell, 2016. "The Maximum Entropy Fallacy Redux?," PLOS Computational Biology, Public Library of Science, vol. 12(5), pages 1-7, May.
