Printed from https://ideas.repec.org/a/plo/pone00/0220182.html

rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments

Author

Listed:
  • Claudio Mirabello
  • Björn Wallner

Abstract

In recent decades, the bioinformatics community has made huge efforts to develop machine learning-based methods for predicting structural features of proteins, in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed interest in neural networks, with dozens of methods being developed to take advantage of these new architectures. However, most methods still rely heavily on pre-processing of the input data, as well as on the extraction and integration of multiple hand-picked and manually designed features. Multiple Sequence Alignments (MSA) are the most common source of information in de novo prediction methods. Deep Networks that automatically refine the MSA and extract useful features from it would be immensely powerful. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing: amino acid sequences are mapped into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering obsolete pre-calculated features such as sequence profiles and other features calculated from the MSA. We showcased the rawMSA methodology on three different prediction problems: secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on par with methods using more pre-calculated features in the inter-residue contact map prediction category in CASP12 and CASP13. These results demonstrate that rawMSA is a promising development that can pave the way, in the coming years, for improved methods that use rawMSA instead of sequence profiles to represent evolutionary information.

Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa.
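
The embedding idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the alphabet ordering, the padding index, the embedding dimension and all function names below are assumptions made for illustration, and the lookup table is randomly initialised here, whereas in a trained network its weights would be learned together with the rest of the model.

```python
import random

# Hypothetical alphabet: 20 amino acids plus a gap symbol. Index 0 is reserved
# for padding or unknown residues. The encoding used in the paper may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
TOKEN = {aa: i + 1 for i, aa in enumerate(ALPHABET)}


def encode_msa(msa):
    """Turn aligned sequences (equal-length strings) into rows of integer tokens."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [[TOKEN.get(aa, 0) for aa in seq.upper()] for seq in msa]


def init_embedding(vocab_size, dim, seed=0):
    """Random lookup table; a real network would learn these weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed_msa(tokens, table):
    """Replace every token with its vector: depth x length x dim."""
    return [[table[t] for t in row] for row in tokens]


# Toy alignment; the first row plays the role of the target sequence.
msa = ["MKTAYIAK", "MKSAY-AK", "MRTAYLAK"]
tokens = encode_msa(msa)
table = init_embedding(vocab_size=len(ALPHABET) + 1, dim=4)
embedded = embed_msa(tokens, table)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # prints: 3 8 4
```

The point of the sketch is that the alignment enters the model as raw residue tokens, one per alignment cell, rather than as a per-column summary such as a PSSM; any summarisation is left to the layers that follow the embedding.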

Suggested Citation

  • Claudio Mirabello & Björn Wallner, 2019. "rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments," PLOS ONE, Public Library of Science, vol. 14(8), pages 1-15, August.
  • Handle: RePEc:plo:pone00:0220182
    DOI: 10.1371/journal.pone.0220182

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220182
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0220182&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Sheng Wang & Siqi Sun & Zhen Li & Renyu Zhang & Jinbo Xu, 2017. "Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model," PLOS Computational Biology, Public Library of Science, vol. 13(1), pages 1-34, January.

    Citations


    Cited by:

    1. Nicolae Sapoval & Amirali Aghazadeh & Michael G. Nute & Dinler A. Antunes & Advait Balaji & Richard Baraniuk & C. J. Barberan & Ruth Dannenfelser & Chen Dun & Mohammadamin Edrisi & R. A. Leo Elworth &, 2022. "Current progress and open challenges for applying deep learning across the biosciences," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    2. Kabir, Md Wasi Ul & Hoque, Md Tamjidul, 2024. "DisPredict3.0: Prediction of intrinsically disordered regions/proteins using protein language model," Applied Mathematics and Computation, Elsevier, vol. 472(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Rui Fa & Domenico Cozzetto & Cen Wan & David T Jones, 2018. "Predicting human protein function with multi-task deep neural networks," PLOS ONE, Public Library of Science, vol. 13(6), pages 1-16, June.
    2. Peicong Lin & Yumeng Yan & Huanyu Tao & Sheng-You Huang, 2023. "Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
    3. Nicolae Sapoval & Amirali Aghazadeh & Michael G. Nute & Dinler A. Antunes & Advait Balaji & Richard Baraniuk & C. J. Barberan & Ruth Dannenfelser & Chen Dun & Mohammadamin Edrisi & R. A. Leo Elworth &, 2022. "Current progress and open challenges for applying deep learning across the biosciences," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    4. Rahmatullah Roche & Sutanu Bhattacharya & Debswapna Bhattacharya, 2021. "Hybridized distance- and contact-based hierarchical structure modeling for folding soluble and membrane proteins," PLOS Computational Biology, Public Library of Science, vol. 17(2), pages 1-31, February.
    5. Shuangxi Ji & Tuğçe Oruç & Liam Mead & Muhammad Fayyaz Rehman & Christopher Morton Thomas & Sam Butterworth & Peter James Winn, 2019. "DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure," PLOS ONE, Public Library of Science, vol. 14(1), pages 1-15, January.
    6. Juan A Morales-Cordovilla & Victoria Sanchez & Martin Ratajczak, 2018. "Protein alignment based on higher order conditional random fields for template-based modeling," PLOS ONE, Public Library of Science, vol. 13(6), pages 1-14, June.
    7. Shivangi & Laxman S Meena & Md Amjad Beg, 2018. "Insights of Rv2921c (Ftsy) Gene of Mycobacterium tuberculosis H37Rv To Prove Its Significance by Computational Approach," Biomedical Journal of Scientific & Technical Research, Biomedical Research Network+, LLC, vol. 12(2), pages 9147-9157, December.
    8. Yang Li & Chengxin Zhang & Eric W Bell & Wei Zheng & Xiaogen Zhou & Dong-Jun Yu & Yang Zhang, 2021. "Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks," PLOS Computational Biology, Public Library of Science, vol. 17(3), pages 1-19, March.
    9. Lei Wang & Jiangguo Zhang & Dali Wang & Chen Song, 2022. "Membrane contact probability: An essential and predictive character for the structural and functional studies of membrane proteins," PLOS Computational Biology, Public Library of Science, vol. 18(3), pages 1-27, March.
    10. Zhiye Guo & Jian Liu & Jeffrey Skolnick & Jianlin Cheng, 2022. "Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
