Printed from https://ideas.repec.org/a/nat/natcom/v13y2022i1d10.1038_s41467-022-34435-x.html

Systematic tissue annotations of genomics samples by modeling unstructured metadata

Author

Listed:
  • Nathaniel T. Hawkins

    (Michigan State University)

  • Marc Maldaver

    (Michigan State University)

  • Anna Yannakopoulos

    (Michigan State University)

  • Lindsay A. Guare

    (Michigan State University)

  • Arjun Krishnan

    (Michigan State University
    University of Colorado Anschutz Medical Campus)

Abstract

There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
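
The general idea described in the abstract (turn each free-text sample description into a numerical vector, then train one supervised classifier per tissue term) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the txt2onto code. The TF-IDF bag-of-words representation, the plain logistic regression, every function name, and the toy sample descriptions below are assumptions made purely for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def vectorize(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Binary logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))
            err = p - y  # gradient of the log-loss with respect to the logit
            bias -= lr * err
            for t, v in vec.items():
                weights[t] -= lr * err * v
    return weights, bias

def predict_proba(text, idf, weights, bias):
    """Probability that a description belongs to the positive tissue class."""
    vec = vectorize(text, idf)
    return sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))

# Toy, made-up sample descriptions; label 1 = "liver", 0 = any other tissue.
train_texts = [
    "hepatocytes isolated from adult liver biopsy",
    "liver tissue from hepatocellular carcinoma patient",
    "primary hepatocyte culture, liver donor 3",
    "cortical neurons derived from frontal cortex",
    "whole blood sample, healthy donor",
    "peripheral blood mononuclear cells",
]
train_labels = [1, 1, 1, 0, 0, 0]

idf = fit_idf(train_texts)
train_vectors = [vectorize(t, idf) for t in train_texts]
weights, bias = train_logreg(train_vectors, train_labels)

p_liver = predict_proba("RNA-seq of liver biopsy", idf, weights, bias)
p_blood = predict_proba("peripheral blood cells", idf, weights, bias)
print(f"liver-like description: {p_liver:.2f}")
print(f"blood-like description: {p_blood:.2f}")
```

In this sketch one binary model stands for one tissue term; annotating samples with many tissues would mean training one such model per term. Because the features come only from text, the same model can score a description regardless of which genomics assay produced the sample, which is the property the abstract highlights.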

Suggested Citation

  • Nathaniel T. Hawkins & Marc Maldaver & Anna Yannakopoulos & Lindsay A. Guare & Arjun Krishnan, 2022. "Systematic tissue annotations of genomics samples by modeling unstructured metadata," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
  • Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-34435-x
    DOI: 10.1038/s41467-022-34435-x

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-022-34435-x
    File Function: Abstract
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marcin Pilarczyk & Mehdi Fazel-Najafabadi & Michal Kouril & Behrouz Shamsaei & Juozas Vasiliauskas & Wen Niu & Naim Mahi & Lixia Zhang & Nicholas A. Clark & Yan Ren & Shana White & Rashid Karim & Huan, 2022. "Connecting omics signatures and revealing biological mechanisms with iLINCS," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    2. Tine Claeys & Tim Van Den Bossche & Yasset Perez-Riverol & Kris Gevaert & Juan Antonio Vizcaíno & Lennart Martens, 2023. "lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation," Nature Communications, Nature, vol. 14(1), pages 1-4, December.
    3. Xiaoning Qi & Lianhe Zhao & Chenyu Tian & Yueyue Li & Zhen-Lin Chen & Peipei Huo & Runsheng Chen & Xiaodong Liu & Baoping Wan & Shengyong Yang & Yi Zhao, 2024. "Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery," Nature Communications, Nature, vol. 15(1), pages 1-19, December.
    4. Joshua Borycz & Robert Olendorf & Alison Specht & Bruce Grant & Kevin Crowston & Carol Tenopir & Suzie Allard & Natalie M. Rice & Rachael Hu & Robert J. Sandusky, 2023. "Perceived benefits of open data are improving but scientists still lack resources, skills, and rewards," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-12, December.
    5. Mohieddin Jafari & Mehdi Mirzaie & Jie Bao & Farnaz Barneh & Shuyu Zheng & Johanna Eriksson & Caroline A. Heckman & Jing Tang, 2022. "Bipartite network models to design combination therapies in acute myeloid leukaemia," Nature Communications, Nature, vol. 13(1), pages 1-12, December.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.