Author
Listed:
- Qingyu Chen
(Yale University
National Institutes of Health)
- Yan Hu
(University of Texas Health Science Center at Houston)
- Xueqing Peng
(Yale University)
- Qianqian Xie
(Yale University)
- Qiao Jin
(National Institutes of Health)
- Aidan Gilson
(Yale University)
- Maxwell B. Singer
(Yale University)
- Xuguang Ai
(Yale University)
- Po-Ting Lai
(National Institutes of Health)
- Zhizheng Wang
(National Institutes of Health)
- Vipina K. Keloth
(Yale University)
- Kalpana Raja
(Yale University)
- Jimin Huang
(Yale University)
- Huan He
(Yale University)
- Fongci Lin
(Yale University)
- Jingcheng Du
(University of Texas Health Science Center at Houston)
- Rui Zhang
(University of Minnesota)
- W. Jim Zheng
(University of Texas Health Science Center at Houston)
- Ron A. Adelman
(Yale University)
- Zhiyong Lu
(National Institutes of Health)
- Hua Xu
(Yale University)
Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
Suggested Citation
Qingyu Chen & Yan Hu & Xueqing Peng & Qianqian Xie & Qiao Jin & Aidan Gilson & Maxwell B. Singer & Xuguang Ai & Po-Ting Lai & Zhizheng Wang & Vipina K. Keloth & Kalpana Raja & Jimin Huang & Huan He & Fongci Lin & Jingcheng Du & Rui Zhang & W. Jim Zheng & Ron A. Adelman & Zhiyong Lu & Hua Xu, 2025.
"Benchmarking large language models for biomedical natural language processing applications and recommendations,"
Nature Communications, Nature, vol. 16(1), pages 1-16, December.
Handle:
RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-56989-2
DOI: 10.1038/s41467-025-56989-2
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-56989-2. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help add them by using this form.
If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing (email available below). General contact details of provider: http://www.nature.com.
Please note that corrections may take a couple of weeks to filter through the various RePEc services.