
Towards building multilingual language model for medicine

Authors

  • Pengcheng Qiu

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Chaoyi Wu

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Xiaoman Zhang

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Weixiong Lin

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Haicheng Wang

    (Shanghai Jiao Tong University)

  • Ya Zhang

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Yanfeng Wang

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

  • Weidi Xie

    (Shanghai Jiao Tong University
    Shanghai AI Laboratory)

Abstract

Open-source, multilingual medical language models can benefit a wide, linguistically diverse audience across different regions. To advance this area, we make three contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, which enables auto-regressive domain adaptation of general large language models (LLMs). Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source LLMs on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.

Suggested Citation

  • Pengcheng Qiu & Chaoyi Wu & Xiaoman Zhang & Weixiong Lin & Haicheng Wang & Ya Zhang & Yanfeng Wang & Weidi Xie, 2024. "Towards building multilingual language model for medicine," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-52417-z
    DOI: 10.1038/s41467-024-52417-z

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-52417-z
    File Function: Abstract
    Download Restriction: no


    References listed on IDEAS

    1. Michael Moor & Oishi Banerjee & Zahra Shakeri Hossein Abad & Harlan M. Krumholz & Jure Leskovec & Eric J. Topol & Pranav Rajpurkar, 2023. "Foundation models for generalist medical artificial intelligence," Nature, Nature, vol. 616(7956), pages 259-265, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Maksim Makarenko & Arturo Burguete-Lopez & Qizhou Wang & Silvio Giancola & Bernard Ghanem & Luca Passone & Andrea Fratalocchi, 2024. "Hardware-accelerated integrated optoelectronic platform towards real-time high-resolution hyperspectral video understanding," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    2. Junwei Cheng & Chaoran Huang & Jialong Zhang & Bo Wu & Wenkai Zhang & Xinyu Liu & Jiahui Zhang & Yiyi Tang & Hailong Zhou & Qiming Zhang & Min Gu & Jianji Dong & Xinliang Zhang, 2024. "Multimodal deep learning using on-chip diffractive optics with in situ training capability," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    3. Soroosh Tayebi Arasteh & Tianyu Han & Mahshad Lotfinia & Christiane Kuhl & Jakob Nikolas Kather & Daniel Truhn & Sven Nebelung, 2024. "Large language models streamline automated machine learning for clinical studies," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    4. Weijian Huang & Cheng Li & Hong-Yu Zhou & Hao Yang & Jiarun Liu & Yong Liang & Hairong Zheng & Shaoting Zhang & Shanshan Wang, 2024. "Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning," Nature Communications, Nature, vol. 15(1), pages 1-12, December.


    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.