Printed from https://ideas.repec.org/a/bla/jamist/v60y2009i12p2530-2539.html

Comparing a rule‐based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty

Author

Listed:
  • Susanne M. Humphrey
  • Aurélie Névéol
  • Allen Browne
  • Julien Gobeil
  • Patrick Ruch
  • Stéfan J. Darmoni

Abstract

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline‐based categories. Two systems are described and contrasted: CISMeF, which assigns metaterms (MTs) using rules based on human indexing of the documents with the Medical Subject Headings (MeSH) controlled vocabulary, and Journal Descriptor Indexing (JDI), which is based on human categorization of about 4,000 journals and on statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of human-assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that performance is comparable on five of the measures and that JDI is superior on the sixth. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and in maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than with textwords. It may be worthwhile to investigate whether this statistical JDI method and the rule‐based CISMeF could be combined, and then to evaluate whether the two are complementary.
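The contrast between the two approaches in the abstract can be illustrated with a minimal Python sketch. It is not the CISMeF or JDI implementation: the rule table, the training records, the category names, and the scoring scheme (averaging per-word association scores) are all hypothetical assumptions chosen only to show the difference between a hand-maintained mapping and associations estimated from categorized text.

```python
from collections import defaultdict

# Hypothetical rule base mapping MeSH terms to metaterms (MTs).
# Illustrative entries only; these are not CISMeF's actual rules.
MESH_TO_MT = {
    "Myocardial Infarction": ["cardiology"],
    "Stents": ["cardiology", "surgery"],
    "Asthma": ["pneumology"],
}


def rule_based_categories(mesh_terms):
    """Rank MTs by how many of the document's MeSH terms map to them."""
    counts = defaultdict(int)
    for term in mesh_terms:
        for mt in MESH_TO_MT.get(term, []):
            counts[mt] += 1
    # Most-supported MT first; ties broken alphabetically.
    return sorted(counts, key=lambda mt: (-counts[mt], mt))


# Hypothetical training records: (journal descriptors of the journal,
# text words of one document published in it).
TRAINING = [
    (["Cardiology"], ["heart", "infarction", "stent"]),
    (["Cardiology"], ["heart", "artery"]),
    (["Pulmonary Medicine"], ["asthma", "lung", "airway"]),
]


def word_jd_associations(training):
    """For each word, the fraction of documents containing it per JD."""
    word_total = defaultdict(int)
    word_jd = defaultdict(lambda: defaultdict(int))
    for jds, words in training:
        for word in set(words):
            word_total[word] += 1
            for jd in jds:
                word_jd[word][jd] += 1
    return {
        word: {jd: n / word_total[word] for jd, n in jd_counts.items()}
        for word, jd_counts in word_jd.items()
    }


def statistical_categories(words, assoc):
    """Rank JDs by the average association score of the known words."""
    known = [w for w in words if w in assoc]
    scores = defaultdict(float)
    for word in known:
        for jd, score in assoc[word].items():
            scores[jd] += score / len(known)
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this sketch the rule-based function needs MeSH terms that a human indexer has already assigned plus a curated mapping, while the statistical function needs only the document's words and associations derived from journal-level categorization, which mirrors the overhead argument made in the abstract.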

Suggested Citation

  • Susanne M. Humphrey & Aurélie Névéol & Allen Browne & Julien Gobeil & Patrick Ruch & Stéfan J. Darmoni, 2009. "Comparing a rule‐based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 60(12), pages 2530-2539, December.
  • Handle: RePEc:bla:jamist:v:60:y:2009:i:12:p:2530-2539
    DOI: 10.1002/asi.21170

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.21170
    Download Restriction: no

    File URL: https://libkey.io/10.1002/asi.21170?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
