Printed from https://ideas.repec.org/a/bla/jamist/v57y2006i2p208-221.html

Link‐based similarity measures for the classification of Web documents

Author

Listed:
  • Pável Calado
  • Marco Cristo
  • Marcos André Gonçalves
  • Edleno S. de Moura
  • Berthier Ribeiro‐Neto
  • Nivio Ziviani

Abstract

Traditional text‐based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text‐based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text‐based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
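
As a rough illustration of the idea in the abstract, the sketch below scores the similarity of two pages from their shared in-links and assigns a topic by a similarity-weighted vote among labelled pages. The abstract does not name the five measures or the classifier used, so the co-citation measure, the nearest-neighbour vote, the function names `cocitation` and `classify`, and the toy data are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def cocitation(parents_a, parents_b):
    """Share of in-linking pages that two documents have in common (assumed measure)."""
    union = parents_a | parents_b
    if not union:
        return 0.0  # neither page has in-links, so there is no link evidence
    return len(parents_a & parents_b) / len(union)


def classify(target, inlinks, labels, k=3):
    """Pick the class with the highest summed similarity among the k most similar labelled pages."""
    scored = sorted(
        ((cocitation(inlinks[target], inlinks[doc]), doc)
         for doc in labels if doc != target),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, doc in scored:
        votes[labels[doc]] += sim
    if not votes or max(votes.values()) == 0.0:
        return None  # no labelled page shares any in-link with the target
    return max(votes, key=votes.get)


# Hypothetical toy data: each page maps to the set of pages that link to it.
inlinks = {
    "a": {"p1", "p2", "p3"},
    "b": {"p1", "p2"},
    "c": {"p4", "p5"},
    "x": {"p2", "p3"},
}
labels = {"a": "sports", "b": "sports", "c": "finance"}

print(classify("x", inlinks, labels))
```

In this toy case page `x` shares in-links only with the two pages labelled `sports`, so the vote goes to that class. A page with no shared in-links returns `None`, which is the case where a text-based classifier would have to take over.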

Suggested Citation

  • Pável Calado & Marco Cristo & Marcos André Gonçalves & Edleno S. de Moura & Berthier Ribeiro‐Neto & Nivio Ziviani, 2006. "Link‐based similarity measures for the classification of Web documents," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 57(2), pages 208-221, January.
  • Handle: RePEc:bla:jamist:v:57:y:2006:i:2:p:208-221
    DOI: 10.1002/asi.20266

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.20266
    Download Restriction: no


