
Data cleansing for Web information retrieval using query independent features

Author

Listed:
  • Yiqun Liu
  • Min Zhang
  • Rongwei Cen
  • Liyun Ru
  • Shaoping Ma

Abstract

Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous work has used hyperlink analysis algorithms to solve this problem. However, little research has focused on query‐independent Web data cleansing for Web IR. In this paper, we first analyze the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning‐based data cleansing algorithm for removing Web pages that are unlikely to be useful for user requests. We found that a large proportion of low‐quality Web pages exists in both the English and the Chinese Web page corpora, and that retrieval target pages can be identified using query‐independent features and cleansing algorithms. The experimental results showed that our algorithm removes a large portion of Web pages with only a small loss in retrieval target pages. This makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of the pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance.
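The abstract does not name the features or the learner used, so the following is only a minimal sketch of what learning-based, query-independent cleansing could look like, not the authors' algorithm. The three features (in_links, doc_length, url_depth), their thresholds, the naive-Bayes-style scoring, and the toy corpus are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import log


@dataclass(frozen=True)
class Page:
    url: str
    in_links: int      # hypothetical feature: number of incoming hyperlinks
    doc_length: int    # hypothetical feature: length in words
    url_depth: int     # hypothetical feature: number of URL path segments
    is_target: bool = False  # label: known retrieval target page


def binarize(page):
    """Map raw features to boolean indicators (thresholds are illustrative)."""
    return (page.in_links >= 5, page.doc_length >= 200, page.url_depth <= 2)


def train(pages):
    """Estimate per-feature log-likelihood ratios with add-one smoothing."""
    targets = [binarize(p) for p in pages if p.is_target]
    others = [binarize(p) for p in pages if not p.is_target]
    weights = []
    for i in range(3):
        p_t = (sum(f[i] for f in targets) + 1) / (len(targets) + 2)
        p_o = (sum(f[i] for f in others) + 1) / (len(others) + 2)
        # One weight when the indicator is on, another when it is off.
        weights.append((log(p_t / p_o), log((1 - p_t) / (1 - p_o))))
    return weights


def score(page, weights):
    """Sum the log-likelihood ratios; higher means more target-like."""
    return sum(w_on if on else w_off
               for on, (w_on, w_off) in zip(binarize(page), weights))


def cleanse(pages, weights, threshold=0.0):
    """Keep only pages whose score reaches the (tunable) threshold."""
    return [p for p in pages if score(p, weights) >= threshold]


# Toy corpus, invented for this sketch; no query is involved anywhere.
corpus = [
    Page("a.example/", 40, 800, 1, True),
    Page("b.example/docs/", 12, 350, 2, True),
    Page("c.example/", 7, 150, 1, True),
    Page("d.example/x/y/z/w/v", 0, 50, 5),
    Page("e.example/x/y/z/w", 1, 900, 4),
    Page("f.example/x/y/z", 0, 30, 3),
    Page("g.example/x/y/z/w/v/u", 6, 20, 6),
    Page("h.example/x/y/z/w", 0, 100, 4),
]
weights = train(corpus)
kept = cleanse(corpus, weights)
retained = sum(p.is_target for p in kept) / sum(p.is_target for p in corpus)
print(f"kept {len(kept)} of {len(corpus)} pages; target pages retained: {retained:.0%}")
```

The sketch trains and evaluates on the same toy data, so it only shows the mechanics: the score depends on page properties alone, and the threshold trades corpus reduction against loss of retrieval target pages.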

Suggested Citation

  • Yiqun Liu & Min Zhang & Rongwei Cen & Liyun Ru & Shaoping Ma, 2007. "Data cleansing for Web information retrieval using query independent features," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 58(12), pages 1884-1898, October.
  • Handle: RePEc:bla:jamist:v:58:y:2007:i:12:p:1884-1898
    DOI: 10.1002/asi.20633

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.20633
    Download Restriction: no


    Citations


    Cited by:

    1. Jozef Kapusta & Michal Munk & Martin Drlik, 2018. "Website Structure Improvement Based on the Combination of Selected Web Structure and Web Usage Mining Methods," International Journal of Information Technology & Decision Making (IJITDM), World Scientific Publishing Co. Pte. Ltd., vol. 17(06), pages 1743-1776, November.

