Author
Listed:
- Christopher C. Yang
- Johnny W.K. Luk
- Stanley K. Yung
- Jerome Yen
Abstract
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. The digital library research projects (DLI‐1) sponsored by NSF/DARPA/NASA share the theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching: the better the indexing, the better the search results. Developing a universal digital library has been the dream of many researchers; however, many problems remain to be solved before that vision is fulfilled. The most critical is supporting cross‐lingual retrieval, or a multilingual digital library. Much work has been done on English information retrieval, but relatively little on Chinese information retrieval. In this article, we focus on Chinese indexing, which is the foundation of Chinese and cross‐lingual information retrieval. The smallest indexing units in Chinese digital libraries are words, while the smallest units in a Chinese sentence are characters. However, Chinese text has no delimiter to mark word boundaries, as English text does. In English and other languages using Roman or Greek‐based orthographies, spacing often reliably indicates word boundaries. In Chinese, characters are placed together without any delimiters indicating the boundaries between consecutive words. In this article, we investigate combination and boundary detection approaches to segmentation based on mutual information. The combination approach combines n‐grams to form longer words; within it, Algorithm 1 does not allow n‐grams to overlap, while Algorithm 2 does. The boundary detection approach detects segmentation points in a sentence from the values, and the changes in the values, of the mutual information. Experiments are conducted to evaluate the performance of the approaches. An interface to the system is also presented to show how a Chinese web page is downloaded, its text filtered, and the text segmented into words. The segmented words can be submitted for indexing, or new unknown words can be identified and submitted to a dictionary.
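As a rough illustration of the boundary detection idea, the following is a minimal sketch, not the authors' algorithm. It assumes pointwise mutual information between adjacent characters, estimated from raw counts in a small corpus, and it cuts a sentence wherever that value falls below a fixed threshold. The function names (`train_counts`, `mutual_information`, `segment`), the threshold, and the toy corpus are all hypothetical. The sketch uses only the mutual information values themselves, not the change in values that the abstract also mentions.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(pair, unigrams, bigrams):
    """Pointwise mutual information log2(P(xy) / (P(x) * P(y))) of a character pair.

    Returns -inf for a pair that never occurs in the corpus (an assumption of
    this sketch; a real system would more likely smooth the counts).
    """
    if bigrams[pair] == 0:
        return float("-inf")
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    p_xy = bigrams[pair] / total_bi
    p_x = unigrams[pair[0]] / total_uni
    p_y = unigrams[pair[1]] / total_uni
    return math.log2(p_xy / (p_x * p_y))


def segment(sentence, unigrams, bigrams, threshold=0.0):
    """Cut the sentence wherever adjacent-character MI falls below the threshold."""
    if not sentence:
        return []
    words, current = [], sentence[0]
    for i in range(len(sentence) - 1):
        mi = mutual_information(sentence[i:i + 2], unigrams, bigrams)
        if mi < threshold:
            words.append(current)        # weak association: close the current word
            current = sentence[i + 1]
        else:
            current += sentence[i + 1]   # strong association: extend the current word
    words.append(current)
    return words


# Hypothetical toy corpus, for illustration only.
corpus = ["我们喜欢图书馆", "他们喜欢书", "我们去图书馆"]
unigrams, bigrams = train_counts(corpus)
print(segment("图书馆我们", unigrams, bigrams, threshold=-1000.0))
```

With a corpus this small the estimates are too noisy to separate real word boundaries reliably; the very low threshold in the usage line only cuts at character pairs never seen in the corpus. A realistic threshold would have to be tuned on a much larger collection.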
Suggested Citation
Christopher C. Yang & Johnny W.K. Luk & Stanley K. Yung & Jerome Yen, 2000.
"Combination and boundary detection approaches on Chinese indexing,"
Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 51(4), pages 340-351.
Handle:
RePEc:bla:jamest:v:51:y:2000:i:4:p:340-351
DOI: 10.1002/(SICI)1097-4571(2000)51:43.0.CO;2-I