Printed from https://ideas.repec.org/a/hin/jnlmpe/6399375.html

Design of New Word Retrieval Algorithm for Chinese-English Bilingual Parallel Corpus

Author

Listed:
  • Liting Zhang
  • Naeem Jan

Abstract

Natural language processing is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Machine translation is a branch of natural language processing research that relies on large-scale English-Chinese databases. Because aligned corpora of English-Chinese bilingual sentences containing unknown words are relatively poor, machine translation is unprofessional and unbalanced, which is the problem studied in this paper. The purpose of this paper is to design and implement a length-based system for sentence alignment between English and Chinese bilingual texts. The research content is divided into the following parts. First, an evaluation function for bilingual sentence alignment is designed, and on this basis a length-based bilingual sentence alignment algorithm and an optimal sentence-pair sequence search algorithm are designed. China National Knowledge Infrastructure (CNKI) is selected as the candidate English-Chinese bilingual website, and its English-Chinese bilingual web pages are downloaded. The downloaded pages are analyzed, non-text content such as page tags is removed, and the bilingual text is stored to build a segment-aligned English-Chinese bilingual corpus that retains the English-Chinese bilingual keywords from the web pages. Second, the dictionary is extracted from the StarDict software, and its original format is analyzed and converted into a custom dictionary format that is easier for the bilingual sentence alignment system to use, which helps expand the number of dictionaries and makes the vocabulary more specialized.
Third, the stems of English words are extracted from the established corpus to simplify English word processing, reduce the noise caused by part-of-speech conversion, and improve efficiency. A length-based bilingual sentence alignment system is then implemented. Finally, the system parameters are adjusted in comparative experiments to test the system's performance.
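
The abstract does not give the evaluation function or the search procedure, so the following is only a minimal sketch of how length-based sentence alignment with an optimal sentence-pair sequence search is commonly set up. The cost function, the `ASSUMED_RATIO` value, the `BEAD_PENALTY` table, and the names `length_cost` and `align_by_length` are all hypothetical choices made for illustration, not the authors' method.

```python
import math

# Assumed ratio of Chinese characters to English characters (hypothetical;
# in practice this would be estimated from the corpus).
ASSUMED_RATIO = 0.5

# Assumed penalties for each alignment pattern (English count, Chinese count).
# One-to-one pairs are free; insertions, deletions, and merges are penalized.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(en_len, zh_len, ratio):
    """Score how badly two segment lengths disagree (0 means a perfect match)."""
    expected_zh = en_len * ratio
    return abs(zh_len - expected_zh) / math.sqrt(en_len + zh_len + 1)


def align_by_length(en_sents, zh_sents, ratio=ASSUMED_RATIO):
    """Return the lowest-cost list of (english_indices, chinese_indices) pairs."""
    en_lens = [len(s) for s in en_sents]
    zh_lens = [len(s) for s in zh_sents]
    n, m = len(en_lens), len(zh_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    # Dynamic programming over prefixes: best[i][j] is the cheapest way to
    # align the first i English sentences with the first j Chinese sentences.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty + length_cost(
                    sum(en_lens[i:ni]), sum(zh_lens[j:nj]), ratio
                )
                if best[i][j] + step < best[ni][nj]:
                    best[ni][nj] = best[i][j] + step
                    back[ni][nj] = (di, dj)

    # Walk the back-pointers from the end to recover the sentence-pair sequence.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Calling `align_by_length(english_sentences, chinese_sentences)` on two lists of sentence strings returns index groups, so a pair such as `([0, 1], [0])` means the first two English sentences were matched to the first Chinese sentence. Adjusting `ratio` and the penalties is one way to run the kind of parameter comparison the abstract mentions.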

Suggested Citation

  • Liting Zhang & Naeem Jan, 2022. "Design of New Word Retrieval Algorithm for Chinese-English Bilingual Parallel Corpus," Mathematical Problems in Engineering, Hindawi, vol. 2022, pages 1-9, March.
  • Handle: RePEc:hin:jnlmpe:6399375
    DOI: 10.1155/2022/6399375

    Download full text from publisher

    File URL: http://downloads.hindawi.com/journals/mpe/2022/6399375.pdf
    Download Restriction: no

    File URL: http://downloads.hindawi.com/journals/mpe/2022/6399375.xml
    Download Restriction: no

    File URL: https://libkey.io/10.1155/2022/6399375?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
