
Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information

Author

Listed:
  • Xiaolong Wu

    (School of Medicine, Xiamen University, Xiamen 361105, China
    National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361105, China
    Data Mining Research Center, Xiamen University, Xiamen 361005, China)

  • Chong Feng

    (Data Mining Research Center, Xiamen University, Xiamen 361005, China
    School of Mathematics and Statistics, Xiamen University of Technology, Xiamen 361105, China)

  • Qiyuan Li

    (School of Medicine, Xiamen University, Xiamen 361105, China
    National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361105, China)

  • Jianping Zhu

    (National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361105, China
    School of Mathematics and Statistics, Xiamen University of Technology, Xiamen 361105, China
    School of Management, Xiamen University, Xiamen 361005, China)

Abstract

Keyword pools are used as search queries to collect web texts; they largely determine the size and coverage of the samples and provide the data basis for subsequent text mining. However, generating a refined keyword pool with high similarity to the topic and some expandability remains a challenge. Currently, keyword pools for search queries aimed at collecting web texts either lack an objective generation method and evaluation system or make little use of the semantic information in the samples. Therefore, this paper proposes a keyword generation framework that integrates sample and semantic information to construct a complete and objective keyword pool generation and evaluation system. The framework includes a data phase and a modeling phase; its core is the modeling phase, where both feature ranking and model performance are considered. A regression model relating a topic vector to word vectors is constructed for the first time based on word embedding, and keyword pools are generated from the perspective of model performance. In addition, two keyword generation methods, Recursive Feature Introduction (RFI) and Recursive Feature Introduction and Elimination (RFIE), are proposed. Different feature ranking algorithms, keyword generation methods, and regression models are compared in the experiments. The results show that: (1) When RFI is used to generate keywords, the regression model using ranked features has better prediction performance than the baseline model and generates a more refined set of keywords, and the prediction performance of the regression model using tree-based ranked features is significantly better than that of the one using SHAP-based ranked features. (2) The prediction performance of the regression model using RFI with tree-based ranked features is significantly better than that of the model using Recursive Feature Elimination (RFE) with tree-based ranked features. (3) All four regression models using RFI/RFE with SHAP-based/tree-based ranked features have significantly higher average similarity scores and cumulative advantages than the baseline model (the model using RFI with unranked features). (4) Light Gradient Boosting Machine (LGBM) using RFI with SHAP-based ranked features has significantly better prediction performance, higher average similarity scores, and cumulative advantages. In conclusion, the framework can generate a keyword pool that is more similar to the topic, more refined, and more expandable, which offers research ideas for expanding the research sample size while preserving topic coverage in web text collecting.
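The abstract does not spell out how RFI works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each embedding dimension is one observation, the topic vector is the regression target, and each candidate word's vector is one feature. Words are introduced in ranked order and kept only when they lower the hold-out error. Ordinary least squares stands in for the regression models compared in the paper (such as LGBM), absolute correlation stands in for tree-based or SHAP-based ranking, and the names `holdout_mse`, `recursive_feature_introduction`, and `min_gain` are hypothetical.

```python
import numpy as np


def holdout_mse(W, t, cols, train, test):
    """Fit least squares on the train rows using columns `cols`; return test-row MSE."""
    if not cols:
        # No words selected yet: predict the mean of the training targets.
        return float(np.mean((t[test] - t[train].mean()) ** 2))
    coef, *_ = np.linalg.lstsq(W[train][:, cols], t[train], rcond=None)
    return float(np.mean((t[test] - W[test][:, cols] @ coef) ** 2))


def recursive_feature_introduction(W, t, ranking, train, test, min_gain=1e-6):
    """Introduce words in ranked order; keep a word only if hold-out MSE drops.

    W: (n_dims, n_words) array, one column per candidate word vector.
    t: (n_dims,) topic vector used as the regression target.
    ranking: word indices, most important first.
    min_gain: hypothetical threshold a word must improve the error by.
    """
    selected = []
    best = holdout_mse(W, t, selected, train, test)
    for j in ranking:
        trial = selected + [int(j)]
        err = holdout_mse(W, t, trial, train, test)
        if best - err > min_gain:
            selected, best = trial, err
    return selected, best


# Synthetic stand-in data (assumption): the topic vector is an exact
# mixture of word 0 and word 2; the other four words are unrelated noise.
rng = np.random.default_rng(0)
n_dims, n_words = 200, 6
W = rng.normal(size=(n_dims, n_words))
topic = 0.6 * W[:, 0] + 0.4 * W[:, 2]

# Split embedding dimensions into fitting rows and hold-out rows.
order = rng.permutation(n_dims)
train, test = order[:150], order[150:]

# Stand-in ranking: absolute correlation with the topic on the train rows.
scores = [abs(np.corrcoef(W[train, j], topic[train])[0, 1]) for j in range(n_words)]
ranking = np.argsort(scores)[::-1]

selected, error = recursive_feature_introduction(W, topic, ranking, train, test)
print("selected word indices:", selected, "hold-out MSE:", error)
```

A word rejected at its turn is never reconsidered in this sketch, so the result depends on the quality of the ranking. This is consistent with the abstract's comparison of ranked against unranked features, though the paper's own procedure may differ.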

Suggested Citation

  • Xiaolong Wu & Chong Feng & Qiyuan Li & Jianping Zhu, 2024. "Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information," Mathematics, MDPI, vol. 12(3), pages 1-15, January.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:3:p:405-:d:1327281

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/3/405/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/3/405/
    Download Restriction: no
