
Assessing Scientific Text Similarity: A Novel Approach Utilizing Non-Negative Matrix Factorization and Bidirectional Encoder Representations from Transformer

Author

Listed:
  • Zhixuan Jia

    (School of Information Management, Wuhan University, Wuhan 430072, China)

  • Wenfang Tian

    (School of Information Management, Wuhan University, Wuhan 430072, China)

  • Wang Li

    (School of Information Management, Wuhan University, Wuhan 430072, China)

  • Kai Song

    (Library, Shandong Normal University, Jinan 250358, China)

  • Fuxin Wang

    (School of Information Management, Wuhan University, Wuhan 430072, China)

  • Congjing Ran

    (School of Information Management, Wuhan University, Wuhan 430072, China;
    Shenzhen Research Institute, Wuhan University, Shenzhen 518057, China)

Abstract

Patents are a vital component of scientific text, and escalating competition has created substantial demand for patent analysis in areas such as company strategy and legal services, which in turn requires fast, accurate, and easily applicable similarity estimators. Applying natural language processing (NLP) to patent content, such as titles and abstracts, is an effective way to estimate similarity. However, traditional NLP approaches have disadvantages: they require large amounts of labeled data, and the internals of deep-learning-based models are hard to explain, a problem exacerbated by the highly compressed nature of patent content. Most knowledge-based deep learning models, in turn, require a vast amount of additional analysis results as training variables for similarity estimation, and these are limited because the analysis depends on human participation. To address these challenges, we introduce a novel estimator that enhances the transparency of similarity estimation. The approach integrates a patent’s content with its International Patent Classification (IPC), leveraging bidirectional encoder representations from transformers (BERT) and non-negative matrix factorization (NMF). By integrating these techniques, we aim to improve the transparency of knowledge discovery in NLP across various IPC dimensions and to incorporate more background knowledge into context similarity estimation. The experimental results demonstrate that our model is reliable, explainable, highly accurate, and practically usable.
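
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline, which this page does not describe. It assumes a hypothetical patent-by-IPC-code matrix (`ipc_matrix`), factorizes it with the standard multiplicative-update rule for NMF, and blends the cosine similarity of the resulting patent factors with the cosine similarity of text vectors. The text vectors here are hand-written stand-ins for BERT embeddings, and the names `nmf`, `combined_similarity`, and the weight `alpha` are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combined_similarity(text_a, text_b, factor_a, factor_b, alpha=0.5):
    """Blend text similarity with NMF-factor similarity (alpha is an assumed weight)."""
    return alpha * cosine(text_a, text_b) + (1 - alpha) * cosine(factor_a, factor_b)


# Hypothetical toy data: 4 patents x 4 IPC codes (1 = patent carries that code).
ipc_matrix = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Hand-written stand-ins for BERT text embeddings (NOT real model output).
text_vectors = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.2, 0.8, 0.4],
]

w, h = nmf(ipc_matrix, rank=2)
score_01 = combined_similarity(text_vectors[0], text_vectors[1], w[0], w[1])
score_02 = combined_similarity(text_vectors[0], text_vectors[2], w[0], w[2])
print(f"patents 0 vs 1: {score_01:.3f}")
print(f"patents 0 vs 2: {score_02:.3f}")
```

Because the rows of `w` are non-negative loadings on a small number of factors, each factor can be inspected through the IPC codes that dominate the matching row of `h`, which is the kind of transparency NMF is usually chosen for. In a real setting the stand-in vectors would be replaced by embeddings from an actual BERT model, and the weighting scheme would need to be validated against labeled data.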

Suggested Citation

  • Zhixuan Jia & Wenfang Tian & Wang Li & Kai Song & Fuxin Wang & Congjing Ran, 2024. "Assessing Scientific Text Similarity: A Novel Approach Utilizing Non-Negative Matrix Factorization and Bidirectional Encoder Representations from Transformer," Mathematics, MDPI, vol. 12(21), pages 1-18, October.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:21:p:3328-:d:1505083

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/21/3328/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/21/3328/
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhang, Yi & Huang, Ying & Porter, Alan L. & Zhang, Guangquan & Lu, Jie, 2019. "Discovering and forecasting interactions in big data research: A learning-enhanced bibliometric study," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 795-807.
    2. Kyuwoong Kim & Kyeongmin Park & Sungjoo Lee, 2019. "Investigating technology opportunities: the use of SAOx analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 45-70, January.
    3. Jiang, Cuiqing & Zhou, Yiru & Chen, Bo, 2023. "Mining semantic features in patent text for financial distress prediction," Technological Forecasting and Social Change, Elsevier, vol. 190(C).
    4. Myeongji Oh & Hyejin Jang & Sunhye Kim & Byungun Yoon, 2023. "Main path analysis for technological development using SAO structure and DEMATEL based on keyword causality," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(4), pages 2079-2104, April.
    5. Ren, Haiying & Zhao, Yuhui, 2021. "Technology opportunity discovery based on constructing, evaluating, and searching knowledge networks," Technovation, Elsevier, vol. 101(C).
    6. Cheng, Yu & Huang, Lucheng & Ramlogan, Ronnie & Li, Xin, 2017. "Forecasting of potential impacts of disruptive technology in promising technological areas: Elaborating the SIRS epidemic model in RFID technology," Technological Forecasting and Social Change, Elsevier, vol. 117(C), pages 170-183.
    7. Liu, Zhenfeng & Feng, Jian & Uden, Lorna, 2023. "Technology opportunity analysis using hierarchical semantic networks and dual link prediction," Technovation, Elsevier, vol. 128(C).
    8. Chen, Liang & Xu, Shuo & Zhu, Lijun & Zhang, Jing & Yang, Guancan & Xu, Haiyun, 2022. "A deep learning based method benefiting from characteristics of patents for semantic relation classification," Journal of Informetrics, Elsevier, vol. 16(3).
    9. Lee, Changyong, 2021. "A review of data analytics in technological forecasting," Technological Forecasting and Social Change, Elsevier, vol. 166(C).
    10. Zhou, Xiao & Huang, Lu & Porter, Alan & Vicente-Gomila, Jose M., 2019. "Tracing the system transformations and innovation pathways of an emerging technology: Solid lipid nanoparticles," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 785-794.
    11. Yuan Zhou & Fang Dong & Yufei Liu & Liang Ran, 2021. "A deep learning framework to early identify emerging technologies in large-scale outlier patents: an empirical study of CNC machine tool," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 969-994, February.
    12. An, Jaehyeong & Kim, Kyuwoong & Mortara, Letizia & Lee, Sungjoo, 2018. "Deriving technology intelligence from patents: Preposition-based semantic analysis," Journal of Informetrics, Elsevier, vol. 12(1), pages 217-236.
    13. Richarz, Jan & Wegewitz, Stephan & Henn, Sarah & Müller, Dirk, 2023. "Graph-based research field analysis by the use of natural language processing: An overview of German energy research," Technological Forecasting and Social Change, Elsevier, vol. 186(PB).
    14. Vicente-Gomila, J.M. & Artacho-Ramírez, M.A. & Ting, Ma & Porter, A.L., 2021. "Combining tech mining and semantic TRIZ for technology assessment: Dye-sensitized solar cell as a case," Technological Forecasting and Social Change, Elsevier, vol. 169(C).
    15. Chao Yang & Donghua Zhu & Xuefeng Wang & Yi Zhang & Guangquan Zhang & Jie Lu, 2017. "Requirement-oriented core technological components’ identification based on SAO analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(3), pages 1229-1248, September.
    16. Ma, Jing & Abrams, Natalie F. & Porter, Alan L. & Zhu, Donghua & Farrell, Dorothy, 2019. "Identifying translational indicators and technology opportunities for nanomedical research using tech mining: The case of gold nanostructures," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 767-775.
    17. Byungun Yoon & Songhee Kim & Sunhye Kim & Hyeonju Seol, 2022. "Doc2vec-based link prediction approach using SAO structures: application to patent network," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5385-5414, September.
    18. Zhang, Yi & Wu, Mengjia & Miao, Wen & Huang, Lu & Lu, Jie, 2021. "Bi-layer network analytics: A methodology for characterizing emerging general-purpose technologies," Journal of Informetrics, Elsevier, vol. 15(4).
    19. Liang Chen & Shuo Xu & Lijun Zhu & Jing Zhang & Xiaoping Lei & Guancan Yang, 2020. "A deep learning based method for extracting semantic information from patent documents," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 289-312, October.
    20. Wang, Jinfeng & Zhang, Zhixin & Feng, Lijie & Lin, Kuo-Yi & Liu, Peng, 2023. "Development of technology opportunity analysis based on technology landscape by extending technology elements with BERT and TRIZ," Technological Forecasting and Social Change, Elsevier, vol. 191(C).
