
Identification of the most important external features of highly cited scholarly papers through 3 (i.e., Ridge, Lasso, and Boruta) feature selection data mining methods

Author

Listed:
  • Sepideh Fahimifar

    (University of Tehran)

  • Khadijeh Mousavi

    (University of Tehran)

  • Fatemeh Mozaffari

    (University of Tehran)

  • Marcel Ausloos

    (University of Leicester, Brookfield
    Bucharest University of Economic Studies
    GRAPES, Rue de La Belle Jardiniere)

Abstract

Highly cited papers are influenced by external factors that are not directly related to the document's intrinsic quality. In this study, 50 characteristics for measuring the performance of 68 highly cited papers, published from 2009 to 2019 in the Journal of the American Medical Informatics Association and indexed in Web of Science (WOS), are investigated. In the first step, a Pearson correlation analysis is performed to eliminate variables with zero or weak correlation with the target ("dependent") variable (number of citations in WOS). Consequently, 32 variables are selected for the next step. Under the Ridge technique, 13 features show a positive effect on the number of citations. Across the three algorithms (Ridge, Lasso, and Boruta), 6 factors appear to be the most relevant. The "Number of citations by international researchers", "Journal self-citations in citing documents", and "Authors' self-citations in citing documents" are recognized as the most important features by all three methods. The "First author's scientific age", "Open-access paper", and "Number of first author's citations in WOS" are identified as important features of highly cited papers by only two methods, Ridge and Lasso. Note that we use specific machine learning algorithms (Ridge, Lasso, and Boruta) as feature selection methods to identify the most important features of highly cited papers; these tools had not previously been used for this purpose. In conclusion, we re-emphasize the performance of such algorithms. Moreover, we do not advise authors to seek to increase the citations of their articles by manipulating the identified performance features. Indeed, ethical rules regarding these characteristics must be strictly obeyed.
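
The two-step procedure described in the abstract (a Pearson correlation filter followed by penalized regression) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the feature names are borrowed loosely from the abstract, and the correlation threshold (0.2), the penalty strengths, and the helper names `pearson`, `standardize`, and `coordinate_descent` are all assumptions made for illustration. Ridge and Lasso are fitted here with a simple coordinate descent on standardized columns. Boruta is omitted because it relies on a random forest, which does not fit in a short standard-library example.

```python
import random
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def standardize(col):
    """Rescale a column to zero mean and unit (population) variance."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]

def coordinate_descent(cols, y, lam, penalty, sweeps=200):
    """Penalized least squares on standardized columns and a centered target.

    penalty="lasso": minimize (1/2n)*||y - Xw||^2 + lam*||w||_1
    penalty="ridge": minimize (1/2n)*||y - Xw||^2 + (lam/2)*||w||^2
    """
    n = len(y)
    w = [0.0] * len(cols)
    resid = list(y)  # residual y - Xw, starting from w = 0
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            if penalty == "lasso":
                # Soft-thresholding: small effects are set exactly to zero.
                new = max(abs(rho) - lam, 0.0) * (1 if rho > 0 else -1)
            else:
                # Ridge shrinks every coefficient but never zeroes it.
                new = rho / (1.0 + lam)
            delta = new - w[j]
            if delta:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new
    return w

# Synthetic stand-in data (assumption): 68 "papers", 4 hypothetical features.
random.seed(0)
n = 68
names = ["intl_citations", "journal_self_citations", "noise_a", "noise_b"]
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in names}
citations = [3 * a + 2 * b + random.gauss(0, 1)
             for a, b in zip(data["intl_citations"], data["journal_self_citations"])]

# Step 1: drop features with weak correlation to the target (threshold assumed).
kept = [name for name in names if abs(pearson(data[name], citations)) >= 0.2]

# Step 2: fit Ridge and Lasso on the surviving, standardized features.
cols = [standardize(data[name]) for name in kept]
y_mean = mean(citations)
y = [v - y_mean for v in citations]
ridge_w = coordinate_descent(cols, y, lam=1.0, penalty="ridge")
lasso_w = coordinate_descent(cols, y, lam=0.3, penalty="lasso")

ridge_positive = [name for name, w in zip(kept, ridge_w) if w > 0]
lasso_selected = [name for name, w in zip(kept, lasso_w) if w != 0.0]
print("kept after Pearson filter:", kept)
print("positive Ridge coefficients:", ridge_positive)
print("non-zero Lasso coefficients:", lasso_selected)
```

In this sketch, Ridge only shrinks coefficients, so "selection" means reading their sign and size, whereas Lasso can set coefficients exactly to zero. That difference is one plausible reason why the methods in the study need not agree on every feature.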

Suggested Citation

  • Sepideh Fahimifar & Khadijeh Mousavi & Fatemeh Mozaffari & Marcel Ausloos, 2023. "Identification of the most important external features of highly cited scholarly papers through 3 (i.e., Ridge, Lasso, and Boruta) feature selection data mining methods," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3685-3712, August.
  • Handle: RePEc:spr:qualqt:v:57:y:2023:i:4:d:10.1007_s11135-022-01480-z
    DOI: 10.1007/s11135-022-01480-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11135-022-01480-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11135-022-01480-z?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    As the access to this document is restricted, you may want to search for a different version of it.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Martorell Cunil, Onofre & Otero González, Luis & Durán Santomil, Pablo & Mulet Forteza, Carlos, 2023. "How to accomplish a highly cited paper in the tourism, leisure and hospitality field," Journal of Business Research, Elsevier, vol. 157(C).
    2. Ruan, Xuanmin & Zhu, Yuanyang & Li, Jiang & Cheng, Ying, 2020. "Predicting the citation counts of individual papers via a BP neural network," Journal of Informetrics, Elsevier, vol. 14(3).
    3. Mingyang Wang & Shi Li & Guangsheng Chen, 2017. "Detecting latent referential articles based on their vitality performance in the latest 2 years," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(3), pages 1557-1571, September.
    4. Kong, Ling & Wang, Dongbo, 2020. "Comparison of citations and attention of cover and non-cover papers," Journal of Informetrics, Elsevier, vol. 14(4).
    5. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    6. Mingyang Wang & Zhenyu Wang & Guangsheng Chen, 2019. "Which can better predict the future success of articles? Bibliometric indices or alternative metrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1575-1595, June.
    7. Wanjun Xia & Tianrui Li & Chongshou Li, 2023. "A review of scientific impact prediction: tasks, features and methods," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(1), pages 543-585, January.
    8. Iman Tahamtan & Askar Safipour Afshar & Khadijeh Ahamdzadeh, 2016. "Factors affecting number of citations: a comprehensive review of the literature," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(3), pages 1195-1225, June.
    9. Li, Xin & Ma, Xiaodi & Feng, Ye, 2024. "Early identification of breakthrough research from sleeping beauties using machine learning," Journal of Informetrics, Elsevier, vol. 18(2).
    10. Zhang, Xinyuan & Xie, Qing & Song, Min, 2021. "Measuring the impact of novelty, bibliometric, and academic-network factors on citation count using a neural network," Journal of Informetrics, Elsevier, vol. 15(2).
    11. Zahedi, Zohreh & Haustein, Stefanie, 2018. "On the relationships between bibliographic characteristics of scientific documents and citation and Mendeley readership counts: A large-scale analysis of Web of Science publications," Journal of Informetrics, Elsevier, vol. 12(1), pages 191-202.
    12. Juan Xie & Kaile Gong & Ying Cheng & Qing Ke, 2019. "The correlation between paper length and citations: a meta-analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(3), pages 763-786, March.
    13. Ha, Taehyun, 2022. "An explainable artificial-intelligence-based approach to investigating factors that influence the citation of papers," Technological Forecasting and Social Change, Elsevier, vol. 184(C).
    14. Anqi Ma & Yu Liu & Xiujuan Xu & Tao Dong, 2021. "A deep-learning based citation count prediction model with paper metadata semantic features," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6803-6823, August.
    15. Maksym Polyakov & Serhiy Polyakov & Md Sayed Iftekhar, 2017. "Does academic collaboration equally benefit impact of research across topics? The case of agricultural, resource, environmental and ecological economics," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1385-1405, December.
    16. Kaile Gong & Juan Xie & Ying Cheng & Vincent Larivière & Cassidy R. Sugimoto, 2019. "The citation advantage of foreign language references for Chinese social science papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1439-1460, September.
    17. Sergio Jimenez & Youlin Avila & George Dueñas & Alexander Gelbukh, 2020. "Automatic prediction of citability of scientific articles by stylometry of their titles and abstracts," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 3187-3232, December.
    18. Bornmann, Lutz & Haunschild, Robin & Mutz, Rüdiger, 2020. "Should citations be field-normalized in evaluative bibliometrics? An empirical analysis based on propensity score matching," Journal of Informetrics, Elsevier, vol. 14(4).
    19. Liu, Qiuling & Guo, Lei & Sun, Yiping & Ren, Linlin & Wang, Xinhua & Han, Xiaohui, 2024. "Do scholars' collaborative tendencies impact the quality of their publications? A generalized propensity score matching analysis," Journal of Informetrics, Elsevier, vol. 18(1).
    20. Juan Xie & Kaile Gong & Jiang Li & Qing Ke & Hyonchol Kang & Ying Cheng, 2019. "A probe into 66 factors which are possibly associated with the number of citations an article received," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1429-1454, June.

    More about this item

    Keywords

    Highly cited articles; Feature selections; Altmetrics; Ridge; Lasso; Boruta;

    JEL classification:

    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General
    • Y80 - Miscellaneous Categories - - Related Disciplines - - - Related Disciplines
