
Automatic prediction of citability of scientific articles by stylometry of their titles and abstracts

Author

Listed:
  • Sergio Jimenez

    (Instituto Caro y Cuervo)

  • Youlin Avila

    (CIC, Instituto Politécnico Nacional; Universidad Pedagógica Nacional)

  • George Dueñas

    (Instituto Caro y Cuervo)

  • Alexander Gelbukh

    (CIC, Instituto Politécnico Nacional)

Abstract

The decision to read a research paper or not is commonly made while reading its title and abstract. Although content and merit should drive that decision, other factors such as writing style may intervene. In turn, more readings could produce more citations. We investigated the stylistic factors in the title and abstract of research papers that affect their “citability”, and built a prediction model for citations at 5, 10, and 15 years. Since the number of citations is the preferred ranking function of several academic search engines, our “citability” function could alleviate the under-representation of recent, not-yet-cited papers in query results. For this study, we collected a large dataset of around 750,000 titles and abstracts from articles in Scopus, intended to be representative of science as a whole. For each instance, we extracted a relatively large set of 3578 stylistic features at different linguistic levels: characters, syllables, tokens (i.e. words), sentences, stop/content words, and part-of-speech (POS) tags. In particular, we present a novel set of corpus-based stylistic features that we call Corpus Spectral Signatures (CSS). We found that a linear prediction model for citations (binned into quartiles) built with only the top-250 correlated features achieved a mean absolute error of 0.805 quartiles, and that, on average, predictions were highly correlated with their real values (Spearman’s $\rho = 0.515$). CSS features were among the top correlated features, but POS features were the most predictive group of features in an ablation study.
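The evaluation setup named in the abstract (surface features, quartile-binned citation counts, correlation-based feature ranking, mean absolute error, and Spearman's rho) can be illustrated with a minimal sketch. This is not the authors' implementation: the five features, the tiny stop-word list, and every function name below are assumptions chosen for illustration, and no linear model is fitted here. The sketch uses only the Python standard library.

```python
import math
import re
from statistics import mean

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "we"}

def stylometric_features(text):
    """Return a few surface features at character, token, and sentence level."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty text
    return {
        "n_chars": len(text),
        "n_tokens": len(tokens),
        "n_sentences": len(sentences),
        "mean_token_len": sum(map(len, tokens)) / n_tokens,
        "stop_word_ratio": sum(t in STOP_WORDS for t in tokens) / n_tokens,
    }

def quartile_bins(values):
    """Map each value to a quartile label 0..3 by rank (ties split arbitrarily)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(4 * rank // len(values), 3)
    return bins

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def top_k_features(rows, target, k):
    """Names of the k features with the largest absolute Pearson correlation."""
    scores = {
        name: abs(pearson([row[name] for row in rows], target))
        for name in rows[0]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def mean_absolute_error(true_bins, predicted_bins):
    """Average absolute distance between true and predicted quartile labels."""
    return mean(abs(t - p) for t, p in zip(true_bins, predicted_bins))
```

In a pipeline of this shape, `stylometric_features` would be applied to each title and abstract, `quartile_bins` to the citation counts, and `top_k_features` to choose the inputs of whatever regression model is fitted. `mean_absolute_error` and `spearman` then compare predicted and true quartiles.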

Suggested Citation

  • Sergio Jimenez & Youlin Avila & George Dueñas & Alexander Gelbukh, 2020. "Automatic prediction of citability of scientific articles by stylometry of their titles and abstracts," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 3187-3232, December.
  • Handle: RePEc:spr:scient:v:125:y:2020:i:3:d:10.1007_s11192-020-03526-1
    DOI: 10.1007/s11192-020-03526-1

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-020-03526-1
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Babak Sohrabi & Hamideh Iraj, 2017. "The effect of keyword repetition in abstract and keyword frequency per journal in predicting citation counts," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(1), pages 243-251, January.
    2. Lakshmi Balachandran Nair & Michael Gibbert, 2016. "What makes a ‘good’ title and (how) does it matter for citations? A review and general model of article title attributes in management science," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(3), pages 1331-1359, June.
    3. Feng Guo & Chao Ma & Qingling Shi & Qingqing Zong, 2018. "Succinct effect or informative effect: the relationship between title length and the number of citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(3), pages 1531-1539, September.
    4. Matthias Gnewuch & Klaus Wohlrabe, 2017. "Title characteristics and citations in economics," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1573-1578, March.
    5. Didegah, Fereshteh & Thelwall, Mike, 2013. "Which factors help authors produce the highest impact research? Collaboration, journal and document properties," Journal of Informetrics, Elsevier, vol. 7(4), pages 861-873.
    6. Danielle H. Lee, 2019. "Predictive power of conference-related factors on citation rates of conference papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 281-304, January.
    7. Michal Brzezinski, 2015. "Power laws in citation distributions: evidence from Scopus," Scientometrics, Springer;Akadémiai Kiadó, vol. 103(1), pages 213-228, April.
    8. Per O. Seglen, 1992. "The skewness of science," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 43(9), pages 628-638, October.
    9. Fatemeh Rostami & Asghar Mohammadpoorasl & Mohammad Hajizadeh, 2014. "The effect of characteristics of title on citation rates of articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 98(3), pages 2007-2010, March.
    10. Hamid R. Jamali & Mahsa Nikzad, 2011. "Article title type and its relation with the number of downloads and citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 88(2), pages 653-661, August.
    11. Bornmann, Lutz & Leydesdorff, Loet, 2017. "Skewness of citation impact data and covariates of citation distributions: A large-scale empirical analysis based on Web of Science data," Journal of Informetrics, Elsevier, vol. 11(1), pages 164-175.
    12. Thelwall, Mike & Wilson, Paul, 2014. "Regression for citation data: An evaluation of different methods," Journal of Informetrics, Elsevier, vol. 8(4), pages 963-971.
    13. Iman Tahamtan & Askar Safipour Afshar & Khadijeh Ahamdzadeh, 2016. "Factors affecting number of citations: a comprehensive review of the literature," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(3), pages 1195-1225, June.
    14. Mingfeng Lin & Henry C. Lucas & Galit Shmueli, 2013. "Research Commentary - Too Big to Fail: Large Samples and the p-Value Problem," Information Systems Research, INFORMS, vol. 24(4), pages 906-917, December.

    Citations



    Cited by:

    1. Shengzhi Huang & Jiajia Qian & Yong Huang & Wei Lu & Yi Bu & Jinqing Yang & Qikai Cheng, 2022. "Disclosing the relationship between citation structure and future impact of a publication," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(7), pages 1025-1042, July.
    2. Martorell Cunil, Onofre & Otero González, Luis & Durán Santomil, Pablo & Mulet Forteza, Carlos, 2023. "How to accomplish a highly cited paper in the tourism, leisure and hospitality field," Journal of Business Research, Elsevier, vol. 157(C).
    3. Li, Xin & Tang, Xuli & Cheng, Qikai, 2022. "Predicting the clinical citation count of biomedical papers using multilayer perceptron neural network," Journal of Informetrics, Elsevier, vol. 16(4).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Martorell Cunil, Onofre & Otero González, Luis & Durán Santomil, Pablo & Mulet Forteza, Carlos, 2023. "How to accomplish a highly cited paper in the tourism, leisure and hospitality field," Journal of Business Research, Elsevier, vol. 157(C).
    2. William S. Pearson, 2021. "Quoted speech in linguistics research article titles: patterns of use and effects on citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(4), pages 3421-3442, April.
    3. Zhijun LI & Jinfen XU, 2019. "The evolution of research article titles: the case of Journal of Pragmatics 1978–2018," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(3), pages 1619-1634, December.
    4. Sepideh Fahimifar & Khadijeh Mousavi & Fatemeh Mozaffari & Marcel Ausloos, 2023. "Identification of the most important external features of highly cited scholarly papers through 3 (i.e., Ridge, Lasso, and Boruta) feature selection data mining methods," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3685-3712, August.
    5. Zahedi, Zohreh & Haustein, Stefanie, 2018. "On the relationships between bibliographic characteristics of scientific documents and citation and Mendeley readership counts: A large-scale analysis of Web of Science publications," Journal of Informetrics, Elsevier, vol. 12(1), pages 191-202.
    6. Kong, Ling & Wang, Dongbo, 2020. "Comparison of citations and attention of cover and non-cover papers," Journal of Informetrics, Elsevier, vol. 14(4).
    7. Mike Thelwall, 2017. "Avoiding obscure topics and generalising findings produces higher impact research," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(1), pages 307-320, January.
    8. Brady D. Lund & Sanjay Kumar Maurya, 2020. "The relationship between highly-cited papers and the frequency of citations to other papers within-issue among three top information science journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2491-2504, December.
    9. Yangping Zhou, 2021. "Self-citation and citation of top journal publishers and their interpretation in the journal-discipline context," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6013-6040, July.
    10. Qianjin Zong & Yafen Xie & Rongchan Tuo & Jingshi Huang & Yang Yang, 2019. "The impact of video abstract on citation counts: evidence from a retrospective cohort study of New Journal of Physics," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1715-1727, June.
    11. Juan Xie & Kaile Gong & Jiang Li & Qing Ke & Hyonchol Kang & Ying Cheng, 2019. "A probe into 66 factors which are possibly associated with the number of citations an article received," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1429-1454, June.
    12. William S. Pearson, 2020. "Research article titles in written feedback on English as a second language writing," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 997-1019, May.
    13. Don Watson & Manfred Krug & Claus-Christian Carbon, 2022. "The relationship between citations and the linguistic traits of specific academic discourse communities identified by using social network analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1755-1781, April.
    14. Gianna Kexin Jiang & Yajun Jiang, 2023. "More diversity, more complexity, but more flexibility: research article titles in TESOL Quarterly, 1967–2022," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(7), pages 3959-3980, July.
    15. Guan, Jiancheng & Yan, Yan & Zhang, Jing Jing, 2017. "The impact of collaboration and knowledge networks on citations," Journal of Informetrics, Elsevier, vol. 11(2), pages 407-422.
    16. Feng Guo & Chao Ma & Qingling Shi & Qingqing Zong, 2018. "Succinct effect or informative effect: the relationship between title length and the number of citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(3), pages 1531-1539, September.
    17. Brito, Ricardo & Rodríguez-Navarro, Alonso, 2018. "Research assessment by percentile-based double rank analysis," Journal of Informetrics, Elsevier, vol. 12(1), pages 315-329.
    18. Thelwall, Mike & Sud, Pardeep, 2016. "National, disciplinary and temporal variations in the extent to which articles with more authors have more impact: Evidence from a geometric field normalised citation indicator," Journal of Informetrics, Elsevier, vol. 10(1), pages 48-61.
    19. Giuseppe Pernagallo, 2023. "Science in the mist: A model of asymmetric information for the research market," Metroeconomica, Wiley Blackwell, vol. 74(2), pages 390-415, May.
    20. Tehmina Amjad & Nafeesa Shahid & Ali Daud & Asma Khatoon, 2022. "Citation burst prediction in a bibliometric network," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(5), pages 2773-2790, May.
