
Gender identification on Twitter

Author

Listed:
  • Catherine Ikae
  • Jacques Savoy

Abstract

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n‐grams), leading to a huge number of stylistic markers. To predict the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k‐nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF‐PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2‐stage feature selection that reduces the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
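
The abstract does not spell out what the two selection stages are, so the following is only a minimal sketch of the general idea of a 2-stage feature selection, not the authors' method. It assumes, purely for illustration, that stage 1 drops rare terms by document frequency and stage 2 ranks the survivors by how differently the two classes use them. The function name `two_stage_selection`, the parameters `min_df` and `top_k`, the scoring rule, and the toy data are all hypothetical.

```python
from collections import Counter


def two_stage_selection(docs, labels, min_df=2, top_k=300):
    """Hypothetical 2-stage feature selection over tokenized documents.

    Stage 1 (assumed): drop terms found in fewer than `min_df` documents.
    Stage 2 (assumed): rank the remaining terms by the absolute difference
    between the share of documents containing the term in each class.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")

    # Document frequencies, overall and per class (one count per document).
    df = Counter()
    df_by_class = {c: Counter() for c in classes}
    n_by_class = Counter(labels)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[label][term] += 1

    # Stage 1: frequency filter.
    candidates = [term for term, n in df.items() if n >= min_df]

    # Stage 2: discrimination score between the two classes.
    a, b = classes

    def score(term):
        return abs(df_by_class[a][term] / n_by_class[a]
                   - df_by_class[b][term] / n_by_class[b])

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-score(t), t))
    return ranked[:top_k]


# Invented toy data with neutral class labels "A" and "B".
docs = [
    ["rt", "so", "happy", "today"],
    ["rt", "so", "excited", "today"],
    ["rt", "the", "match", "tonight"],
    ["rt", "the", "game", "tonight"],
]
labels = ["A", "A", "B", "B"]

selected = two_stage_selection(docs, labels, min_df=2, top_k=4)
print(selected)
```

In this toy run, terms seen in only one document are dropped in stage 1, and the token shared equally by both classes ranks last in stage 2, so it falls outside the top 4. A real pipeline would feed the retained terms to whichever classifier is being compared.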

Suggested Citation

  • Catherine Ikae & Jacques Savoy, 2022. "Gender identification on Twitter," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(1), pages 58-69, January.
  • Handle: RePEc:bla:jinfst:v:73:y:2022:i:1:p:58-69
    DOI: 10.1002/asi.24541

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.24541
    Download Restriction: no


    References listed on IDEAS

    1. Friedman, Jerome H., 2002. "Stochastic gradient boosting," Computational Statistics & Data Analysis, Elsevier, vol. 38(4), pages 367-378, February.
    2. Donna Harman, 1991. "How effective is suffixing?," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 42(1), pages 7-15, January.
    3. Sasa Adamovic & Vladislav Miskovic & Milan Milosavljevic & Marko Sarac & Mladen Veinovic, 2019. "Automated language‐independent authorship verification (for Indo‐European languages)," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 70(8), pages 858-871, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mansoor, Umer & Jamal, Arshad & Su, Junbiao & Sze, N.N. & Chen, Anthony, 2023. "Investigating the risk factors of motorcycle crash injury severity in Pakistan: Insights and policy recommendations," Transport Policy, Elsevier, vol. 139(C), pages 21-38.
    2. Matthew Smith & Francisco Alvarez, 2022. "Predicting Firm-Level Bankruptcy in the Spanish Economy Using Extreme Gradient Boosting," Computational Economics, Springer;Society for Computational Economics, vol. 59(1), pages 263-295, January.
    3. Peiró-Signes, Ángel & Segarra-Oña, Marival & Trull-Domínguez, Óscar & Sánchez-Planelles, Joaquín, 2022. "Exposing the ideal combination of endogenous–exogenous drivers for companies’ ecoinnovative orientation: Results from machine-learning methods," Socio-Economic Planning Sciences, Elsevier, vol. 79(C).
    4. Richard Berk, 2019. "Accuracy and Fairness for Juvenile Justice Risk Assessments," Journal of Empirical Legal Studies, John Wiley & Sons, vol. 16(1), pages 175-194, March.
    5. Robert Suchting & Michael S. Businelle & Stephen W. Hwang & Nikhil S. Padhye & Yijiong Yang & Diane M. Santa Maria, 2020. "Predicting Daily Sheltering Arrangements among Youth Experiencing Homelessness Using Diary Measurements Collected by Ecological Momentary Assessment," IJERPH, MDPI, vol. 17(18), pages 1-17, September.
    6. Müller, Daniel & Leitão, Pedro J. & Sikor, Thomas, 2013. "Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees," Agricultural Systems, Elsevier, vol. 117(C), pages 66-77.
    7. Bissan Ghaddar & Ignacio Gómez-Casares & Julio González-Díaz & Brais González-Rodríguez & Beatriz Pateiro-López & Sofía Rodríguez-Ballesteros, 2023. "Learning for Spatial Branching: An Algorithm Selection Approach," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1024-1043, September.
    8. Huang Lin & Merete Eggesbø & Shyamal Das Peddada, 2022. "Linear and nonlinear correlation estimators unveil undescribed taxa interactions in microbiome data," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
    9. Akash Malhotra, 2018. "A hybrid econometric-machine learning approach for relative importance analysis: Prioritizing food policy," Papers 1806.04517, arXiv.org, revised Aug 2020.
    10. Somodi, Imelda & Bede-Fazekas, Ákos & Botta-Dukát, Zoltán & Molnár, Zsolt, 2024. "Confidence and consistency in discrimination: A new family of evaluation metrics for potential distribution models," Ecological Modelling, Elsevier, vol. 491(C).
    11. María Jesús Segovia‐Vargas & I. Marta Miranda‐García & Freddy Alejandro Oquendo‐Torres, 2023. "Sustainable finance: The role of savings and credit cooperatives in Ecuador," Annals of Public and Cooperative Economics, Wiley Blackwell, vol. 94(3), pages 951-980, September.
    12. Tesfamariam Engida Mengesha & Lulseged Tamene Desta & Paolo Gamba & Getachew Tesfaye Ayehu, 2024. "Multi-Temporal Passive and Active Remote Sensing for Agricultural Mapping and Acreage Estimation in Context of Small Farm Holds in Ethiopia," Land, MDPI, vol. 13(3), pages 1-29, March.
    13. Junming Liu & Mingfei Teng & Weiwei Chen & Hui Xiong, 2023. "A Cost-Effective Sequential Route Recommender System for Taxi Drivers," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1098-1119, September.
    14. Simon Sosvilla-Rivero & Pedro Rodriguez, 2010. "Linkages in international stock markets: evidence from a classification procedure," Applied Economics, Taylor & Francis Journals, vol. 42(16), pages 2081-2089.
    15. Nahushananda Chakravarthy H G & Karthik M Seenappa & Sujay Raghavendra Naganna & Dayananda Pruthviraja, 2023. "Machine Learning Models for the Prediction of the Compressive Strength of Self-Compacting Concrete Incorporating Incinerated Bio-Medical Waste Ash," Sustainability, MDPI, vol. 15(18), pages 1-22, September.
    16. Marlene A. Smith & Murray J. Côté, 2022. "Predictive Analytics Improves Sales Forecasts for a Pop-up Retailer," Interfaces, INFORMS, vol. 52(4), pages 379-389, July.
    17. Tim Voigt & Martin Kohlhase & Oliver Nelles, 2021. "Incremental DoE and Modeling Methodology with Gaussian Process Regression: An Industrially Applicable Approach to Incorporate Expert Knowledge," Mathematics, MDPI, vol. 9(19), pages 1-26, October.
    18. Wen, Shaoting & Buyukada, Musa & Evrendilek, Fatih & Liu, Jingyong, 2020. "Uncertainty and sensitivity analyses of co-combustion/pyrolysis of textile dyeing sludge and incense sticks: Regression and machine-learning models," Renewable Energy, Elsevier, vol. 151(C), pages 463-474.
    19. Zhu, Haibin & Bai, Lu & He, Lidan & Liu, Zhi, 2023. "Forecasting realized volatility with machine learning: Panel data perspective," Journal of Empirical Finance, Elsevier, vol. 73(C), pages 251-271.
    20. Spiliotis, Evangelos & Makridakis, Spyros & Kaltsounis, Anastasios & Assimakopoulos, Vassilios, 2021. "Product sales probabilistic forecasting: An empirical evaluation using the M5 competition data," International Journal of Production Economics, Elsevier, vol. 240(C).
