Printed from https://ideas.repec.org/a/gam/jmathe/v11y2023i1p228-d1022691.html

Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

Author

Listed:
  • Sergio Bolívar

    (Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain)

  • Alicia Nieto-Reyes

    (Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain)

  • Heather L. Rogers

    (Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain;
    IKERBASQUE, Basque Foundation for Science, 48013 Bilbao, Spain)

Abstract

This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency–inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth consists of taking the inverse discrete Fourier transform of the vectorized text fragments and then applying a statistical depth for functional data, D. This depth is intended to address the sparsity of numerical features that results from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. Indeed, this sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. To demonstrate the potential value of this new proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. It is shown that the DD^G-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth.
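
The pipeline described in the abstract (frequency-based vectorization, inverse discrete Fourier transform, then a functional depth D) can be illustrated with a minimal sketch. It is not the authors' implementation: the tf-idf variant, the choice to keep only the real part of the inverse transform, the use of the Fraiman-Muniz integrated depth as D, and all function names and toy counts are assumptions made for illustration. The DD^G-classifier itself is not reproduced; the sketch only computes the depth of each document with respect to each class, which are the coordinates a DD-plot is built from.

```python
import numpy as np

def tfidf(counts):
    """Plain tf-idf for a (documents x terms) count matrix (one common variant)."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)
    idf = np.log((1.0 + counts.shape[0]) / (1.0 + df)) + 1.0
    return tf * idf

def to_curves(vectors):
    """Inverse DFT of each row. Keeping the real part is an assumption of this sketch."""
    return np.real(np.fft.ifft(vectors, axis=1))

def fraiman_muniz_depth(curves, sample):
    """Integrated depth of each row of `curves` with respect to the rows of `sample`."""
    # Empirical CDF of the sample at every grid point, evaluated at each curve's value.
    ecdf = (sample[None, :, :] <= curves[:, None, :]).mean(axis=1)
    # Pointwise univariate depth 1 - |1/2 - F(x(t))|, averaged over the grid.
    return (1.0 - np.abs(0.5 - ecdf)).mean(axis=1)

# Hypothetical toy corpus: 6 documents, 6 terms, two classes of 3 documents each.
counts = np.array([
    [2, 0, 0, 1, 0, 0],
    [1, 0, 0, 2, 0, 0],
    [3, 0, 0, 1, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
])
labels = np.array([0, 0, 0, 1, 1, 1])

sparse_vectors = tfidf(counts)       # most entries are zero
curves = to_curves(sparse_vectors)   # dense "functional" representation

# DD-plot coordinates: depth of every document with respect to each class.
dd = np.column_stack([
    fraiman_muniz_depth(curves, curves[labels == k]) for k in (0, 1)
])
print(dd)
```

Each row of `dd` holds one document's depth in class 0 and in class 1. A DD-type classifier then learns a separating rule in this low-dimensional depth space instead of working directly on the sparse term vectors.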

Suggested Citation

  • Sergio Bolívar & Alicia Nieto-Reyes & Heather L. Rogers, 2023. "Statistical Depth for Text Data: An Application to the Classification of Healthcare Data," Mathematics, MDPI, vol. 11(1), pages 1-20, January.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:1:p:228-:d:1022691

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/1/228/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/1/228/
    Download Restriction: no

    References listed on IDEAS

    1. Jun Li & Juan A. Cuesta-Albertos & Regina Y. Liu, 2012. "DD-Classifier: Nonparametric Classification Procedure Based on DD-Plot," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 107(498), pages 737-753, June.
    2. Sergio Bolívar & Alicia Nieto-Reyes & Heather L. Rogers, 2022. "Supervised Classification of Healthcare Text Data Based on Context-Defined Categories," Mathematics, MDPI, vol. 10(12), pages 1-31, June.
    3. Daniel Hlubinka & Irène Gijbels & Marek Omelka & Stanislav Nagy, 2015. "Integrated data depth for smooth functions and its application in supervised classification," Computational Statistics, Springer, vol. 30(4), pages 1011-1031, December.
    4. Hornik, Kurt & Feinerer, Ingo & Kober, Martin & Buchta, Christian, 2012. "Spherical k-Means Clustering," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 50(i10).
    5. Antonio Cuevas & Manuel Febrero & Ricardo Fraiman, 2007. "Robust estimation and classification for functional data via projection-based depth notions," Computational Statistics, Springer, vol. 22(3), pages 481-496, September.
    6. Ricardo Fraiman & Graciela Muniz, 2001. "Trimmed means for functional data," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 10(2), pages 419-440, December.
    7. Cuesta-Albertos, J.A. & Nieto-Reyes, A., 2008. "The random Tukey depth," Computational Statistics & Data Analysis, Elsevier, vol. 52(11), pages 4979-4988, July.
    8. Noha Alnazzawi & Najlaa Alsaedi & Fahad Alharbi & Najla Alaswad, 2022. "Using Social Media to Detect Fake News Information Related to Product Marketing: The FakeAds Corpus," Data, MDPI, vol. 7(4), pages 1-13, April.
    9. Christopher Haynes & Marco A. Palomino & Liz Stuart & David Viira & Frances Hannon & Gemma Crossingham & Kate Tantam, 2022. "Automatic Classification of National Health Service Feedback," Mathematics, MDPI, vol. 10(6), pages 1-23, March.

    Citations



    Cited by:

    1. Manlin Chen & Zhijie Zhou & Xiaoxia Han & Zhichao Feng, 2023. "A Text-Oriented Fault Diagnosis Method for Electromechanical Device Based on Belief Rule Base," Mathematics, MDPI, vol. 11(8), pages 1-25, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Carlo Sguera & Sara López-Pintado, 2021. "A notion of depth for sparse functional data," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 30(3), pages 630-649, September.
    2. Karl Mosler & Pavlo Mozharovskyi, 2017. "Fast DD-classification of functional data," Statistical Papers, Springer, vol. 58(4), pages 1055-1089, December.
    3. repec:cte:wsrepe:24606 is not listed on IDEAS
    4. Carlo Sguera & Pedro Galeano & Rosa Lillo, 2014. "Spatial depth-based classification for functional data," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 23(4), pages 725-750, December.
    5. Nieto-Reyes, Alicia & Battey, Heather, 2021. "A topologically valid construction of depth for functional data," Journal of Multivariate Analysis, Elsevier, vol. 184(C).
    6. Miguel Flores & Salvador Naya & Rubén Fernández-Casal & Sonia Zaragoza & Paula Raña & Javier Tarrío-Saavedra, 2020. "Constructing a Control Chart Using Functional Data," Mathematics, MDPI, vol. 8(1), pages 1-26, January.
    7. Alba M. Franco-Pereira & Rosa E. Lillo, 2020. "Rank tests for functional data based on the epigraph, the hypograph and associated graphical representations," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(3), pages 651-676, September.
    8. Daniel Kosiorowski & Jerzy P. Rydlewski & Małgorzata Snarska, 2019. "Detecting a structural change in functional time series using local Wilcoxon statistic," Statistical Papers, Springer, vol. 60(5), pages 1677-1698, October.
    9. Fabrizio Maturo & Rosanna Verde, 2023. "Supervised classification of curves via a combined use of functional data analysis and tree-based methods," Computational Statistics, Springer, vol. 38(1), pages 419-459, March.
    10. López-Pintado, Sara & Romo, Juan, 2011. "A half-region depth for functional data," Computational Statistics & Data Analysis, Elsevier, vol. 55(4), pages 1679-1695, April.
    11. Serfling, Robert & Wijesuriya, Uditha, 2017. "Depth-based nonparametric description of functional data, with emphasis on use of spatial depth," Computational Statistics & Data Analysis, Elsevier, vol. 105(C), pages 24-45.
    12. Marco Grasso & Bianca Maria Colosimo & Fugee Tsung, 2017. "A phase I multi-modelling approach for profile monitoring of signal data," International Journal of Production Research, Taylor & Francis Journals, vol. 55(15), pages 4354-4377, August.
    13. Olusola Samuel Makinde, 2019. "Classification rules based on distribution functions of functional depth," Statistical Papers, Springer, vol. 60(3), pages 629-640, June.
    14. Cleveland, Jason & Zhao, Weilong & Wu, Wei, 2018. "Robust template estimation for functional data with phase variability using band depth," Computational Statistics & Data Analysis, Elsevier, vol. 125(C), pages 10-26.
    15. J. A. Cuesta-Albertos & M. Febrero-Bande & M. Oviedo de la Fuente, 2017. "The DD^G-classifier in the functional setting," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(1), pages 119-142, March.
    16. repec:cte:wsrepe:24615 is not listed on IDEAS
    17. Tian, Yahui & Gel, Yulia R., 2019. "Fusing data depth with complex networks: Community detection with prior information," Computational Statistics & Data Analysis, Elsevier, vol. 139(C), pages 99-116.
    18. Graciela Estévez-Pérez & Philippe Vieu, 2021. "A new way for ranking functional data with applications in diagnostic test," Computational Statistics, Springer, vol. 36(1), pages 127-154, March.
    19. Anirvan Chakraborty & Probal Chaudhuri, 2014. "On data depth in infinite dimensional spaces," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 66(2), pages 303-324, April.
    20. Elías, Antonio & Jiménez, Raúl & Shang, Han Lin, 2022. "On projection methods for functional time series forecasting," Journal of Multivariate Analysis, Elsevier, vol. 189(C).
    21. Agostinelli, Claudio, 2018. "Local half-region depth for functional data," Journal of Multivariate Analysis, Elsevier, vol. 163(C), pages 67-79.
    22. Mia Hubert & Peter Rousseeuw & Pieter Segaert, 2015. "Multivariate functional outlier detection," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(2), pages 177-202, July.
