
Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data

Authors

Listed:
  • Amparo Baíllo

    (Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain
    These authors contributed equally to this work.)

  • Aurea Grané

    (Statistics Department, Universidad Carlos III de Madrid, 28903 Getafe, Spain
    These authors contributed equally to this work.)

Abstract

The distance-based linear model (DB-LM) extends classical linear regression to settings with mixed-type predictors, or where the only available information is a matrix of distances between regressors (as sometimes happens with big data). The main drawback of DB methods is their computational cost, which is driven largely by the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques offer a useful alternative to fitting the model on the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two large real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes.
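
To make the idea in the abstract concrete, here is a minimal sketch of a distance-based linear fit combined with subsampling and averaging. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sq_dists`, `fit_dblm`, `predict_dblm`, `bagged_dblm_predict`) are hypothetical, plain Euclidean distance on numeric data stands in for a mixed-type distance such as Gower's, and simple averaging over subsamples stands in for the aggregation techniques the paper compares. The point it shows is that each fit eigendecomposes only an m x m Gram matrix instead of the full n x n one.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B.
    Assumption: a stand-in for a mixed-type distance (e.g. Gower's)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sum(diff ** 2, axis=2)

def fit_dblm(D2, y, k):
    """Regress y on the first k principal coordinates of the
    squared-distance matrix D2 (classical multidimensional scaling)."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    G = -0.5 * J @ D2 @ J                     # doubly centered Gram matrix
    evals, evecs = np.linalg.eigh(G)          # the costly O(n^3) step
    top = np.argsort(evals)[::-1][:k]         # k largest eigenvalues
    lam = evals[top]                          # assumed strictly positive
    X = evecs[:, top] * np.sqrt(lam)          # principal coordinates
    y_mean = y.mean()
    beta = X.T @ (y - y_mean) / lam           # OLS, since X'X = diag(lam)
    return {"X": X, "lam": lam, "g": np.diag(G), "beta": beta, "y_mean": y_mean}

def predict_dblm(model, d2_new):
    """Predict from squared distances (n_new x n_train) to the training
    points, using Gower's interpolation formula for the new coordinates."""
    X_new = 0.5 * (model["g"] - d2_new) @ model["X"] / model["lam"]
    return model["y_mean"] + X_new @ model["beta"]

def bagged_dblm_predict(Z_train, y_train, Z_test, k, m, B, rng):
    """Average the predictions of B models, each fitted on a random
    subsample of size m drawn without replacement."""
    preds = np.zeros(Z_test.shape[0])
    for _ in range(B):
        idx = rng.choice(Z_train.shape[0], size=m, replace=False)
        Zs = Z_train[idx]
        model = fit_dblm(sq_dists(Zs, Zs), y_train[idx], k)
        preds += predict_dblm(model, sq_dists(Z_test, Zs))
    return preds / B

# Synthetic check: a noiseless linear response on 3 numeric predictors.
rng = np.random.default_rng(0)
coef = np.array([2.0, -1.0, 0.5])
Z = rng.normal(size=(2000, 3))
y = 1.0 + Z @ coef
Z_test = rng.normal(size=(50, 3))
y_test = 1.0 + Z_test @ coef

y_hat = bagged_dblm_predict(Z, y, Z_test, k=3, m=200, B=10, rng=rng)
print(np.max(np.abs(y_hat - y_test)))  # maximum absolute prediction error
```

With Euclidean distances and k equal to the number of predictors, the distance-based fit coincides with ordinary least squares, so the synthetic check should recover the noiseless response up to rounding error. With a genuine mixed-type distance, only `sq_dists` would need to change, and k would become a tuning parameter.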

Suggested Citation

  • Amparo Baíllo & Aurea Grané, 2021. "Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data," Mathematics, MDPI, vol. 9(18), pages 1-17, September.
  • Handle: RePEc:gam:jmathe:v:9:y:2021:i:18:p:2247-:d:634318

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/9/18/2247/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/9/18/2247/
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alban Mbina Mbina & Guy Martial Nkiet & Fulgence Eyi Obiang, 2019. "Variable selection in discriminant analysis for mixed continuous-binary variables and several groups," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(3), pages 773-795, September.
    2. Leila Amiri & Mojtaba Khazaei & Mojtaba Ganjali, 2017. "General location model with factor analyzer covariance matrix structure and its applications," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 11(3), pages 593-609, September.
    3. Miguel Angel Ortíz-Barrios & Matias Garcia-Constantino & Chris Nugent & Isaac Alfaro-Sarmiento, 2022. "A Novel Integration of IF-DEMATEL and TOPSIS for the Classifier Selection Problem in Assistive Technology Adoption for People with Dementia," IJERPH, MDPI, vol. 19(3), pages 1-31, January.
    4. Peiró-Signes, Ángel & Segarra-Oña, Marival & Trull-Domínguez, Óscar & Sánchez-Planelles, Joaquín, 2022. "Exposing the ideal combination of endogenous–exogenous drivers for companies’ ecoinnovative orientation: Results from machine-learning methods," Socio-Economic Planning Sciences, Elsevier, vol. 79(C).
    5. Beibei Yuan & Willem Heiser & Mark Rooij, 2019. "The δ-Machine: Classification Based on Distances Towards Prototypes," Journal of Classification, Springer;The Classification Society, vol. 36(3), pages 442-470, October.
    6. S. Barahona & P. Centella & X. Gual-Arnau & M. V. Ibáñez & A. Simó, 2020. "Supervised classification of geometrical objects by integrating currents and functional data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(3), pages 637-660, September.
    7. Galati, Antonino & Coticchio, Alessandro & Peiró-Signes, Ángel, 2023. "Identifying the factors affecting citizens' willingness to participate in urban forest governance: Evidence from the municipality of Palermo, Italy," Forest Policy and Economics, Elsevier, vol. 155(C).
    8. Aurea Grané & Alpha A. Sow-Barry, 2021. "Visualizing Profiles of Large Datasets of Weighted and Mixed Data," Mathematics, MDPI, vol. 9(8), pages 1-20, April.
    9. Bhat, Chandra R., 2015. "A new generalized heterogeneous data model (GHDM) to jointly model mixed types of dependent variables," Transportation Research Part B: Methodological, Elsevier, vol. 79(C), pages 50-77.
