Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i11p1812-d823692.html

Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets

Author

Listed:
  • Maria D. Gonzalez-Lima

    (Departamento de Matemáticas y Estadística, Universidad del Norte, Barranquilla 081007, Colombia
    These authors contributed equally to this work.)

  • Carenne C. Ludeña

    (Matrix CPM Solutions, Crr 15 93A 84, Bogotá 110221, Colombia
    These authors contributed equally to this work.)

Abstract

We propose a novel method using Locality-Sensitive Hashing (LSH) for solving the optimization problem that arises in the training stage of support vector machines for large data sets, possibly in high dimensions. LSH was introduced as an efficient way to search for neighbors in high-dimensional spaces. Random projection-based LSH functions create bins such that, with high probability, points belonging to the same bin are close, while points that are far apart do not fall in the same bin. Given these bins, it is not necessary to consider the whole original set, only representatives from each bin, which reduces the effective size of the data set. A key aspect of our proposal is that we work in the feature space and use the projections only to search for closeness in this space. Moreover, instead of choosing the projection directions at random, we sample a small subset and solve the associated SVM problem. Projecting in the resulting direction allows for a more precise sample in many cases, and an approximation of the solution of the large problem is found in a fraction of the running time with small degradation of the classification error. We present two algorithms, theoretical support, and numerical experiments showing their performance on real-life problems taken from the LIBSVM database.
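
The binning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it works in the input space rather than the feature space, it uses a single random projection direction, and the helper names `lsh_bin` and `pick_representatives`, the bin width, and the synthetic two-class data are all assumptions made for illustration. In the paper's approach, the random direction would presumably be replaced by one obtained from solving the SVM problem on a small subsample.

```python
import math
import random
from collections import defaultdict


def lsh_bin(point, direction, width, offset=0.0):
    """Hash a point to an integer bin along one projection direction."""
    proj = sum(p * d for p, d in zip(point, direction))
    return math.floor((proj + offset) / width)


def pick_representatives(X, y, direction, width):
    """Group points by (bin, label); keep the point nearest each group's mean."""
    groups = defaultdict(list)
    for i, (x, label) in enumerate(zip(X, y)):
        groups[(lsh_bin(x, direction, width), label)].append(i)
    dim = len(X[0])
    reps = []
    for members in groups.values():
        mean = [sum(X[i][k] for i in members) / len(members) for k in range(dim)]
        reps.append(min(members, key=lambda i: math.dist(X[i], mean)))
    return sorted(reps)


# Synthetic two-class data (hypothetical, for illustration only).
random.seed(0)
dim = 5
X = [[random.gauss(-2.0, 1.0) for _ in range(dim)] for _ in range(500)]
X += [[random.gauss(2.0, 1.0) for _ in range(dim)] for _ in range(500)]
y = [-1] * 500 + [1] * 500

# A random unit direction; an SVM weight vector from a small subsample
# could be substituted here.
raw = [random.gauss(0.0, 1.0) for _ in range(dim)]
norm = math.sqrt(sum(v * v for v in raw))
direction = [v / norm for v in raw]

reps = pick_representatives(X, y, direction, width=0.5)
print(len(reps), "representatives out of", len(X))
```

The reduced set `[X[i] for i in reps]` with labels `[y[i] for i in reps]` could then be passed to any SVM solver in place of the full data. Keeping one representative per bin and class, rather than per bin only, is a design choice of this sketch so that both classes remain represented near the boundary.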

Suggested Citation

  • Maria D. Gonzalez-Lima & Carenne C. Ludeña, 2022. "Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets," Mathematics, MDPI, vol. 10(11), pages 1-21, May.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:11:p:1812-:d:823692

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/11/1812/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/11/1812/
    Download Restriction: no

    References listed on IDEAS

    1. S. Camelo & M. González-Lima & A. Quiroz, 2015. "Nearest neighbors methods for support vector machines," Annals of Operations Research, Springer, vol. 235(1), pages 85-101, December.
    2. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).

    Citations



    Cited by:

    1. Junya Tang & Kuo-Yi Lin & Li Li, 2022. "Using Domain Adaptation for Incremental SVM Classification of Drift Data," Mathematics, MDPI, vol. 10(19), pages 1-17, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tsukioka, Yasutomo & Yanagi, Junya & Takada, Teruko, 2018. "Investor sentiment extracted from internet stock message boards and IPO puzzles," International Review of Economics & Finance, Elsevier, vol. 56(C), pages 205-217.
    2. Daniel J. Luckett & Eric B. Laber & Samer S. El‐Kamary & Cheng Fan & Ravi Jhaveri & Charles M. Perou & Fatma M. Shebl & Michael R. Kosorok, 2021. "Receiver operating characteristic curves and confidence bands for support vector machines," Biometrics, The International Biometric Society, vol. 77(4), pages 1422-1430, December.
    3. Grabisch, Michel & Kojadinovic, Ivan & Meyer, Patrick, 2008. "A review of methods for capacity identification in Choquet integral based multi-attribute utility theory: Applications of the Kappalab R package," European Journal of Operational Research, Elsevier, vol. 186(2), pages 766-785, April.
    4. Bellotti, Anthony & Brigo, Damiano & Gambetti, Paolo & Vrins, Frédéric, 2021. "Forecasting recovery rates on non-performing loans with machine learning," International Journal of Forecasting, Elsevier, vol. 37(1), pages 428-444.
    5. Riza, Lala Septem & Bergmeir, Christoph & Herrera, Francisco & Benítez, José M., 2015. "frbs: Fuzzy Rule-Based Systems for Classification and Regression in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 65(i06).
    6. Karin Wolffhechel & Amanda C Hahn & Hanne Jarmer & Claire I Fisher & Benedict C Jones & Lisa M DeBruine, 2015. "Testing the Utility of a Data-Driven Approach for Assessing BMI from Face Images," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-10, October.
    7. Andrea S Martinez-Vernon & James A Covington & Ramesh P Arasaradnam & Siavash Esfahani & Nicola O’Connell & Ioannis Kyrou & Richard S Savage, 2018. "An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-20, September.
    8. Khamma, Thulasi Ram & Zhang, Yuming & Guerrier, Stéphane & Boubekri, Mohamed, 2020. "Generalized additive models: An efficient method for short-term energy prediction in office buildings," Energy, Elsevier, vol. 213(C).
    9. Madhumita Sahoo & Aman Kasot & Anirban Dhar & Amlanjyoti Kar, 2018. "On Predictability of Groundwater Level in Shallow Wells Using Satellite Observations," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 32(4), pages 1225-1244, March.
    10. P. J. Zarco-Tejada & T. Poblete & C. Camino & V. Gonzalez-Dugo & R. Calderon & A. Hornero & R. Hernandez-Clemente & M. Román-Écija & M. P. Velasco-Amo & B. B. Landa & P. S. A. Beck & M. Saponari & D. , 2021. "Divergent abiotic spectral pathways unravel pathogen stress signals across species," Nature Communications, Nature, vol. 12(1), pages 1-11, December.
    11. Grubinger, Thomas & Zeileis, Achim & Pfeiffer, Karl-Peter, 2014. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i01).
    12. Uwe Ligges & Sebastian Krey, 2011. "Feature clustering for instrument classification," Computational Statistics, Springer, vol. 26(2), pages 279-291, June.
    13. Arnout Van Messem & Andreas Christmann, 2010. "A review on consistency and robustness properties of support vector machines for heavy-tailed distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 4(2), pages 199-220, September.
    14. Jacobi Liana & Kwok Chun Fung & Ramírez-Hassan Andrés & Nghiem Nhung, 2024. "Posterior Manifolds over Prior Parameter Regions: Beyond Pointwise Sensitivity Assessments for Posterior Statistics from MCMC Inference," Studies in Nonlinear Dynamics & Econometrics, De Gruyter, vol. 28(2), pages 403-434, April.
    15. Nunes, Matthew, 2015. "Statistical Analysis of Network Data with R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 66(b01).
    16. Ana Patrícia Rocha & Hugo Miguel Pereira Choupina & Maria do Carmo Vilas-Boas & José Maria Fernandes & João Paulo Silva Cunha, 2018. "System for automatic gait analysis based on a single RGB-D camera," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-24, August.
    17. Samir K. Safi & Sheema Gul, 2024. "An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance," Mathematics, MDPI, vol. 12(20), pages 1-17, October.
    18. Maria-Carmen García-Centeno & Román Mínguez-Salido & Raúl del Pozo-Rubio, 2021. "The Classification of Profiles of Financial Catastrophe Caused by Out-of-Pocket Payments: A Methodological Approach," Mathematics, MDPI, vol. 9(11), pages 1-20, May.
    19. Yasset Perez-Riverol & Max Kuhn & Juan Antonio Vizcaíno & Marc-Phillip Hitz & Enrique Audain, 2017. "Accurate and fast feature selection workflow for high-dimensional omics data," PLOS ONE, Public Library of Science, vol. 12(12), pages 1-14, December.
    20. Heungsun Hwang & Gyeongcheol Cho, 2020. "Global Least Squares Path Modeling: A Full-Information Alternative to Partial Least Squares Path Modeling," Psychometrika, Springer;The Psychometric Society, vol. 85(4), pages 947-972, December.
