Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i11p1812-d823692.html

Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets

Author

Listed:
  • Maria D. Gonzalez-Lima

    (Departamento de Matemáticas y Estadística, Universidad del Norte, Barranquilla 081007, Colombia
    These authors contributed equally to this work.)

  • Carenne C. Ludeña

    (Matrix CPM Solutions, Crr 15 93A 84, Bogotá 110221, Colombia
    These authors contributed equally to this work.)

Abstract

We propose a novel method using Locality-Sensitive Hashing (LSH) for solving the optimization problem that arises in the training stage of support vector machines for large data sets, possibly in high dimensions. LSH was introduced as an efficient way to look for neighbors in high-dimensional spaces. Random projection-based LSH functions create bins such that, with high probability, points belonging to the same bin are close to each other, while points that are far apart do not fall in the same bin. Based on these bins, it is not necessary to consider the whole original set, only representatives from each bin, which reduces the effective size of the data set. A key aspect of our proposal is that we work in the feature space and use only the projections to search for closeness in this space. Moreover, instead of choosing the projection directions at random, we sample a small subset and solve the associated SVM problem. Projections in this direction allow for a more precise sample in many cases, and an approximation of the solution of the large problem is found in a fraction of the running time with small degradation of the classification error. We present two algorithms, theoretical support, and numerical experiments showing their performance on real-life problems taken from the LIBSVM database.
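
The binning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses generic random-projection hashing of the form h(x) = floor((a·x + b) / w) in the input space, and takes the per-bin mean as the representative. The function names (`make_hash`, `reduce_by_bins`), the bin width, the number of projections, and the synthetic data are all assumptions made for illustration. According to the abstract, the paper instead works in the feature space and replaces the random directions with one obtained by solving an SVM on a small subsample.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, n_proj, width, rng):
    """Build a random-projection hash: h(x) = floor((a.x + b) / width) per projection."""
    # Hypothetical parameters: Gaussian directions and uniform offsets in [0, width).
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_proj)]
    offsets = [rng.uniform(0.0, width) for _ in range(n_proj)]

    def h(x):
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) / width)
            for a, b in zip(directions, offsets)
        )

    return h


def reduce_by_bins(points, labels, h):
    """Keep one representative (the mean) for each (bin, label) pair."""
    bins = defaultdict(list)
    for x, y in zip(points, labels):
        bins[(h(x), y)].append(x)
    reps, rep_labels = [], []
    for (_, y), members in bins.items():
        dim = len(members[0])
        mean = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        reps.append(mean)
        rep_labels.append(y)
    return reps, rep_labels


rng = random.Random(0)
# Two synthetic Gaussian classes in 2D (illustrative data only).
points = [[rng.gauss(-2.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
points += [[rng.gauss(2.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
labels = [-1] * 500 + [1] * 500

h = make_hash(dim=2, n_proj=2, width=1.0, rng=rng)
reps, rep_labels = reduce_by_bins(points, labels, h)
print(len(points), "->", len(reps))  # size of the reduced training set
```

The reduced set `reps`, `rep_labels` could then be passed to any SVM solver in place of the full data. Binning per class label keeps representatives from mixing the two classes, which is a design choice of this sketch rather than something stated in the abstract.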

Suggested Citation

  • Maria D. Gonzalez-Lima & Carenne C. Ludeña, 2022. "Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets," Mathematics, MDPI, vol. 10(11), pages 1-21, May.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:11:p:1812-:d:823692

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/11/1812/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/11/1812/
    Download Restriction: no

    References listed on IDEAS

    1. S. Camelo & M. González-Lima & A. Quiroz, 2015. "Nearest neighbors methods for support vector machines," Annals of Operations Research, Springer, vol. 235(1), pages 85-101, December.
    2. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).

    Citations

    Cited by:

    1. Junya Tang & Kuo-Yi Lin & Li Li, 2022. "Using Domain Adaptation for Incremental SVM Classification of Drift Data," Mathematics, MDPI, vol. 10(19), pages 1-17, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tsukioka, Yasutomo & Yanagi, Junya & Takada, Teruko, 2018. "Investor sentiment extracted from internet stock message boards and IPO puzzles," International Review of Economics & Finance, Elsevier, vol. 56(C), pages 205-217.
    2. Andrea S Martinez-Vernon & James A Covington & Ramesh P Arasaradnam & Siavash Esfahani & Nicola O’Connell & Ioannis Kyrou & Richard S Savage, 2018. "An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-20, September.
    3. Madhumita Sahoo & Aman Kasot & Anirban Dhar & Amlanjyoti Kar, 2018. "On Predictability of Groundwater Level in Shallow Wells Using Satellite Observations," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 32(4), pages 1225-1244, March.
    4. P. J. Zarco-Tejada & T. Poblete & C. Camino & V. Gonzalez-Dugo & R. Calderon & A. Hornero & R. Hernandez-Clemente & M. Román-Écija & M. P. Velasco-Amo & B. B. Landa & P. S. A. Beck & M. Saponari & D. , 2021. "Divergent abiotic spectral pathways unravel pathogen stress signals across species," Nature Communications, Nature, vol. 12(1), pages 1-11, December.
    5. Grubinger, Thomas & Zeileis, Achim & Pfeiffer, Karl-Peter, 2014. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i01).
    6. Uwe Ligges & Sebastian Krey, 2011. "Feature clustering for instrument classification," Computational Statistics, Springer, vol. 26(2), pages 279-291, June.
    7. Arnout Van Messem & Andreas Christmann, 2010. "A review on consistency and robustness properties of support vector machines for heavy-tailed distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 4(2), pages 199-220, September.
    8. Ana Patrícia Rocha & Hugo Miguel Pereira Choupina & Maria do Carmo Vilas-Boas & José Maria Fernandes & João Paulo Silva Cunha, 2018. "System for automatic gait analysis based on a single RGB-D camera," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-24, August.
    9. Huisheng Wu & Maogui Hu & Yaping Zhang & Yuan Han, 2021. "An Empirical Mode Decomposition for Establishing Spatiotemporal Air Quality Trends in Shandong Province, China," Sustainability, MDPI, vol. 13(22), pages 1-10, November.
    10. Shaobo Jin & Sebastian Ankargren, 2019. "Frequentist Model Averaging in Structural Equation Modelling," Psychometrika, Springer;The Psychometric Society, vol. 84(1), pages 84-104, March.
    11. Tyler C Shimko & Erik C Andersen, 2014. "COPASutils: An R Package for Reading, Processing, and Visualizing Data from COPAS Large-Particle Flow Cytometers," PLOS ONE, Public Library of Science, vol. 9(10), pages 1-5, October.
    12. Zulj, Valentin & Jin, Shaobo, 2024. "Can model averaging improve propensity score based estimation of average treatment effects?," Working Paper Series 2024:1, IFAU - Institute for Evaluation of Labour Market and Education Policy.
    13. Tobias Rentschler & Philipp Gries & Thorsten Behrens & Helge Bruelheide & Peter Kühn & Steffen Seitz & Xuezheng Shi & Stefan Trogisch & Thomas Scholten & Karsten Schmidt, 2019. "Comparison of catchment scale 3D and 2.5D modelling of soil organic carbon stocks in Jiangxi Province, PR China," PLOS ONE, Public Library of Science, vol. 14(8), pages 1-23, August.
    14. Cipollini, Francesca & Oneto, Luca & Coraddu, Andrea & Murphy, Alan John & Anguita, Davide, 2018. "Condition-based maintenance of naval propulsion systems: Data analysis with minimal feedback," Reliability Engineering and System Safety, Elsevier, vol. 177(C), pages 12-23.
    15. Paolo Gambetti & Francesco Roccazzella & Frédéric Vrins, 2022. "Meta-Learning Approaches for Recovery Rate Prediction," Risks, MDPI, vol. 10(6), pages 1-29, June.
    16. Hermel Homburger & Manuel K Schneider & Sandra Hilfiker & Andreas Lüscher, 2014. "Inferring Behavioral States of Grazing Livestock from High-Frequency Position Data Alone," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-22, December.
    17. Takahiro Takamatsu & Hideaki Ohtake & Takashi Oozeki, 2022. "Support Vector Quantile Regression for the Post-Processing of Meso-Scale Ensemble Prediction System Data in the Kanto Region: Solar Power Forecast Reducing Overestimation," Energies, MDPI, vol. 15(4), pages 1-18, February.
    18. Sven Husmann & Antoniya Shivarova & Rick Steinert, 2020. "Company classification using machine learning," Papers 2004.01496, arXiv.org, revised May 2020.
    19. Rachel Sippy & Daniel F Farrell & Daniel A Lichtenstein & Ryan Nightingale & Megan A Harris & Joseph Toth & Paris Hantztidiamantis & Nicholas Usher & Cinthya Cueva Aponte & Julio Barzallo Aguilar & An, 2020. "Severity Index for Suspected Arbovirus (SISA): Machine learning for accurate prediction of hospitalization in subjects suspected of arboviral infection," PLOS Neglected Tropical Diseases, Public Library of Science, vol. 14(2), pages 1-20, February.
    20. Bellotti, Anthony & Brigo, Damiano & Gambetti, Paolo & Vrins, Frédéric, 2021. "Forecasting recovery rates on non-performing loans with machine learning," International Journal of Forecasting, Elsevier, vol. 37(1), pages 428-444.
