Printed from https://ideas.repec.org/a/inm/orijds/v2y2023i2p99-115.html

A Nonparametric Subspace Analysis Approach with Application to Anomaly Detection Ensembles

Authors

Listed:
  • Irad Ben-Gal

    (Department of Industrial Engineering, Tel Aviv University, 69978 Tel Aviv, Israel)

  • Marcelo Bacher

    (Department of Industrial Engineering, Tel Aviv University, 69978 Tel Aviv, Israel)

  • Morris Amara

    (Department of Industrial Engineering, Tel Aviv University, 69978 Tel Aviv, Israel)

  • Erez Shmueli

    (Department of Industrial Engineering, Tel Aviv University, 69978 Tel Aviv, Israel)

Abstract

Identifying anomalies in multidimensional data sets is an important yet challenging task in many real-world applications. A special case arises when anomalies are occluded in a small subset of attributes. We propose a new subspace analysis approach, called agglomerative attribute grouping (AAG), that searches for subspaces composed of highly correlative (in the general sense) attributes. Such correlations among attributes can better reflect the behavior of normal observations and hence, can be used to improve the identification of abnormal data samples. The proposed AAG algorithm relies on a generalized multiattribute measure (derived from information theory measures over attributes’ partitions) for evaluating the “information distance” among various subsets of attributes. To determine the set of subspaces, AAG applies a variation of the well-known agglomerative clustering algorithm with the proposed measure as the underlying distance function, whereas in contrast to existing methods, AAG does not require any tuning of parameters. Finally, the set of informative subspaces can be used to improve subspace-based analytical tasks, such as anomaly detection, novelty detection, forecasting, and clustering. Extensive evaluation over real-world data sets demonstrates that (i) in the vast majority of cases, AAG outperforms both classical and state-of-the-art subspace analysis methods when used in anomaly and novelty detection ensembles; (ii) it often generates fewer subspaces with fewer attributes each, thus resulting in faster training times for the anomaly and novelty detection ensemble; and (iii) the generated subspaces can also be useful in other analytical tasks, such as clustering and forecasting.
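The abstract does not spell out AAG's multiattribute measure or its stopping rule, so the following is only a rough sketch of the general idea (agglomerative grouping of attributes under an information-theoretic distance), not the authors' algorithm. It swaps in a pairwise normalized variation of information, 1 - I(X;Y)/H(X,Y), with average linkage. The names `entropy`, `info_distance`, `group_attributes`, and the `threshold` parameter are hypothetical. Unlike AAG, which is described as requiring no parameter tuning, this sketch needs a user-chosen threshold and already-discretized attributes.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_distance(x, y):
    """Normalized variation of information, 1 - I(X;Y)/H(X,Y), in [0, 1].

    This pairwise measure is an assumption standing in for the paper's
    generalized multiattribute measure, which the abstract does not define.
    """
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0.0:
        return 0.0  # both attributes are constant: treat them as identical
    mutual_info = entropy(x) + entropy(y) - h_xy
    return 1.0 - mutual_info / h_xy


def group_attributes(columns, threshold):
    """Average-linkage agglomerative grouping of discrete attributes.

    columns:   dict mapping attribute name -> list of discrete values
    threshold: hypothetical stopping distance (AAG itself is described
               as needing no such tuning parameter)
    Returns a sorted list of sorted attribute-name groups.
    """
    names = list(columns)
    dist = {
        frozenset(pair): info_distance(columns[pair[0]], columns[pair[1]])
        for pair in combinations(names, 2)
    }

    def linkage(a, b):
        # Mean pairwise distance between the members of two groups.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    groups = [[name] for name in names]
    while len(groups) > 1:
        # Find the closest pair of groups under average linkage.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]),
        )
        if linkage(groups[i], groups[j]) > threshold:
            break  # remaining groups are too far apart to merge
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return sorted(sorted(g) for g in groups)


# Toy data: b is a relabeling of a, d is a copy of c, and a is independent of c.
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [1, 1, 0, 0, 1, 1, 0, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "d": [0, 1, 0, 1, 0, 1, 0, 1],
}
subspaces = group_attributes(toy, threshold=0.5)
```

Each resulting group could then serve as one subspace on which a separate anomaly detector is trained, with the detectors' scores combined into an ensemble, in the spirit of the use case the abstract describes.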

Suggested Citation

  • Irad Ben-Gal & Marcelo Bacher & Morris Amara & Erez Shmueli, 2023. "A Nonparametric Subspace Analysis Approach with Application to Anomaly Detection Ensembles," INFORMS Journal on Data Science, INFORMS, vol. 2(2), pages 99-115, October.
  • Handle: RePEc:inm:orijds:v:2:y:2023:i:2:p:99-115
    DOI: 10.1287/ijds.2023.0027

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/ijds.2023.0027
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. repec:hig:wpaper:98sti2019 is not listed on IDEAS
    2. Petersen, Alexander M. & Rotolo, Daniele & Leydesdorff, Loet, 2016. "A triple helix model of medical innovation: Supply, demand, and technological capabilities in terms of Medical Subject Headings," Research Policy, Elsevier, vol. 45(3), pages 666-681.
    3. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    4. Yujia Li & Xiangrui Zeng & Chien‐Wei Lin & George C. Tseng, 2022. "Simultaneous estimation of cluster number and feature sparsity in high‐dimensional cluster analysis," Biometrics, The International Biometric Society, vol. 78(2), pages 574-585, June.
    5. Park, Han Woo & Leydesdorff, Loet, 2010. "Longitudinal trends in networks of university-industry-government relations in South Korea: The role of programmatic incentives," Research Policy, Elsevier, vol. 39(5), pages 640-649, June.
    6. Songyot Nakariyakul, 2019. "A hybrid gene selection algorithm based on interaction information for microarray-based cancer classification," PLOS ONE, Public Library of Science, vol. 14(2), pages 1-17, February.
    7. Louis Verny & Nadir Sella & Séverine Affeldt & Param Priya Singh & Hervé Isambert, 2017. "Learning causal networks with latent variables from multivariate information in genomic data," PLOS Computational Biology, Public Library of Science, vol. 13(10), pages 1-25, October.
    8. Qiang Ji & Dayong Zhang & Yuqian Zhao, 2022. "Intra-day co-movements of crude oil futures: China and the international benchmarks," Annals of Operations Research, Springer, vol. 313(1), pages 77-103, June.
    9. Xiaojun Hu & Xian Li & Ronald Rousseau, 2021. "Mathematical reflections on Triple Helix calculations," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8581-8587, October.
    10. Inga A. Ivanova & Loet Leydesdorff, 2014. "A simulation model of the Triple Helix of university–industry–government relations and the decomposition of the redundancy," Scientometrics, Springer;Akadémiai Kiadó, vol. 99(3), pages 927-948, June.
    11. Loet Leydesdorff & Han Woo Park & Balazs Lengyel, 2014. "A routine for measuring synergy in university–industry–government relations: mutual information as a Triple-Helix and Quadruple-Helix indicator," Scientometrics, Springer;Akadémiai Kiadó, vol. 99(1), pages 27-35, April.
    12. Marianna Mauro & Monica Giancotti & Giovanna Talarico, 2017. "Mapping the field: A bibliometric analysis of accountability literature in healthcare," MECOSAN, FrancoAngeli Editore, vol. 2017(101), pages 7-30.
    13. Kondo, Yumi & Salibian-Barrera, Matias & Zamar, Ruben, 2016. "RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 72(i05).
    14. Mariusz Kubkowski & Jan Mielniczuk, 2021. "Asymptotic Distributions of Empirical Interaction Information," Methodology and Computing in Applied Probability, Springer, vol. 23(1), pages 291-315, March.
    15. Jaković Božidar & Ćurlin Tamara & Miloloža Ivan, 2021. "Enterprise Digital Divide: Website e-Commerce Functionalities among European Union Enterprises," Business Systems Research, Sciendo, vol. 12(1), pages 197-215, May.
    16. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    17. Oliver Schaer & Nikolaos Kourentzes & Robert Fildes, 2022. "Predictive competitive intelligence with prerelease online search traffic," Production and Operations Management, Production and Operations Management Society, vol. 31(10), pages 3823-3839, October.
    18. Loet Leydesdorff, 2011. "“Structuration” by intellectual organization: the configuration of knowledge in relations among structural components in networks of science," Scientometrics, Springer;Akadémiai Kiadó, vol. 88(2), pages 499-520, August.
    19. Lengyel, Balázs & Leydesdorff, Loet, 2015. "The Effects of FDI on Innovation Systems in Hungarian Regions: Where is the Synergy Generated?," MPRA Paper 73945, University Library of Munich, Germany.
    20. Fang, Yixin & Wang, Junhui, 2011. "Penalized cluster analysis with applications to family data," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2128-2136, June.
    21. J. Vera & Rodrigo Macías & Willem Heiser, 2013. "Cluster Differences Unfolding for Two-Way Two-Mode Preference Rating Data," Journal of Classification, Springer;The Classification Society, vol. 30(3), pages 370-396, October.
