IDEAS home Printed from https://ideas.repec.org/a/spr/advdac/v18y2024i3d10.1007_s11634-023-00569-z.html

Variational inference for semiparametric Bayesian novelty detection in large datasets

Author

Listed:
  • Luca Benedetti

    (Politecnico di Milano)

  • Eric Boniardi

    (Politecnico di Milano)

  • Leonardo Chiani

    (Politecnico di Milano
    Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici)

  • Jacopo Ghirri

    (Politecnico di Milano
    Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici)

  • Marta Mastropietro

    (Politecnico di Milano
    Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici)

  • Andrea Cappozzo

    (Politecnico di Milano)

  • Francesco Denti

    (Università Cattolica del Sacro Cuore)

Abstract

After being trained on a fully labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, known as Brand, recently introduced in the literature. Leveraging a model-based mixture representation, Brand clusters the test observations into known training terms or a single novelty term. The novelty term is in turn modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale Brand to large datasets, we propose a variational Bayes approach that provides an efficient algorithm for posterior approximation. Thorough simulation studies demonstrate a significant gain in efficiency together with excellent classification performance. Finally, to showcase its applicability, we perform a novelty detection analysis on the openly available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
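
The allocation idea in the abstract, where each test observation goes either to a known training term or to a single novelty term, can be illustrated with a small sketch. This is not the authors' implementation. It assumes univariate Gaussian classes with fixed, made-up parameters, it replaces the Dirichlet Process mixture with one wide Gaussian as a stand-in for the novelty term, and it performs no variational inference. All names and numbers below are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def allocate(x, known, novelty_weight, novelty_density):
    """Posterior probabilities over the known classes plus one novelty term.

    known: list of (weight, mean, sd) tuples, one per training class.
    novelty_density: callable returning the novelty-term density at x.
    The last entry of the returned list is the novelty probability.
    """
    scores = [w * normal_pdf(x, m, s) for (w, m, s) in known]
    scores.append(novelty_weight * novelty_density(x))
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical parameters: two known classes and a 10% prior novelty weight.
known_classes = [(0.45, 0.0, 1.0), (0.45, 5.0, 1.0)]


def wide_novelty(x):
    """Assumed stand-in for the novelty term: a single diffuse Gaussian."""
    return normal_pdf(x, 2.5, 10.0)


for x in (0.2, 4.8, 15.0):
    probs = allocate(x, known_classes, 0.10, wide_novelty)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = "novelty" if best == len(known_classes) else f"class {best}"
    print(x, label)
```

In this toy setting, points close to a known class mean are assigned to that class, while a point far from both means falls to the novelty term because only the diffuse component gives it non-negligible density.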

Suggested Citation

  • Luca Benedetti & Eric Boniardi & Leonardo Chiani & Jacopo Ghirri & Marta Mastropietro & Andrea Cappozzo & Francesco Denti, 2024. "Variational inference for semiparametric Bayesian novelty detection in large datasets," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 18(3), pages 681-703, September.
  • Handle: RePEc:spr:advdac:v:18:y:2024:i:3:d:10.1007_s11634-023-00569-z
    DOI: 10.1007/s11634-023-00569-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-023-00569-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11634-023-00569-z?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mengbing Li & Daniel E. Park & Maliha Aziz & Cindy M. Liu & Lance B. Price & Zhenke Wu, 2023. "Integrating sample similarities into latent class analysis: a tree‐structured shrinkage approach," Biometrics, The International Biometric Society, vol. 79(1), pages 264-279, March.
    2. Loaiza-Maya, Rubén & Smith, Michael Stanley & Nott, David J. & Danaher, Peter J., 2022. "Fast and accurate variational inference for models with many latent variables," Journal of Econometrics, Elsevier, vol. 230(2), pages 339-362.
    3. Youngseon Lee & Seongil Jo & Jaeyong Lee, 2022. "A variational inference for the Lévy adaptive regression with multiple kernels," Computational Statistics, Springer, vol. 37(5), pages 2493-2515, November.
    4. Gael M. Martin & David T. Frazier & Christian P. Robert, 2020. "Computing Bayes: Bayesian Computation from 1763 to the 21st Century," Monash Econometrics and Business Statistics Working Papers 14/20, Monash University, Department of Econometrics and Business Statistics.
    5. Gael M. Martin & David T. Frazier & Ruben Loaiza-Maya & Florian Huber & Gary Koop & John Maheu & Didier Nibbering & Anastasios Panagiotelis, 2023. "Bayesian Forecasting in the 21st Century: A Modern Review," Monash Econometrics and Business Statistics Working Papers 1/23, Monash University, Department of Econometrics and Business Statistics.
    6. Gary Koop & Dimitris Korobilis, 2023. "Bayesian Dynamic Variable Selection In High Dimensions," International Economic Review, Department of Economics, University of Pennsylvania and Osaka University Institute of Social and Economic Research Association, vol. 64(3), pages 1047-1074, August.
    7. Bansal, Prateek & Krueger, Rico & Graham, Daniel J., 2021. "Fast Bayesian estimation of spatial count data models," Computational Statistics & Data Analysis, Elsevier, vol. 157(C).
    8. Kazuhiro Yamaguchi & Kensuke Okada, 2020. "Variational Bayes Inference for the DINA Model," Journal of Educational and Behavioral Statistics, vol. 45(5), pages 569-597, October.
    9. Korobilis, Dimitris & Koop, Gary, 2018. "Variational Bayes inference in high-dimensional time-varying parameter models," Essex Finance Centre Working Papers 22665, University of Essex, Essex Business School.
    10. Yuan Fang & Dimitris Karlis & Sanjeena Subedi, 2022. "Infinite Mixtures of Multivariate Normal-Inverse Gaussian Distributions for Clustering of Skewed Data," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 510-552, November.
    11. Daziano, Ricardo A., 2022. "Willingness to delay charging of electric vehicles," Research in Transportation Economics, Elsevier, vol. 94(C).
    12. Linda S. L. Tan, 2021. "Use of model reparametrization to improve variational Bayes," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(1), pages 30-57, February.
    13. Gael M. Martin & David T. Frazier & Christian P. Robert, 2021. "Approximating Bayes in the 21st Century," Monash Econometrics and Business Statistics Working Papers 24/21, Monash University, Department of Econometrics and Business Statistics.
    14. Gefang, Deborah & Koop, Gary & Poon, Aubrey, 2023. "Forecasting using variational Bayesian inference in large vector autoregressions with hierarchical shrinkage," International Journal of Forecasting, Elsevier, vol. 39(1), pages 346-363.
    15. Deborah Gefang & Gary Koop & Aubrey Poon, 2019. "Variational Bayesian Inference in Large Vector Autoregressions with Hierarchical Shrinkage," Economic Statistics Centre of Excellence (ESCoE) Discussion Papers ESCoE DP-2019-07, Economic Statistics Centre of Excellence (ESCoE).
    16. Gunawan, David & Kohn, Robert & Nott, David, 2021. "Variational Bayes approximation of factor stochastic volatility models," International Journal of Forecasting, Elsevier, vol. 37(4), pages 1355-1375.
    17. Bruno Jacobs & Dennis Fok & Bas Donkers, 2021. "Understanding Large-Scale Dynamic Purchase Behavior," Marketing Science, INFORMS, vol. 40(5), pages 844-870, September.
    18. Kazuhiro Yamaguchi, 2023. "Bayesian Analysis Methods for Two-Level Diagnosis Classification Models," Journal of Educational and Behavioral Statistics, vol. 48(6), pages 773-809, December.
    19. Riccardo Rastelli & Michael Fop, 2020. "A stochastic block model for interaction lengths," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 485-512, June.
    20. Michael Fop & Pierre-Alexandre Mattei & Charles Bouveyron & Thomas Brendan Murphy, 2022. "Unobserved classes and extra variables in high-dimensional discriminant analysis," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(1), pages 55-92, March.
