Printed from https://ideas.repec.org/a/eee/jmvana/v190y2022ics0047259x21001743.html

Some clustering-based exact distribution-free k-sample tests applicable to high dimension, low sample size data

Author

Listed:
  • Paul, Biplab
  • De, Shyamal K.
  • Ghosh, Anil K.

Abstract

Testing homogeneity of k (≥ 2) multivariate distributions is a challenging problem in statistics, especially when the dimension of the data is much larger than the sample size. Most existing tests perform poorly in this high dimension, low sample size (HDLSS) regime, and many of them cannot be used at all. In this article, we propose some nonparametric tests for this purpose. These tests are distribution-free in finite samples. They are based on a high-dimensional clustering algorithm that partitions the data to form a contingency table, and the test statistics are constructed from the cell frequencies of that table. The tests can be based on a k-partition of the data, or the number of partitions can be estimated from the data and the tests constructed from that estimate. Under appropriate regularity conditions, we prove the consistency of these tests in the HDLSS asymptotic regime. We also consider a multiscale approach, in which the results for different numbers of partitions are aggregated judiciously. An extensive simulation study and analyses of some benchmark datasets illustrate the superiority of the proposed tests over some existing methods.
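
The pipeline described in the abstract (cluster the pooled data, cross-tabulate sample membership against cluster membership, test on the cell frequencies) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: plain k-means stands in for the high-dimensional clustering algorithm, Pearson's chi-square stands in for the test statistic, and the null distribution is approximated by Monte Carlo permutation of the sample labels rather than computed exactly. The function names (k_means, contingency_table, chi_square, permutation_p_value) are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=50, seed=0):
    """Plain Lloyd k-means (an assumed stand-in for the article's clustering step).
    It never looks at sample labels, only at the pooled observations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def contingency_table(sample_labels, cluster_labels):
    """Rows index the samples, columns the clusters; entries are cell frequencies."""
    rows = sorted(set(sample_labels))
    cols = sorted(set(cluster_labels))
    counts = Counter(zip(sample_labels, cluster_labels))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_square(table):
    """Pearson chi-square statistic (an assumed choice of statistic)."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def permutation_p_value(sample_labels, cluster_labels, n_perm=2000, seed=0):
    """Monte Carlo p-value: shuffle sample labels while cluster labels stay fixed,
    so both margins of the table are preserved in every permutation."""
    rng = random.Random(seed)
    observed = chi_square(contingency_table(sample_labels, cluster_labels))
    labels = list(sample_labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if chi_square(contingency_table(labels, cluster_labels)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Usage on synthetic HDLSS-style data: 2 samples of 10 points in 100 dimensions,
# the second shifted by 1.0 in every coordinate (illustrative values only).
gen = random.Random(1)
sample_a = [[gen.gauss(0.0, 1.0) for _ in range(100)] for _ in range(10)]
sample_b = [[gen.gauss(1.0, 1.0) for _ in range(100)] for _ in range(10)]
pooled = sample_a + sample_b
sample_ids = [0] * 10 + [1] * 10
clusters = k_means(pooled, k=2)
print(contingency_table(sample_ids, clusters))
print(permutation_p_value(sample_ids, clusters))
```

Because the clustering step ignores the sample labels, the labels are exchangeable under the null hypothesis given the cluster assignment, which is why shuffling them gives a reference distribution that does not depend on the underlying data distribution. The sketch fixes the number of clusters at k; estimating it from the data or aggregating over several values, as the abstract mentions, is not attempted here.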

Suggested Citation

  • Paul, Biplab & De, Shyamal K. & Ghosh, Anil K., 2022. "Some clustering-based exact distribution-free k-sample tests applicable to high dimension, low sample size data," Journal of Multivariate Analysis, Elsevier, vol. 190(C).
  • Handle: RePEc:eee:jmvana:v:190:y:2022:i:c:s0047259x21001743
    DOI: 10.1016/j.jmva.2021.104897

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047259X21001743
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Peter Hall & J. S. Marron & Amnon Neeman, 2005. "Geometric representation of high dimension, low sample size data," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(3), pages 427-444, June.
    2. Biswas, Munmun & Ghosh, Anil K., 2014. "A nonparametric two-sample test applicable to high dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 123(C), pages 160-171.
    3. Long Feng & Changliang Zou & Zhaojun Wang, 2016. "Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(514), pages 721-735, April.
    4. Paul R. Rosenbaum, 2005. "An exact distribution‐free test comparing two multivariate distributions based on adjacency," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(4), pages 515-530, September.
    5. Hao Chen & Xu Chen & Yi Su, 2018. "A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(523), pages 1146-1155, July.
    6. Zhenyu Liu & Reza Modarres, 2011. "A triangle test for equality of distribution functions in high dimensions," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 23(3), pages 605-615.
    7. Mondal, Pronoy K. & Biswas, Munmun & Ghosh, Anil K., 2015. "On high dimensional two-sample tests based on nearest neighbors," Journal of Multivariate Analysis, Elsevier, vol. 141(C), pages 168-178.
    8. Robert Tibshirani & Guenther Walther & Trevor Hastie, 2001. "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 63(2), pages 411-423.
    9. Anil K. Ghosh & Munmun Biswas, 2016. "Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(3), pages 525-547, September.
    10. Baringhaus, L. & Franz, C., 2004. "On a new multivariate two-sample test," Journal of Multivariate Analysis, Elsevier, vol. 88(1), pages 190-206, January.
    11. Sugar, Catherine A. & James, Gareth M., 2003. "Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach," Journal of the American Statistical Association, American Statistical Association, vol. 98, pages 750-763, January.
    12. Reza Modarres, 2020. "Graphical Comparison of High‐Dimensional Distributions," International Statistical Review, International Statistical Institute, vol. 88(3), pages 698-714, December.
    13. Junhui Wang, 2010. "Consistent selection of the number of clusters via crossvalidation," Biometrika, Biometrika Trust, vol. 97(4), pages 893-904.
    14. Shin-ichi Tsukada, 2019. "High dimensional two-sample test based on the inter-point distance," Computational Statistics, Springer, vol. 34(2), pages 599-615, June.
    15. Chen, Song Xi & Qin, Yingli, 2010. "A Two Sample Test for High Dimensional Data with Applications to Gene-set Testing," MPRA Paper 59642, University Library of Munich, Germany.
    16. Munmun Biswas & Minerva Mukhopadhyay & Anil K. Ghosh, 2014. "A distribution-free two-sample run test applicable to high-dimensional data," Biometrika, Biometrika Trust, vol. 101(4), pages 913-926.
    17. Peter Hall, 2002. "Permutation tests for equality of distributions in high-dimensional settings," Biometrika, Biometrika Trust, vol. 89(2), pages 359-374, June.
    18. Srivastava, Muni S. & Katayama, Shota & Kano, Yutaka, 2013. "A two sample test in high dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 114(C), pages 349-358.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shin-ichi Tsukada, 2019. "High dimensional two-sample test based on the inter-point distance," Computational Statistics, Springer, vol. 34(2), pages 599-615, June.
    2. Mondal, Pronoy K. & Biswas, Munmun & Ghosh, Anil K., 2015. "On high dimensional two-sample tests based on nearest neighbors," Journal of Multivariate Analysis, Elsevier, vol. 141(C), pages 168-178.
    3. Qiu, Tao & Zhang, Qintong & Fang, Yuanyuan & Xu, Wangli, 2024. "Testing homogeneity in high dimensional data through random projections," Journal of Multivariate Analysis, Elsevier, vol. 200(C).
    4. Biswas, Munmun & Ghosh, Anil K., 2014. "A nonparametric two-sample test applicable to high dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 123(C), pages 160-171.
    5. Pini, Alessia & Stamm, Aymeric & Vantini, Simone, 2018. "Hotelling’s T² in separable Hilbert spaces," Journal of Multivariate Analysis, Elsevier, vol. 167(C), pages 284-305.
    6. Saha, Enakshi & Sarkar, Soham & Ghosh, Anil K., 2017. "Some high-dimensional one-sample tests based on functions of interpoint distances," Journal of Multivariate Analysis, Elsevier, vol. 161(C), pages 83-95.
    7. Nicolas Städler & Sach Mukherjee, 2017. "Two-sample testing in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(1), pages 225-246, January.
    8. Zhang, Jin-Ting & Guo, Jia & Zhou, Bu, 2017. "Linear hypothesis testing in high-dimensional one-way MANOVA," Journal of Multivariate Analysis, Elsevier, vol. 155(C), pages 200-216.
    9. Cousido-Rocha, Marta & de Uña-Álvarez, Jacobo & Hart, Jeffrey D., 2019. "A two-sample test for the equality of univariate marginal distributions for high-dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 174(C).
    10. Anil K. Ghosh & Munmun Biswas, 2016. "Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(3), pages 525-547, September.
    11. Reza Modarres, 2020. "Graphical Comparison of High‐Dimensional Distributions," International Statistical Review, International Statistical Institute, vol. 88(3), pages 698-714, December.
    12. Lovato, Ilenia & Pini, Alessia & Stamm, Aymeric & Vantini, Simone, 2020. "Model-free two-sample test for network-valued data," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    13. Huang, Yuan & Li, Changcheng & Li, Runze & Yang, Songshan, 2022. "An overview of tests on high-dimensional means," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    14. Harrar, Solomon W. & Kong, Xiaoli, 2022. "Recent developments in high-dimensional inference for multivariate data: Parametric, semiparametric and nonparametric approaches," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    15. Jun Li, 2018. "Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem," Biometrika, Biometrika Trust, vol. 105(3), pages 529-546.
    16. Liu, Zhi & Xia, Xiaochao & Zhou, Wang, 2015. "A test for equality of two distributions via jackknife empirical likelihood and characteristic functions," Computational Statistics & Data Analysis, Elsevier, vol. 92(C), pages 97-114.
    17. Zhang, Jin-Ting & Zhu, Tianming, 2022. "A new normal reference test for linear hypothesis testing in high-dimensional heteroscedastic one-way MANOVA," Computational Statistics & Data Analysis, Elsevier, vol. 168(C).
    18. Luai Al-Labadi & Forough Fazeli Asl & Zahra Saberi, 2022. "A Bayesian nonparametric multi-sample test in any dimension," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 106(2), pages 217-242, June.
    19. Li, Weiming & Xu, Yangchang, 2022. "Asymptotic properties of high-dimensional spatial median in elliptical distributions with application," Journal of Multivariate Analysis, Elsevier, vol. 190(C).
    20. Modarres, Reza, 2014. "On the interpoint distances of Bernoulli vectors," Statistics & Probability Letters, Elsevier, vol. 84(C), pages 215-222.
