IDEAS home Printed from https://ideas.repec.org/a/eee/csdana/v53y2009i12p3987-3998.html
   My bibliography  Save this article

Partition clustering of high dimensional low sample size data based on p-values

Author

Listed:
  • von Borries, George
  • Wang, Haiyan

Abstract

Clustering techniques play an important role in analyzing high dimensional data that is common in high-throughput screening such as microarray and mass spectrometry data. Effective use of the high dimensionality and some replications can help to increase clustering accuracy and stability. In this article a new partitioning algorithm with a robust distance measure is introduced to cluster variables in high dimensional low sample size (HDLSS) data that contain a large number of independent variables with a small number of replications per variable. The proposed clustering algorithm, PPCLUST, considers data from a mixture distribution and uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity to separate the mixture components. PPCLUST is able to efficiently cluster a large number of variables in the presence of very few replications. Inherited from the robustness of rank procedure, the new algorithm is robust to outliers and invariant to monotone transformations of data. Numerical studies and an application to microarray gene expression data for colorectal cancer study are discussed.

Suggested Citation

  • von Borries, George & Wang, Haiyan, 2009. "Partition clustering of high dimensional low sample size data based on p-values," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 3987-3998, October.
  • Handle: RePEc:eee:csdana:v:53:y:2009:i:12:p:3987-3998
    as

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(09)00241-2
    Download Restriction: Full text for ScienceDirect subscribers only.
    ---><---

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    as
    1. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    2. Akritas M.G. & Papadatos N., 2004. "Heteroscedastic One-Way ANOVA and Lack-of-Fit Tests," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 368-382, January.
    3. Gabor J. Szekely & Maria L. Rizzo, 2005. "Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method," Journal of Classification, Springer;The Classification Society, vol. 22(2), pages 151-183, September.
    4. John D. Storey, 2002. "A direct approach to false discovery rates," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 64(3), pages 479-498, August.
    5. Efron, Bradley, 2007. "Correlation and Large-Scale Simultaneous Significance Testing," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 93-103, March.
    Full references (including those not matched with items on IDEAS)

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.
    as


    Cited by:

    1. Bouveyron, Charles & Brunet-Saumard, Camille, 2014. "Model-based clustering of high-dimensional data: A review," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 52-78.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wen Shi & Xi Chen & Jennifer Shang, 2019. "An Efficient Morris Method-Based Framework for Simulation Factor Screening," INFORMS Journal on Computing, INFORMS, vol. 31(4), pages 745-770, October.
    2. Jianqing Fan & Xu Han, 2017. "Estimation of the false discovery proportion with unknown dependence," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(4), pages 1143-1164, September.
    3. Renato Amorim, 2015. "Feature Relevance in Ward’s Hierarchical Clustering Using the L p Norm," Journal of Classification, Springer;The Classification Society, vol. 32(1), pages 46-62, April.
    4. Nik Tuzov & Frederi Viens, 2011. "Mutual fund performance: false discoveries, bias, and power," Annals of Finance, Springer, vol. 7(2), pages 137-169, May.
    5. Miecznikowski, Jeffrey C. & Gold, David & Shepherd, Lori & Liu, Song, 2011. "Deriving and comparing the distribution for the number of false positives in single step methods to control k-FWER," Statistics & Probability Letters, Elsevier, vol. 81(11), pages 1695-1705, November.
    6. Alan Lee & Bobby Willcox, 2014. "Minkowski Generalizations of Ward’s Method in Hierarchical Clustering," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 194-218, July.
    7. Friguet, Chloé & Causeur, David, 2011. "Estimation of the proportion of true null hypotheses in high-dimensional data under dependence," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2665-2676, September.
    8. I. Ahmed & C. Dalmasso & F. Haramburu & F. Thiessard & P. Broët & P. Tubert-Bitter, 2010. "False Discovery Rate Estimation for Frequentist Pharmacovigilance Signal Detection Methods," Biometrics, The International Biometric Society, vol. 66(1), pages 301-309, March.
    9. T. Tony Cai & Weidong Liu, 2016. "Large-Scale Multiple Testing of Correlations," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(513), pages 229-240, March.
    10. Wenguang Sun & T. Tony Cai, 2009. "Large‐scale multiple testing under dependence," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 393-424, April.
    11. Ali Karimnezhad, 2022. "A simple yet efficient method of local false discovery rate estimation designed for genome-wide association data analysis," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(1), pages 159-180, March.
    12. Wang, Xia & Shojaie, Ali & Zou, Jian, 2019. "Bayesian hidden Markov models for dependent large-scale multiple testing," Computational Statistics & Data Analysis, Elsevier, vol. 136(C), pages 123-136.
    13. Alicja Grześkowiak, 2016. "Assessment of Participation in Cultural Activities in Poland by Selected Multivariate Methods," European Journal of Social Sciences Education and Research Articles, Revistia Research and Publishing, vol. 3, January -.
    14. Yunpeng Zhao & Qing Pan & Chengan Du, 2019. "Logistic regression augmented community detection for network data with application in identifying autism‐related gene pathways," Biometrics, The International Biometric Society, vol. 75(1), pages 222-234, March.
    15. Youngchao Ge & Sandrine Dudoit & Terence Speed, 2003. "Resampling-based multiple testing for microarray data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 12(1), pages 1-77, June.
    16. Wu, Han-Ming & Tien, Yin-Jing & Chen, Chun-houh, 2010. "GAP: A graphical environment for matrix visualization and cluster analysis," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 767-778, March.
    17. José E. Chacón, 2021. "Explicit Agreement Extremes for a 2 × 2 Table with Given Marginals," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 257-263, July.
    18. F. Marta L. Di Lascio & Andrea Menapace & Roberta Pappadà, 2024. "A spatially‐weighted AMH copula‐based dissimilarity measure for clustering variables: An application to urban thermal efficiency," Environmetrics, John Wiley & Sons, Ltd., vol. 35(1), February.
    19. Yifan Zhu & Chongzhi Di & Ying Qing Chen, 2019. "Clustering Functional Data with Application to Electronic Medication Adherence Monitoring in HIV Prevention Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 11(2), pages 238-261, July.
    20. Irene Vrbik & Paul McNicholas, 2015. "Fractionally-Supervised Classification," Journal of Classification, Springer;The Classification Society, vol. 32(3), pages 359-381, October.

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:53:y:2009:i:12:p:3987-3998. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/locate/csda .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.