
Data reduction prior to inference: Are there consequences of comparing groups using a t‐test based on principal component scores?

Author

  • Edward J. Bedrick

Abstract

Researchers often use a two‐step process to analyze multivariate data. First, dimensionality is reduced using a technique such as principal component analysis, followed by a group comparison using a t‐test or analysis of variance. Although this practice is often discouraged, the statistical properties of this procedure are not well understood, starting with the hypothesis being tested. We suggest that this approach might be considering two distinct hypotheses, one of which is a global test of no differences in the mean vectors, and the other being a focused test of a specific linear combination where the coefficients have been estimated from the data. We study the asymptotic properties of the two‐sample t‐statistic for these two scenarios, assuming a nonsparse setting. We show that the size of the global test agrees with the presumed level but that the test has poor power. In contrast, the size of the focused test can be arbitrarily distorted with certain mean and covariance structures. A simple method is provided to correct the size of the focused test. Data analyses and simulations are used to illustrate the results. Recommendations on the use of this two‐step method and the related use of principal components for prediction are provided.
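
As a rough illustration of the two-step procedure described in the abstract, the following Python sketch fits principal components to the pooled sample, computes the pooled-variance two-sample t-statistic on the first-component scores, and estimates by simulation how often the test rejects. It is not the author's code. The function names, the choice to fit the components on the pooled data without using group labels, the power-iteration eigenvector routine, the normal reference distribution for the p-value, and the simulation settings are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def first_pc(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration.

    Assumption: a fixed number of iterations is enough; no convergence check.
    """
    n, p = len(rows), len(rows[0])
    centre = [mean(r[j] for r in rows) for j in range(p)]
    z = [[r[j] - centre[j] for j in range(p)] for r in rows]
    cov = [[sum(zi[a] * zi[b] for zi in z) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def pc_ttest(x, y):
    """Two-step procedure: first-PC scores, then a two-sample t-statistic.

    Assumption: the component is estimated from the pooled data (x + y),
    and the p-value uses a large-sample normal approximation.
    """
    v = first_pc(x + y)
    sx = [sum(a * b for a, b in zip(r, v)) for r in x]
    sy = [sum(a * b for a, b in zip(r, v)) for r in y]
    n1, n2 = len(sx), len(sy)
    pooled_var = ((n1 - 1) * variance(sx) + (n2 - 1) * variance(sy)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t = (mean(sx) - mean(sy)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_value


def rejection_rate(n=50, sds=(5, 4, 3, 2, 1), shift=None,
                   reps=200, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which the test rejects at level alpha.

    Hypothetical setup: independent normal coordinates with standard
    deviations `sds`; group y has its mean vector moved by `shift`.
    """
    shift = shift or [0.0] * len(sds)
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        y = [[rng.gauss(m, s) for m, s in zip(shift, sds)] for _ in range(n)]
        _, p_value = pc_ttest(x, y)
        if p_value < alpha:
            rejections += 1
    return rejections / reps
```

Calling `rejection_rate()` with its defaults simulates the case of equal mean vectors, so the returned fraction estimates the size of the test under these settings. Passing a nonzero `shift`, for example `shift=(0, 0, 0, 0, 1)`, places the mean difference on a low-variance coordinate and gives a rough estimate of power in that configuration.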

Suggested Citation

  • Edward J. Bedrick, 2020. "Data reduction prior to inference: Are there consequences of comparing groups using a t‐test based on principal component scores?," Biometrics, The International Biometric Society, vol. 76(2), pages 508-517, June.
  • Handle: RePEc:bla:biomet:v:76:y:2020:i:2:p:508-517
    DOI: 10.1111/biom.13159


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wang, Zihan & Daeipour, Mohamad & Xu, Hongyi, 2023. "Quantification and propagation of Aleatoric uncertainties in topological structures," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    2. J. O. Bauer & B. Drabant, 2021. "Regression based thresholds in principal loading analysis," Papers 2103.06691, arXiv.org, revised Mar 2022.
    3. Ahlquist, John S. & Breunig, Christian, 2009. "Country clustering in comparative political economy," MPIfG Discussion Paper 09/5, Max Planck Institute for the Study of Societies.
    4. McLachlan, G. J. & Peel, D. & Bean, R. W., 2003. "Modelling high-dimensional data by mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 41(3-4), pages 379-388, January.
    5. Lazhar Labiod & Mohamed Nadif, 2021. "Efficient regularized spectral data embedding," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(1), pages 99-119, March.
    6. Robert Kapłon, 2006. "A retrospective review of categorical data analysis – theory and marketing practice," Operations Research and Decisions, Wroclaw University of Science and Technology, Faculty of Management, vol. 16(1), pages 55-72.
    7. Liu, Shuangzhe & Leiva, Víctor & Zhuang, Dan & Ma, Tiefeng & Figueroa-Zúñiga, Jorge I., 2022. "Matrix differential calculus with applications in the multivariate linear model and its diagnostics," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    8. Zhiliang Ma & Adam Cardinal-Stakenas & Youngser Park & Michael Trosset & Carey Priebe, 2010. "Dimensionality Reduction on the Cartesian Product of Embeddings of Multiple Dissimilarity Matrices," Journal of Classification, Springer;The Classification Society, vol. 27(3), pages 307-321, November.
    9. Alessandro Casa & Giovanna Menardi, 2022. "Nonparametric semi-supervised classification with application to signal detection in high energy physics," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(3), pages 531-550, September.
    10. Boik, Robert J., 1998. "A Local Parameterization of Orthogonal and Semi-Orthogonal Matrices with Applications," Journal of Multivariate Analysis, Elsevier, vol. 67(2), pages 244-276, November.
    11. Haruhiko Ogasawara, 2002. "Concise formulas for the standard errors of component loading estimates," Psychometrika, Springer;The Psychometric Society, vol. 67(2), pages 289-297, June.
    12. Liang, Faming, 2007. "Use of SVD-based probit transformation in clustering gene expression profiles," Computational Statistics & Data Analysis, Elsevier, vol. 51(12), pages 6355-6366, August.
    13. Douglas Steinley, 2009. "F. Murtagh (2005). Correspondence analysis and data coding with Java and R. 230 pp., US$76.00. ISBN 1584885289," Psychometrika, Springer;The Psychometric Society, vol. 74(1), pages 181-183, March.
    14. Dirk Depril & Iven Mechelen & Tom Wilderjans, 2012. "Lowdimensional Additive Overlapping Clustering," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 297-320, October.
    15. Steland, Ansgar & von Sachs, Rainer, 2016. "Asymptotics for High–Dimensional Covariance Matrices and Quadratic Forms with Applications to the Trace Functional and Shrinkage," LIDAM Discussion Papers ISBA 2016038, Université catholique de Louvain, Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA).
    16. Michael C. Thrun & Alfred Ultsch, 2021. "Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 280-312, July.
    17. Andrews, Jeffrey L., 2018. "Addressing overfitting and underfitting in Gaussian model-based clustering," Computational Statistics & Data Analysis, Elsevier, vol. 127(C), pages 160-171.
    18. Douglas Steinley & Lawrence Hubert, 2008. "Order-Constrained Solutions in K-Means Clustering: Even Better Than Being Globally Optimal," Psychometrika, Springer;The Psychometric Society, vol. 73(4), pages 647-664, December.
    19. Chen, Dachuan, 2024. "High frequency principal component analysis based on correlation matrix that is robust to jumps, microstructure noise and asynchronous observation times," Journal of Econometrics, Elsevier, vol. 240(1).
    20. Bouveyron, Charles & Brunet, Camille, 2012. "Theoretical and practical considerations on the convergence properties of the Fisher-EM algorithm," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 29-41.
