IDEAS home Printed from https://ideas.repec.org/a/bla/scjsta/v51y2024i2p672-696.html
   My bibliography  Save this article

A new paradigm for high‐dimensional data: Distance‐based semiparametric feature aggregation framework via between‐subject attributes

Author

Listed:
  • Jinyuan Liu
  • Xinlian Zhang
  • Tuo Lin
  • Ruohui Chen
  • Yuan Zhong
  • Tian Chen
  • Tsungchin Wu
  • Chenyu Liu
  • Anna Huang
  • Tanya T. Nguyen
  • Ellen E. Lee
  • Dilip V. Jeste
  • Xin M. Tu

Abstract

This article proposes a distance‐based framework incentivized by the paradigm shift toward feature aggregation for high‐dimensional data, which does not rely on the sparse‐feature assumption or the permutation‐based inference. Focusing on distance‐based outcomes that preserve information without truncating any features, a class of semiparametric regression has been developed, which encapsulates multiple sources of high‐dimensional variables using pairwise outcomes of between‐subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via the U‐statistics‐based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root‐n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to the human microbiome and wearables data are provided, where the feature dimensions are tens of thousands.

Suggested Citation

  • Jinyuan Liu & Xinlian Zhang & Tuo Lin & Ruohui Chen & Yuan Zhong & Tian Chen & Tsungchin Wu & Chenyu Liu & Anna Huang & Tanya T. Nguyen & Ellen E. Lee & Dilip V. Jeste & Xin M. Tu, 2024. "A new paradigm for high‐dimensional data: Distance‐based semiparametric feature aggregation framework via between‐subject attributes," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 51(2), pages 672-696, June.
  • Handle: RePEc:bla:scjsta:v:51:y:2024:i:2:p:672-696
    DOI: 10.1111/sjos.12695
    as

    Download full text from publisher

    File URL: https://doi.org/10.1111/sjos.12695
    Download Restriction: no

    File URL: https://libkey.io/10.1111/sjos.12695?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    References listed on IDEAS

    as
    1. Victor Chernozhukov & Denis Chetverikov & Mert Demirer & Esther Duflo & Christian Hansen & Whitney Newey & James Robins, 2018. "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, Royal Economic Society, vol. 21(1), pages 1-68, February.
    2. Michael Vogt & Matthias Schmid, 2021. "Clustering with statistical error control," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 48(3), pages 729-760, September.
    3. Stefan Wager & Susan Athey, 2018. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(523), pages 1228-1242, July.
    4. Robert Tibshirani, 2011. "Regression shrinkage and selection via the lasso: a retrospective," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 73(3), pages 273-282, June.
    5. Eddelbuettel, Dirk & Francois, Romain, 2011. "Rcpp: Seamless R and C++ Integration," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 40(i08).
    6. Peter J. Turnbaugh & Ruth E. Ley & Micah Hamady & Claire M. Fraser-Liggett & Rob Knight & Jeffrey I. Gordon, 2007. "The Human Microbiome Project," Nature, Nature, vol. 449(7164), pages 804-810, October.
    7. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    8. Goslee, Sarah C. & Urban, Dean L., 2007. "The ecodist Package for Dissimilarity-based Analysis of Ecological Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 22(i07).
    9. Michael J. Daniels & Robert E. Kass, 2001. "Shrinkage Estimators for Covariance Matrices," Biometrics, The International Biometric Society, vol. 57(4), pages 1173-1184, December.
    10. Philip T. Reiss & M. Henry H. Stevens & Zarrar Shehzad & Eva Petkova & Michael P. Milham, 2010. "On Distance-Based Permutation Tests for Between-Group Comparisons," Biometrics, The International Biometric Society, vol. 66(2), pages 636-643, June.
    11. Yaqing Chen & Zhenhua Lin & Hans-Georg Müller, 2023. "Wasserstein Regression," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 118(542), pages 869-882, April.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jin, Shaobo & Moustaki, Irini & Yang-Wallentin, Fan, 2018. "Approximated penalized maximum likelihood for exploratory factor analysis: an orthogonal case," LSE Research Online Documents on Economics 88118, London School of Economics and Political Science, LSE Library.
    2. Agboola, Oluwagbenga David & Yu, Han, 2023. "Neighborhood-based cross fitting approach to treatment effects with high-dimensional data," Computational Statistics & Data Analysis, Elsevier, vol. 186(C).
    3. Ricardo P. Masini & Marcelo C. Medeiros & Eduardo F. Mendes, 2023. "Machine learning advances for time series forecasting," Journal of Economic Surveys, Wiley Blackwell, vol. 37(1), pages 76-111, February.
    4. Shaobo Jin & Irini Moustaki & Fan Yang-Wallentin, 2018. "Approximated Penalized Maximum Likelihood for Exploratory Factor Analysis: An Orthogonal Case," Psychometrika, Springer;The Psychometric Society, vol. 83(3), pages 628-649, September.
    5. Alena Skolkova, 2023. "Instrumental Variable Estimation with Many Instruments Using Elastic-Net IV," CERGE-EI Working Papers wp759, The Center for Economic Research and Graduate Education - Economics Institute, Prague.
    6. Alexandre Belloni & Victor Chernozhukov & Denis Chetverikov & Christian Hansen & Kengo Kato, 2018. "High-dimensional econometrics and regularized GMM," CeMMAP working papers CWP35/18, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    7. Nicolaj N. Mühlbach, 2020. "Tree-based Synthetic Control Methods: Consequences of moving the US Embassy," CREATES Research Papers 2020-04, Department of Economics and Business Economics, Aarhus University.
    8. Kyle Colangelo & Ying-Ying Lee, 2019. "Double debiased machine learning nonparametric inference with continuous treatments," CeMMAP working papers CWP72/19, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    9. Ruoxuan Xiong & Allison Koenecke & Michael Powell & Zhu Shen & Joshua T. Vogelstein & Susan Athey, 2021. "Federated Causal Inference in Heterogeneous Observational Data," Papers 2107.11732, arXiv.org, revised Apr 2023.
    10. Arne Henningsen & Guy Low & David Wuepper & Tobias Dalhaus & Hugo Storm & Dagim Belay & Stefan Hirsch, 2024. "Estimating Causal Effects with Observational Data: Guidelines for Agricultural and Applied Economists," IFRO Working Paper 2024/03, University of Copenhagen, Department of Food and Resource Economics.
    11. Davide Viviano & Jelena Bradic, 2019. "Synthetic learner: model-free inference on treatments over time," Papers 1904.01490, arXiv.org, revised Aug 2022.
    12. Xinkun Nie & Stefan Wager, 2017. "Quasi-Oracle Estimation of Heterogeneous Treatment Effects," Papers 1712.04912, arXiv.org, revised Aug 2020.
    13. Elek, Péter & Bíró, Anikó, 2021. "Regional differences in diabetes across Europe – regression and causal forest analyses," Economics & Human Biology, Elsevier, vol. 40(C).
    14. Michael C. Knaus, 2021. "A double machine learning approach to estimate the effects of musical practice on student’s skills," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 282-300, January.
    15. Michael C Knaus & Michael Lechner & Anthony Strittmatter, 2021. "Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence," The Econometrics Journal, Royal Economic Society, vol. 24(1), pages 134-161.
    16. Combes, Pierre-Philippe & Gobillon, Laurent & Zylberberg, Yanos, 2022. "Urban economics in a historical perspective: Recovering data with machine learning," Regional Science and Urban Economics, Elsevier, vol. 94(C).
    17. Kyle Colangelo & Ying-Ying Lee, 2019. "Double debiased machine learning nonparametric inference with continuous treatments," CeMMAP working papers CWP54/19, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    18. Bokelmann, Björn & Lessmann, Stefan, 2024. "Improving uplift model evaluation on randomized controlled trial data," European Journal of Operational Research, Elsevier, vol. 313(2), pages 691-707.
    19. Chunrong Ai & Oliver Linton & Kaiji Motegi & Zheng Zhang, 2021. "A unified framework for efficient estimation of general treatment models," Quantitative Economics, Econometric Society, vol. 12(3), pages 779-816, July.
    20. Daniel Goller, 2023. "Analysing a built-in advantage in asymmetric darts contests using causal machine learning," Annals of Operations Research, Springer, vol. 325(1), pages 649-679, June.

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:bla:scjsta:v:51:y:2024:i:2:p:672-696. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Wiley Content Delivery (email available below). General contact details of provider: http://www.blackwellpublishing.com/journal.asp?ref=0303-6898 .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.