Printed from https://ideas.repec.org/a/eee/csdana/v197y2024ics0167947324000598.html

Transfer learning via random forests: A one-shot federated approach

Authors

  • Xiang, Pengcheng
  • Zhou, Ling
  • Tang, Lu

Abstract

A one-shot federated transfer learning method using random forests (FTRF) is developed to improve the prediction accuracy at a target data site by leveraging information from auxiliary sites. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as the model trained on the target data alone regardless of possible data heterogeneity, which includes imbalanced and non-IID data distributions across sites and model mis-specification. FTRF has the ability to evaluate the similarity between the target and auxiliary sites, enabling the target site to autonomously select more similar site information to enhance its predictive performance. To ensure communication efficiency, FTRF adopts the model averaging idea that requires a single round of communication between the target and the auxiliary sites. Only fitted models from auxiliary sites are sent to the target site. Unlike traditional model averaging, FTRF incorporates predicted outcomes from other sites and the original variables when estimating model averaging weights, resulting in a variable-dependent weighting to better utilize models from auxiliary sites to improve prediction. Five real-world data examples show that FTRF reduces the prediction error by 2-40% compared to methods not utilizing auxiliary information.
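The workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' FTRF algorithm: it is one plausible reading of the abstract, in which each auxiliary site sends only a fitted model, and the target site then fits its own ensemble on the original variables augmented with the auxiliary predictions, so that the way auxiliary models are combined can vary with the covariates. All function names (`fit_tree`, `fit_forest`, `augment`, `fit_target`, `simulate`), the simulated data, and the site shifts are assumptions made for illustration. Bagged regression trees stand in for a full random forest, with no per-split feature subsampling.

```python
import random


def avg(values):
    return sum(values) / len(values)


def fit_tree(X, y, rng, depth=3, min_leaf=5):
    """Small regression tree; candidate thresholds are randomly sampled observed values."""
    if depth == 0 or len(y) < 2 * min_leaf:
        m = avg(y)
        return lambda row: m
    best = None
    for j in range(len(X[0])):
        for thr in rng.sample([r[j] for r in X], min(10, len(X))):
            left = [v for r, v in zip(X, y) if r[j] <= thr]
            right = [v for r, v in zip(X, y) if r[j] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = avg(left), avg(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr)
    if best is None:  # no admissible split: fall back to a leaf
        m = avg(y)
        return lambda row: m
    _, j, thr = best
    lo = [(r, v) for r, v in zip(X, y) if r[j] <= thr]
    hi = [(r, v) for r, v in zip(X, y) if r[j] > thr]
    lt = fit_tree([r for r, _ in lo], [v for _, v in lo], rng, depth - 1, min_leaf)
    rt = fit_tree([r for r, _ in hi], [v for _, v in hi], rng, depth - 1, min_leaf)
    return lambda row: lt(row) if row[j] <= thr else rt(row)


def fit_forest(X, y, n_trees=15, seed=0):
    """Bagged trees: a lightweight stand-in for a random forest (assumption)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]  # bootstrap resample
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return lambda row: avg([t(row) for t in trees])


def augment(row, aux_models):
    """Original covariates followed by each auxiliary model's prediction."""
    return list(row) + [f(row) for f in aux_models]


def fit_target(X, y, aux_models, seed=0):
    """Target-site fit on augmented features, using target data only."""
    Z = [augment(r, aux_models) for r in X]
    g = fit_forest(Z, y, seed=seed)
    return lambda row: g(augment(row, aux_models))


def simulate(n, shift, rng):
    """Hypothetical site data: shared signal plus a site-specific shift."""
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [2 * a + b * b + shift + rng.gauss(0, 0.1) for a, b in X]
    return X, y


rng = random.Random(1)
aux_data = [simulate(300, s, rng) for s in (0.0, 0.5)]  # two auxiliary sites
X_t, y_t = simulate(60, 0.2, rng)                        # small target sample

# Single communication round: each auxiliary site sends only its fitted model.
aux_models = [fit_forest(X, y, seed=k) for k, (X, y) in enumerate(aux_data)]
model = fit_target(X_t, y_t, aux_models)
prediction = model([0.3, -0.4])
```

Because the target-side ensemble sees both the raw covariates and the auxiliary predictions, it can lean on an auxiliary model in regions where that model tracks the target outcome and ignore it elsewhere. The sketch makes no claim about the accuracy guarantees stated in the abstract.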

Suggested Citation

  • Xiang, Pengcheng & Zhou, Ling & Tang, Lu, 2024. "Transfer learning via random forests: A one-shot federated approach," Computational Statistics & Data Analysis, Elsevier, vol. 197(C).
  • Handle: RePEc:eee:csdana:v:197:y:2024:i:c:s0167947324000598
    DOI: 10.1016/j.csda.2024.107975

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947324000598
    Download Restriction: Full text for ScienceDirect subscribers only.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lu Lin & Feng Li, 2023. "Global debiased DC estimations for biased estimators via pro forma regression," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 32(2), pages 726-758, June.
    2. Hao Zeng & Wei Zhong & Xingbai Xu, 2024. "Transfer Learning for Spatial Autoregressive Models with Application to U.S. Presidential Election Prediction," Papers 2405.15600, arXiv.org, revised Sep 2024.
    3. Wei Wang & Shou‐En Lu & Jerry Q. Cheng & Minge Xie & John B. Kostis, 2022. "Multivariate survival analysis in big data: A divide‐and‐combine approach," Biometrics, The International Biometric Society, vol. 78(3), pages 852-866, September.
    4. George Karabatsos, 2024. "Copula Approximate Bayesian Computation Using Distribution Random Forests," Stats, MDPI, vol. 7(3), pages 1-49, September.
    5. Changgee Chang & Zhiqi Bu & Qi Long, 2023. "CEDAR: communication efficient distributed analysis for regressions," Biometrics, The International Biometric Society, vol. 79(3), pages 2357-2369, September.
    6. Zemin Zheng & Jie Zhang & Yang Li, 2022. "L0-Regularized Learning for High-Dimensional Additive Hazards Regression," INFORMS Journal on Computing, INFORMS, vol. 34(5), pages 2762-2775, September.
    7. Adel Javanmard & Jingwei Ji & Renyuan Xu, 2024. "Multi-Task Dynamic Pricing in Credit Market with Contextual Information," Papers 2410.14839, arXiv.org, revised Oct 2024.
    8. Shang-Ming Zhou & Fabiola Fernandez-Gutierrez & Jonathan Kennedy & Roxanne Cooksey & Mark Atkinson & Spiros Denaxas & Stefan Siebert & William G Dixon & Terence W O’Neill & Ernest Choy & Cathie Sudlow, 2016. "Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-14, May.
    9. Youngjoo Cho & Debashis Ghosh, 2021. "Quantile-Based Subgroup Identification for Randomized Clinical Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(1), pages 90-128, April.
    10. Xingcai Zhou & Zhaoyang Jing & Chao Huang, 2024. "Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression," Mathematics, MDPI, vol. 12(5), pages 1-53, February.
    11. Guangbao Guo & Guoqi Qian & Lu Lin & Wei Shao, 2021. "Parallel inference for big data with the group Bayesian method," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 84(2), pages 225-243, February.
    12. Ren, Yimeng & Li, Zhe & Zhu, Xuening & Gao, Yuan & Wang, Hansheng, 2024. "Distributed estimation and inference for spatial autoregression model with large scale networks," Journal of Econometrics, Elsevier, vol. 238(2).
    13. Han, Dongxiao & Huang, Jian & Lin, Yuanyuan & Shen, Guohao, 2022. "Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors," Journal of Econometrics, Elsevier, vol. 230(2), pages 416-431.
    14. Wang, Kangning & Li, Shaomin, 2021. "Robust distributed modal regression for massive data," Computational Statistics & Data Analysis, Elsevier, vol. 160(C).
    15. Christine Porzelius & Martin Schumacher & Harald Binder, 2011. "The benefit of data-based model complexity selection via prediction error curves in time-to-event data," Computational Statistics, Springer, vol. 26(2), pages 293-302, June.
    16. Foucher Yohann & Danger Richard, 2012. "Time Dependent ROC Curves for the Estimation of True Prognostic Capacity of Microarray Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(6), pages 1-22, November.
    17. J. Choi & S. Ye & K. H. Eng & K. Korthauer & W. H. Bradley & J. S. Rader & C. Kendziorski, 2017. "IPI59: An Actionable Biomarker to Improve Treatment Response in Serous Ovarian Carcinoma Patients," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 9(1), pages 1-12, June.
    18. Peter Calhoun & Melodie J. Hallett & Xiaogang Su & Guy Cafri & Richard A. Levine & Juanjuan Fan, 2020. "Random forest with acceptance–rejection trees," Computational Statistics, Springer, vol. 35(3), pages 983-999, September.
    19. Hoora Moradian & Denis Larocque & François Bellavance, 2017. "$L_1$ splitting rules in survival forests," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 23(4), pages 671-691, October.
    20. Benny Ren & Ian Barnett, 2022. "Autoregressive mixture models for clustering time series," Journal of Time Series Analysis, Wiley Blackwell, vol. 43(6), pages 918-937, November.
