Printed from https://ideas.repec.org/a/gam/jmathe/v13y2025i3p441-d1579109.html

On Data-Enriched Logistic Regression

Author

Listed:
  • Cheng Zheng

    (Department of Biostatistics, University of Nebraska Medical Center, Omaha, NE 68198, USA)

  • Sayan Dasgupta

    (Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA)

  • Yuxiang Xie

    (Department of Biostatistics, University of Washington, Seattle, WA 98195, USA)

  • Asad Haris

    (Department of Biostatistics, University of Washington, Seattle, WA 98195, USA)

  • Ying-Qing Chen

    (Department of Medicine, Stanford University, Palo Alto, CA 94305, USA)

Abstract

Biomedical researchers typically investigate the effects of specific exposures on disease risks within a well-defined population. The gold standard for such studies is to design a trial with an appropriately sampled cohort. However, because such trials are costly, the collected sample sizes are often limited, making it difficult to estimate the effects of certain exposures accurately. In this paper, we discuss how to leverage information from external “big data” (datasets with significantly larger sample sizes) to improve estimation accuracy at the risk of introducing a small amount of bias. We propose a family of weighted estimators that balance the increase in bias against the reduction in variance when incorporating the big data. We establish a connection between our proposed estimator and the well-known penalized regression estimators. We derive optimal weights using both second-order and higher-order asymptotic expansions. Through extensive simulation studies, we demonstrate that the improvement in mean squared error (MSE) for the regression coefficient can be substantial even with finite sample sizes, and that our weighted method outperforms existing approaches such as penalized regression and the James–Stein estimator. Additionally, we provide a theoretical guarantee that, in general, the proposed estimators never yield an asymptotic MSE larger than that of the maximum likelihood estimator based on the small data alone. Finally, we apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use, and mortality.
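
The bias-variance trade-off described in the abstract can be illustrated with a deliberately simplified scalar sketch. This is not the estimator proposed in the paper: it assumes a single coefficient, two independent estimates, an unbiased small-data estimate, and a known bias for the big-data estimate (in practice the bias is unknown and would have to be estimated). The function names `optimal_weight`, `combined_mse`, and `combine` are hypothetical and chosen only for this illustration.

```python
def combined_mse(w, var_small, var_big, bias_big):
    """MSE of (1 - w) * b_small + w * b_big.

    Assumes (for illustration only) that b_small is unbiased with
    variance var_small, that b_big has variance var_big and bias
    bias_big, and that the two estimates are independent.
    """
    return (1 - w) ** 2 * var_small + w ** 2 * (var_big + bias_big ** 2)


def optimal_weight(var_small, var_big, bias_big):
    """Weight on the big-data estimate that minimizes combined_mse.

    Obtained by setting the derivative of combined_mse with respect
    to w equal to zero.
    """
    return var_small / (var_small + var_big + bias_big ** 2)


def combine(b_small, b_big, w):
    """Weighted combination of the two coefficient estimates."""
    return (1 - w) * b_small + w * b_big


if __name__ == "__main__":
    # Hypothetical numbers: a noisy small-data estimate and a precise
    # but slightly biased big-data estimate.
    var_small, var_big, bias_big = 0.04, 0.0004, 0.1
    w = optimal_weight(var_small, var_big, bias_big)
    print("weight on big data:", w)
    print("MSE, small data only:", combined_mse(0.0, var_small, var_big, bias_big))
    print("MSE, weighted:", combined_mse(w, var_small, var_big, bias_big))
```

Under these assumptions the minimized MSE equals `var_small * (var_big + bias_big**2) / (var_small + var_big + bias_big**2)`, which cannot exceed `var_small`. This mirrors, in a toy setting with a known bias, the kind of guarantee the abstract states for the proposed estimators.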

Suggested Citation

  • Cheng Zheng & Sayan Dasgupta & Yuxiang Xie & Asad Haris & Ying-Qing Chen, 2025. "On Data-Enriched Logistic Regression," Mathematics, MDPI, vol. 13(3), pages 1-21, January.
  • Handle: RePEc:gam:jmathe:v:13:y:2025:i:3:p:441-:d:1579109

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/13/3/441/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/13/3/441/
    Download Restriction: no
