IDEAS home Printed from https://ideas.repec.org/a/inm/ormnsc/v67y2021i5p2964-2984.html

Predicting with Proxies: Transfer Learning in High Dimension

Author

Listed:
  • Hamsa Bastani

    (Operations Information and Decisions, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104)

Abstract

Predictive analytics is increasingly used to guide decision making in many applications. However, in practice, we often have limited data on the true predictive task of interest and must instead rely on more abundant data on a closely related proxy predictive task. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. Yet, not accounting for the bias in the proxy can lead to suboptimal decisions. Using real data sets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features d). Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare data sets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
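The abstract does not spell out the estimator, so the following is only a minimal sketch of one plausible reading of "two-step": fit least squares on the abundant proxy data, then fit an L1-penalised (lasso) correction on the scarce true data, so that the bias between the two tasks is estimated as a sparse vector. The function names, the linear model, the synthetic data, and the hand-picked penalty `lam` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=3000):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b


def two_step_estimate(X_proxy, y_proxy, X_true, y_true, lam):
    """Hypothetical two-step estimator (an assumption, not the paper's code)."""
    # Step 1: ordinary least squares on the large proxy sample.
    beta_proxy = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)[0]
    # Step 2: lasso on what the proxy model fails to explain in the true data,
    # which estimates the proxy bias as a sparse vector.
    residual = y_true - X_true @ beta_proxy
    delta = lasso_ista(X_true, residual, lam)
    return beta_proxy + delta


def run_demo(seed=0, d=50, n_proxy=5000, n_true=60, sigma=0.5, lam=0.2):
    """Compare estimation error on synthetic data with a 2-sparse proxy bias."""
    rng = np.random.default_rng(seed)
    beta_proxy = rng.normal(size=d)
    delta = np.zeros(d)
    delta[:2] = [1.0, -1.0]  # the proxy is biased in only two features
    beta_true = beta_proxy + delta

    X_proxy = rng.normal(size=(n_proxy, d))
    X_true = rng.normal(size=(n_true, d))
    y_proxy = X_proxy @ beta_proxy + sigma * rng.normal(size=n_proxy)
    y_true = X_true @ beta_true + sigma * rng.normal(size=n_true)

    estimates = {
        "proxy_only": np.linalg.lstsq(X_proxy, y_proxy, rcond=None)[0],
        "true_only": np.linalg.lstsq(X_true, y_true, rcond=None)[0],
        "two_step": two_step_estimate(X_proxy, y_proxy, X_true, y_true, lam),
    }
    # Euclidean distance of each estimate from the true coefficient vector.
    return {k: float(np.linalg.norm(b - beta_true)) for k, b in estimates.items()}


if __name__ == "__main__":
    for name, err in run_demo().items():
        print(f"{name:>10}: {err:.3f}")
```

In this synthetic setting the proxy-only fit inherits the full bias and the true-only fit is noisy because the true sample is barely larger than the number of features, which is the trade-off the abstract describes. A real application would choose `lam` by cross-validation rather than by hand.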

Suggested Citation

  • Hamsa Bastani, 2021. "Predicting with Proxies: Transfer Learning in High Dimension," Management Science, INFORMS, vol. 67(5), pages 2964-2984, May.
  • Handle: RePEc:inm:ormnsc:v:67:y:2021:i:5:p:2964-2984
    DOI: 10.1287/mnsc.2020.3729

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/mnsc.2020.3729
    Download Restriction: no

    File URL: https://libkey.io/10.1287/mnsc.2020.3729?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Sendhil Mullainathan & Ziad Obermeyer, 2017. "Does Machine Learning Automate Moral Hazard and Error?," American Economic Review, American Economic Association, vol. 107(5), pages 476-480, May.
    2. A. Belloni & D. Chen & V. Chernozhukov & C. Hansen, 2012. "Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain," Econometrica, Econometric Society, vol. 80(6), pages 2369-2429, November.
    3. Alexandre Belloni & Victor Chernozhukov & Christian Hansen, 2014. "Inference on Treatment Effects after Selection among High-Dimensional Controls," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 81(2), pages 608-650.
    4. Vivek F. Farias & Andrew A. L, 2019. "Learning Preferences with Side Information," Management Science, INFORMS, vol. 65(7), pages 3131-3149, July.
    5. Mehmet Eren Ahsen & Mehmet Ulvi Saygi Ayvaci & Srinivasan Raghunathan, 2019. "When Algorithmic Predictions Use Human-Generated Data: A Bias-Aware Classification Algorithm for Breast Cancer Diagnosis," Information Systems Research, INFORMS, vol. 30(1), pages 97-116, March.
    6. Lukas Meier & Sara Van De Geer & Peter Bühlmann, 2008. "The group lasso for logistic regression," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 70(1), pages 53-71, February.
    7. Daria Dzyabura & Srikanth Jagabathula & Eitan Muller, 2019. "Accounting for Discrepancies Between Online and Offline Product Evaluations," Marketing Science, INFORMS, vol. 38(1), pages 88-106, January.

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.


    Cited by:

    1. Maria De‐Arteaga & Stefan Feuerriegel & Maytal Saar‐Tsechansky, 2022. "Algorithmic fairness in business analytics: Directions for research and practice," Production and Operations Management, Production and Operations Management Society, vol. 31(10), pages 3749-3770, October.
    2. Xiang, Pengcheng & Zhou, Ling & Tang, Lu, 2024. "Transfer learning via random forests: A one-shot federated approach," Computational Statistics & Data Analysis, Elsevier, vol. 197(C).
    3. Hamsa Bastani & David Simchi-Levi & Ruihao Zhu, 2022. "Meta Dynamic Pricing: Transfer Learning Across Experiments," Management Science, INFORMS, vol. 68(3), pages 1865-1881, March.
    4. Brooks Oppenheimer, 2024. "Including “touch-and-feel” in online consumer research: optimizing information gain given costs of data online versus in-person," Journal of Marketing Analytics, Palgrave Macmillan, vol. 12(2), pages 411-416, June.
    5. Hao Zeng & Wei Zhong & Xingbai Xu, 2024. "Transfer Learning for Spatial Autoregressive Models with Application to U.S. Presidential Election Prediction," Papers 2405.15600, arXiv.org, revised Sep 2024.
    6. Sun, Fei & Zhang, Qi, 2023. "Robust transfer learning of high-dimensional generalized linear model," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 618(C).
    7. Singha, Sumanta & Arha, Himanshu & Kar, Arpan Kumar, 2023. "Healthcare analytics: A techno-functional perspective," Technological Forecasting and Social Change, Elsevier, vol. 197(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alexandre Belloni & Victor Chernozhukov & Ying Wei, 2016. "Post-Selection Inference for Generalized Linear Models With Many Controls," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 34(4), pages 606-619, October.
    2. Alexandre Belloni & Victor Chernozhukov & Kengo Kato, 2019. "Valid Post-Selection Inference in High-Dimensional Approximately Sparse Quantile Regression Models," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 114(526), pages 749-758, April.
    3. Everding, Jakob & Marcus, Jan, 2020. "The effect of unemployment on the smoking behavior of couples," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 29(2), pages 154-170.
    4. Bryan T. Kelly & Asaf Manela & Alan Moreira, 2019. "Text Selection," NBER Working Papers 26517, National Bureau of Economic Research, Inc.
    5. Hansen, Christian & Liao, Yuan, 2019. "The Factor-Lasso And K-Step Bootstrap Approach For Inference In High-Dimensional Economic Applications," Econometric Theory, Cambridge University Press, vol. 35(3), pages 465-509, June.
    6. Alexandre Belloni & Victor Chernozhukov & Kengo Kato, 2013. "Uniform post selection inference for LAD regression and other z-estimation problems," CeMMAP working papers CWP74/13, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    7. Frank Windmeijer & Helmut Farbmacher & Neil Davies & George Davey Smith, 2019. "On the Use of the Lasso for Instrumental Variables Estimation with Some Invalid Instruments," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 114(527), pages 1339-1350, July.
    8. Lee, Ji Hyung & Shi, Zhentao & Gao, Zhan, 2022. "On LASSO for predictive regression," Journal of Econometrics, Elsevier, vol. 229(2), pages 322-349.
    9. Philipp Bach & Victor Chernozhukov & Malte S. Kurz & Martin Spindler & Sven Klaassen, 2021. "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R," Papers 2103.09603, arXiv.org, revised Jun 2024.
    10. Brito, Igor R.S. & Oliveira, Alessandro V.M. & Dresner, Martin E., 2021. "An econometric study of the effects of airport privatization on airfares in Brazil," Transport Policy, Elsevier, vol. 114(C), pages 338-349.
    11. Santos, Luca J. & Oliveira, Alessandro V.M. & Aldrighi, Dante Mendes, 2021. "Testing the differentiated impact of the COVID-19 pandemic on air travel demand considering social inclusion," Journal of Air Transport Management, Elsevier, vol. 94(C).
    12. Agboola, Oluwagbenga David & Yu, Han, 2023. "Neighborhood-based cross fitting approach to treatment effects with high-dimensional data," Computational Statistics & Data Analysis, Elsevier, vol. 186(C).
    13. Neng-Chieh Chang, 2020. "The Mode Treatment Effect," Papers 2007.11606, arXiv.org.
    14. Franz Huber & Alan Ponce & Francesco Rentocchini & Thomas Wainwright, 2020. "The Wealth of (Open Data) Nations? Examining the interplay of open government data and country-level institutions for entrepreneurial activity at the country-level," SEEDS Working Papers 1120, SEEDS, Sustainability Environmental Economics and Dynamics Studies, revised Nov 2020.
    15. Jelena Bradic & Victor Chernozhukov & Whitney K. Newey & Yinchu Zhu, 2019. "Minimax Semiparametric Learning With Approximate Sparsity," Papers 1912.12213, arXiv.org, revised Aug 2022.
    16. Damian Kozbur, 2020. "Analysis of Testing‐Based Forward Model Selection," Econometrica, Econometric Society, vol. 88(5), pages 2147-2173, September.
    17. Damian Kozbur, 2017. "Testing-Based Forward Model Selection," American Economic Review, American Economic Association, vol. 107(5), pages 266-269, May.
    18. Luv Sharma & Aravind Chandrasekaran & Elliot Bendoly, 2020. "Does the Office of Patient Experience Matter in Improving Delivery of Care?," Production and Operations Management, Production and Operations Management Society, vol. 29(4), pages 833-855, April.
    19. Timothy B. Armstrong & Michal Kolesár & Soonwoo Kwon, 2020. "Bias-Aware Inference in Regularized Regression Models," Working Papers 2020-2, Princeton University. Economics Department.
    20. Oliveira, Alessandro V.M. & Oliveira, Bruno F. & Vassallo, Moisés D., 2023. "Airport service quality perception and flight delays: Examining the influence of psychosituational latent traits of respondents in passenger satisfaction surveys," Research in Transportation Economics, Elsevier, vol. 102(C).
