Value enhancement of reinforcement learning via efficient and robust trust region optimization

Authors

  • Shi, Chengchun
  • Qi, Zhengling
  • Wang, Jianing
  • Zhou, Fan

Abstract

Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most methods in the existing literature are developed in online settings where the data are easy to collect or simulate. Motivated by high-stakes domains such as mobile health studies, where data are limited and pre-collected, we study offline reinforcement learning methods in this article. To efficiently use these datasets for policy optimization, we propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method will output a policy whose value is no worse and often better than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method will yield a policy whose value converges to the optimal one at a faster rate than the initial policy, achieving the desired "value enhancement" property. The proposed method is generally applicable to any parameterized policy that belongs to a certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method. Supplementary materials for this article are available online.

Suggested Citation

  • Shi, Chengchun & Qi, Zhengling & Wang, Jianing & Zhou, Fan, 2023. "Value enhancement of reinforcement learning via efficient and robust trust region optimization," LSE Research Online Documents on Economics 122756, London School of Economics and Political Science, LSE Library.
  • Handle: RePEc:ehl:lserod:122756

    Download full text from publisher

    File URL: http://eprints.lse.ac.uk/122756/
    File Function: Open access version.
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mao, Lu, 2022. "Identification of the outcome distribution and sensitivity analysis under weak confounder–instrument interaction," Statistics & Probability Letters, Elsevier, vol. 189(C).
    2. Yan Liu, 2022. "Policy Learning under Endogeneity Using Instrumental Variables," Papers 2206.09883, arXiv.org, revised Mar 2024.
    3. Benjamin R. Baer & Robert L. Strawderman & Ashkan Ertefaie, 2023. "Discussion on “Instrumental variable estimation of the causal hazard ratio,” by Linbo Wang, Eric Tchetgen Tchetgen, Torben Martinussen, and Stijn Vansteelandt," Biometrics, The International Biometric Society, vol. 79(2), pages 554-558, June.
    4. Ting Ye & Ashkan Ertefaie & James Flory & Sean Hennessy & Dylan S. Small, 2023. "Instrumented difference‐in‐differences," Biometrics, The International Biometric Society, vol. 79(2), pages 569-581, June.
    5. Dingke Tang & Dehan Kong & Wenliang Pan & Linbo Wang, 2023. "Ultra‐high dimensional variable selection for doubly robust causal inference," Biometrics, The International Biometric Society, vol. 79(2), pages 903-914, June.
    6. Shaojie Wei & Chao Zhang & Zhi Geng & Shanshan Luo, 2024. "Identifiability and Estimation for Potential-Outcome Means with Misclassified Outcomes," Mathematics, MDPI, vol. 12(18), pages 1-19, September.
    7. Abhinandan Dalal & Patrick Blobaum & Shiva Kasiviswanathan & Aaditya Ramdas, 2024. "Anytime-Valid Inference for Double/Debiased Machine Learning of Causal Parameters," Papers 2408.09598, arXiv.org, revised Sep 2024.
    8. Linbo Wang & Eric Tchetgen Tchetgen & Torben Martinussen & Stijn Vansteelandt, 2023. "Instrumental variable estimation of the causal hazard ratio," Biometrics, The International Biometric Society, vol. 79(2), pages 539-550, June.
    9. Shixiao Zhang & Peisong Han & Changbao Wu, 2023. "Calibration Techniques Encompassing Survey Sampling, Missing Data Analysis and Causal Inference," International Statistical Review, International Statistical Institute, vol. 91(2), pages 165-192, August.
    10. Hongming Pu & Bo Zhang, 2021. "Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(2), pages 318-345, April.
    11. Martin Emil Jakobsen & Jonas Peters, 2020. "Distributional robustness of K-class estimators and the PULSE," Papers 2005.03353, arXiv.org, revised Mar 2022.
    12. Choi, Jin-young & Lee, Goeun & Lee, Myoung-jae, 2023. "Endogenous treatment effect for any response conditional on control propensity score," Statistics & Probability Letters, Elsevier, vol. 196(C).
    13. Cui, Yifan & Tchetgen Tchetgen, Eric, 2021. "On a necessary and sufficient identification condition of optimal treatment regimes with an instrumental variable," Statistics & Probability Letters, Elsevier, vol. 178(C).
    14. Linbo Wang & Eric Tchetgen Tchetgen & Torben Martinussen & Stijn Vansteelandt, 2023. "Rejoinder to discussions on “Instrumental variable estimation of the causal hazard ratio”," Biometrics, The International Biometric Society, vol. 79(2), pages 564-568, June.
    15. Zhichao Jiang & Shu Yang & Peng Ding, 2022. "Multiply robust estimation of causal effects under principal ignorability," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(4), pages 1423-1445, September.
    16. Myoung‐jae Lee, 2021. "Instrument residual estimator for any response variable with endogenous binary treatment," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(3), pages 612-635, July.
    17. Shuxiao Chen & Bo Zhang, 2021. "Estimating and Improving Dynamic Treatment Regimes With a Time-Varying Instrumental Variable," Papers 2104.07822, arXiv.org.
    18. Haoyu Wei & Hengrui Cai & Chengchun Shi & Rui Song, 2024. "On Efficient Inference of Causal Effects with Multiple Mediators," Papers 2401.05517, arXiv.org.

    More about this item

    Keywords

    mobile health studies; offline reinforcement learning; semi-parametric efficiency; trust region optimization;

    JEL classification:

    • C1 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General
