Reliable Off-Policy Evaluation for Reinforcement Learning

Author

Listed:
  • Jie Wang

    (School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China)

  • Rui Gao

    (Department of Information, Risk and Operations Management, The University of Texas at Austin, Austin, Texas 78705)

  • Hongyuan Zha

    (School of Data Science, Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen 518172, China)

Abstract

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated by a different behavior policy, without executing the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings because of safety or ethical concerns or because exploration is not possible. Hence, it is imperative to quantify the uncertainty of the off-policy estimate before deploying the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using data from one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that, with proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with nonasymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results are also generalized to batch reinforcement learning and are supported by empirical analysis.

Suggested Citation

  • Jie Wang & Rui Gao & Hongyuan Zha, 2024. "Reliable Off-Policy Evaluation for Reinforcement Learning," Operations Research, INFORMS, vol. 72(2), pages 699-716, March.
  • Handle: RePEc:inm:oropre:v:72:y:2024:i:2:p:699-716
    DOI: 10.1287/opre.2022.2382

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/opre.2022.2382
    Download Restriction: no

