Printed from https://ideas.repec.org/p/arx/papers/2111.11232.html

Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Author

Listed:
  • Yanwei Jia
  • Xun Yu Zhou

Abstract

We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and hence is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then enforced by stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations in two concrete examples.
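The online scheme the abstract describes — turning the first-order condition of the policy gradient into martingale orthogonality conditions handled by stochastic approximation — can be sketched on a toy problem. The following is an illustrative sketch only, not the authors' implementation: the 1-D linear-quadratic dynamics, the quadratic critic ansatz, the Gaussian policy with a fixed variance (standing in for the entropy-regularization temperature of the exploratory formulation), and all learning rates are assumptions.

```python
import numpy as np

# Hypothetical sketch: online actor-critic on a toy 1-D LQ problem,
#   dX_t = a_t dt + dW_t,  running reward r(x, a) = -(x^2 + a^2),
# with discount rate beta. All names and values are illustrative.
rng = np.random.default_rng(0)
dt, steps = 0.01, 100
beta = 1.0                 # discount rate
theta = np.zeros(2)        # critic: V_theta(x) = theta[0] + theta[1] * x^2
phi = 0.0                  # actor: Gaussian policy a ~ N(phi * x, sigma_pol^2)
sigma_pol = 0.5            # fixed exploration level (a simplification)
alpha_c, alpha_a = 0.01, 0.005

for episode in range(200):
    x = rng.normal()
    for _ in range(steps):
        a = phi * x + sigma_pol * rng.normal()           # exploratory action
        x_next = x + a * dt + np.sqrt(dt) * rng.normal()

        # Martingale (TD) residual: dV + (r - beta * V) dt should be a
        # martingale increment under the true value function.
        V = theta[0] + theta[1] * x**2
        V_next = theta[0] + theta[1] * x_next**2
        delta = V_next - V + (-(x**2 + a**2) - beta * V) * dt

        # Critic: stochastic approximation with test functions dV/dtheta,
        # enforcing the orthogonality conditions for policy evaluation.
        theta += alpha_c * delta * np.array([1.0, x**2])

        # Actor: test function is the score d(log pi)/d(phi), turning the
        # policy gradient's first-order condition into an orthogonality
        # condition that is updated online.
        score = (a - phi * x) * x / sigma_pol**2
        phi += alpha_a * delta * score

        x = x_next
```

For this toy problem the learned feedback gain `phi` is expected to drift toward a negative (stabilizing) value; the sketch makes no claim of reproducing the paper's algorithms or their convergence guarantees.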

Suggested Citation

  • Yanwei Jia & Xun Yu Zhou, 2021. "Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms," Papers 2111.11232, arXiv.org, revised Jul 2022.
  • Handle: RePEc:arx:papers:2111.11232

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2111.11232
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. R. H. Strotz, 1955. "Myopia and Inconsistency in Dynamic Utility Maximization," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 23(3), pages 165-180.
    2. Duan Li & Wan‐Lung Ng, 2000. "Optimal Dynamic Portfolio Selection: Multiperiod Mean‐Variance Formulation," Mathematical Finance, Wiley Blackwell, vol. 10(3), pages 387-406, July.
    3. Min Dai & Hanqing Jin & Steven Kou & Yuhong Xu, 2021. "A Dynamic Mean-Variance Analysis for Log Returns," Management Science, INFORMS, vol. 67(2), pages 1093-1108, February.
    4. Suleyman Basak & Georgy Chabakauri, 2010. "Dynamic Mean-Variance Asset Allocation," The Review of Financial Studies, Society for Financial Studies, vol. 23(8), pages 2970-3016, August.
    5. David Silver & Julian Schrittwieser & Karen Simonyan & Ioannis Antonoglou & Aja Huang & Arthur Guez & Thomas Hubert & Lucas Baker & Matthew Lai & Adrian Bolton & Yutian Chen & Timothy Lillicrap & Fan , 2017. "Mastering the game of Go without human knowledge," Nature, Nature, vol. 550(7676), pages 354-359, October.
    6. Yanwei Jia & Xun Yu Zhou, 2021. "Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach," Papers 2108.06655, arXiv.org, revised Feb 2022.

    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Zhou Fang, 2023. "Continuous-Time Path-Dependent Exploratory Mean-Variance Portfolio Construction," Papers 2303.02298, arXiv.org.
    2. Wu, Bo & Li, Lingfei, 2024. "Reinforcement learning for continuous-time mean-variance portfolio selection in a regime-switching market," Journal of Economic Dynamics and Control, Elsevier, vol. 158(C).
    3. Yanwei Jia, 2024. "Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty," Papers 2404.12598, arXiv.org.
    4. Jodi Dianetti & Giorgio Ferrari & Renyuan Xu, 2024. "Exploratory Optimal Stopping: A Singular Control Formulation," Papers 2408.09335, arXiv.org, revised Oct 2024.
    5. Xiangyu Cui & Xun Li & Yun Shi & Si Zhao, 2023. "Discrete-Time Mean-Variance Strategy Based on Reinforcement Learning," Papers 2312.15385, arXiv.org.
    6. Zhou Fang & Haiqing Xu, 2023. "Market Making of Options via Reinforcement Learning," Papers 2307.01814, arXiv.org.
    7. Min Dai & Yu Sun & Zuo Quan Xu & Xun Yu Zhou, 2024. "Learning to Optimally Stop Diffusion Processes, with Financial Applications," Papers 2408.09242, arXiv.org, revised Sep 2024.
    8. Zhou Fang & Haiqing Xu, 2023. "Over-the-Counter Market Making via Reinforcement Learning," Papers 2307.01816, arXiv.org.
    9. Yanwei Jia & Xun Yu Zhou, 2022. "q-Learning in Continuous Time," Papers 2207.00713, arXiv.org, revised Apr 2023.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. De Gennaro Aquino, Luca & Sornette, Didier & Strub, Moris S., 2023. "Portfolio selection with exploration of new investment assets," European Journal of Operational Research, Elsevier, vol. 310(2), pages 773-792.
    2. Xiang Meng, 2019. "Dynamic Mean-Variance Portfolio Optimisation," Papers 1907.03093, arXiv.org.
    3. Xiangyu Cui & Xun Li & Duan Li & Yun Shi, 2014. "Time Consistent Behavior Portfolio Policy for Dynamic Mean-Variance Formulation," Papers 1408.6070, arXiv.org, revised Aug 2015.
    4. Ben Hambly & Renyuan Xu & Huining Yang, 2021. "Recent Advances in Reinforcement Learning in Finance," Papers 2112.04553, arXiv.org, revised Feb 2023.
    5. Li, Yongwu & Li, Zhongfei, 2013. "Optimal time-consistent investment and reinsurance strategies for mean–variance insurers with state dependent risk aversion," Insurance: Mathematics and Economics, Elsevier, vol. 53(1), pages 86-97.
    6. Felix Fießinger & Mitja Stadje, 2023. "Time-Consistent Asset Allocation for Risk Measures in a Lévy Market," Papers 2305.09471, arXiv.org, revised Oct 2024.
    7. Xue Dong He & Xun Yu Zhou, 2021. "Who Are I: Time Inconsistency and Intrapersonal Conflict and Reconciliation," Papers 2105.01829, arXiv.org.
    8. Agostino Capponi & Sveinn Ólafsson & Thaleia Zariphopoulou, 2022. "Personalized Robo-Advising: Enhancing Investment Through Client Interaction," Management Science, INFORMS, vol. 68(4), pages 2485-2512, April.
    9. Ma, Shuai & Ma, Xiaoteng & Xia, Li, 2023. "A unified algorithm framework for mean-variance optimization in discounted Markov decision processes," European Journal of Operational Research, Elsevier, vol. 311(3), pages 1057-1067.
    10. Dong-Mei Zhu & Jia-Wen Gu & Feng-Hui Yu & Tak-Kuen Siu & Wai-Ki Ching, 2021. "Optimal pairs trading with dynamic mean-variance objective," Mathematical Methods of Operations Research, Springer;Gesellschaft für Operations Research (GOR);Nederlands Genootschap voor Besliskunde (NGB), vol. 94(1), pages 145-168, August.
    11. Tomas Björk & Agatha Murgoci & Xun Yu Zhou, 2014. "Mean–Variance Portfolio Optimization With State-Dependent Risk Aversion," Mathematical Finance, Wiley Blackwell, vol. 24(1), pages 1-24, January.
    12. Keffert, Henk, 2024. "Robo-advising: Optimal investment with mismeasured and unstable risk preferences," European Journal of Operational Research, Elsevier, vol. 315(1), pages 378-392.
    13. Wei, Jiaqin & Wang, Tianxiao, 2017. "Time-consistent mean–variance asset–liability management with random coefficients," Insurance: Mathematics and Economics, Elsevier, vol. 77(C), pages 84-96.
    14. Yuchen Li & Zongxia Liang & Shunzhi Pang, 2022. "Continuous-Time Monotone Mean-Variance Portfolio Selection in Jump-Diffusion Model," Papers 2211.12168, arXiv.org, revised May 2024.
    15. Chi Kin Lam & Yuhong Xu & Guosheng Yin, 2016. "Dynamic portfolio selection without risk-free assets," Papers 1602.04975, arXiv.org.
    16. Zhang, Jingong & Tan, Ken Seng & Weng, Chengguo, 2017. "Optimal hedging with basis risk under mean–variance criterion," Insurance: Mathematics and Economics, Elsevier, vol. 75(C), pages 1-15.
    17. Zhou Fang, 2023. "Continuous-Time Path-Dependent Exploratory Mean-Variance Portfolio Construction," Papers 2303.02298, arXiv.org.
    18. Luca De Gennaro Aquino & Sascha Desmettre & Yevhen Havrylenko & Mogens Steffensen, 2024. "Equilibrium control theory for Kihlstrom-Mirman preferences in continuous time," Papers 2407.16525, arXiv.org, revised Oct 2024.
    19. Fahrenwaldt, Matthias Albrecht & Jensen, Ninna Reitzel & Steffensen, Mogens, 2020. "Nonrecursive separation of risk and time preferences," Journal of Mathematical Economics, Elsevier, vol. 90(C), pages 95-108.
    20. Liyuan Wang & Zhiping Chen, 2019. "Stochastic Game Theoretic Formulation for a Multi-Period DC Pension Plan with State-Dependent Risk Aversion," Mathematics, MDPI, vol. 7(1), pages 1-16, January.


    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.