Average-reward off-policy policy evaluation with function approximation

Zhang, S; Wan, Y; Sutton, RS; Whiteson, S

Conference item

Average-reward off-policy policy evaluation with function approximation

Abstract:: We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

Publication status:: Published

Peer review status:: Peer reviewed

Actions

Email

Email this record

Send the bibliographic details of this record to your email address.

Your Email
Please enter the email address that the record information will be sent to.

-
Your message (optional)
Please add any additional information to be included within the email.
Cite

Cite this record

APA Style

Zhang, S., Wan, Y., Sutton, R. S., & Whiteson, S. (2021). Average-reward off-policy policy evaluation with function approximation. Proceedings of the 38th International Conference on Machine Learning, 139, 12578–12588.

MLA Style

Zhang, S., et al. “Average-Reward off-Policy Policy Evaluation with Function Approximation.” Proceedings of the 38th International Conference on Machine Learning, vol. 139, PMLR, 2021, pp. 12578–88.

Chicago Style

Zhang, S, Y Wan, RS Sutton, and S Whiteson. 2021. “Average-Reward off-Policy Policy Evaluation with Function Approximation.” In Proceedings of the 38th International Conference on Machine Learning, 139:12578–88. Proceedings of Machine Learning Research. PMLR.
Share
Print

Access Document

Files:: supplementary-material.pdf

(Preview, Supplementary materials, 951.6KB, Terms of use)

Zhang-et-al-2021-Average-reward-off-policy--.pdf

(Preview, Version of record, 882.8KB, Terms of use)

Publication website:: http://proceedings.mlr.press/v139/zhang21u.html

Authors

+ Zhang, S More by this author

Role:: Author

+ Wan, Y More by this author

Role:: Author

+ Sutton, RS More by this author

Role:: Author

+ Whiteson, S More by this author

Institution:: University of Oxford
Department:: COMPUTER SCIENCE
Sub department:: Computer Science
Oxford college:: St Catherines College; St Catherines College; St Catherines College; St Catherines College; St Catherines College; St Catherines College; St Catherines College; St Catherines College; St Catherines College
Role:: Author

+ European Commission More from this funder

Grant:: 637713

Publisher:: PMLR
Host title:: Proceedings of the 38th International Conference on Machine Learning
Volume:: 139
Pages:: 12578-12588
Series:: Proceedings of Machine Learning Research
Publication date:: 2021-07-21
Acceptance date:: 2021-05-08
Event title:: 38th International Conference on Machine Learning (ICML 2021)
Event location:: Virtual Event
Event website:: https://icml.cc/
Event start date:: 2021-07-18
Event end date:: 2021-07-24
ISSN:: 2640-3498

Language:: English
Keywords:: FFR
Pubs id:: 1187447
Local pid:: pubs:1187447
Deposit date:: 2021-07-24

Terms of use

Copyright holder:: Zhang et al.

Licence:: CC Attribution (CC BY)

Views and Downloads

About views and downloads

If you are the owner of this record, you can report an update to it here: Report update to this record

Conference item

Average-reward off-policy policy evaluation with function approximation

Actions

Access Document

Authors

Terms of use

Views and Downloads

Altmetrics

Dimensions

Conference item

Average-reward off-policy policy evaluation with function approximation

Actions

Access Document

Authors

Funding

Bibliographic Details

Item Description

Terms of use

Metrics

Views and Downloads

Altmetrics

Dimensions