
Thesis

Breaking the deadly triad in reinforcement learning

Abstract:

Reinforcement Learning (RL) is a promising framework for solving, via trial and error, sequential decision-making problems that arise from agent-environment interactions. Off-policy learning is one of the most important techniques in RL: it enables an RL agent to learn from agent-environment interactions generated by a policy (i.e., a decision-making rule that an agent relies on to interact with the environment) that is different from the policy of interest. Arguably, this flexibility is key to applying RL to real-world problems. Off-policy learning, however, often destabilizes RL algorithms when combined with function approximation (i.e., using a parameterized function to represent quantities of interest) and bootstrapping (i.e., recursively constructing a learning target for an estimator by using the estimator itself), two arguably indispensable ingredients for large-scale RL applications. This instability, resulting from the combination of off-policy learning, function approximation, and bootstrapping, is the notorious deadly triad in RL.
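To make the triad concrete, the following minimal Python sketch (illustrative only, not taken from the thesis) shows off-policy semi-gradient TD(0) with linear function approximation diverging. It uses the classic two-state "w, 2w" construction, in which the behavior policy keeps updating a transition that the target policy would rarely generate.

# Deadly-triad sketch: a single shared weight w, with v(s1) = w and
# v(s2) = 2*w (linear function approximation). The behavior policy
# updates only the s1 -> s2 transition (reward 0), far more often than
# the target policy's state distribution would justify (off-policy).
gamma = 0.99   # discount factor used in the bootstrapped target
alpha = 0.1    # step size
w = 1.0        # shared weight
for step in range(20):
    v_s1, v_s2 = w, 2.0 * w                 # current value estimates
    td_error = 0.0 + gamma * v_s2 - v_s1    # bootstrapped target minus estimate
    w += alpha * td_error * 1.0             # semi-gradient update at s1 (feature = 1)
print(w)  # w grows geometrically; with more steps it diverges

Because the TD error equals (2*gamma - 1) * w, the weight is scaled up on every update whenever gamma > 0.5, which is exactly the instability the deadly triad refers to.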

In this thesis, we propose several novel RL algorithms that theoretically address the deadly triad. The proposed algorithms cover a wide range of RL settings (e.g., both prediction and control, both value-based and policy-based methods, both discounted and average-reward performance metrics). By contrast, existing methods address this issue in only a few RL settings, and even in those settings our methods exhibit several advantages over existing ones, e.g., reduced variance and improved asymptotic performance guarantees. These improvements are made possible by several advanced tools (e.g., target networks, differential value functions, density ratios, and truncated followon traces). Importantly, the proposed algorithms remain fully incremental and computationally efficient, making them readily applicable to large-scale RL applications.
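As an illustration of one of the tools mentioned above, the sketch below (our own simplification, not the thesis's algorithms) shows the standard target-network idea: the bootstrapped target is computed from a slowly synchronized copy of the parameters, which weakens the feedback loop between the estimator and its own target.

import numpy as np

alpha, gamma, sync_every = 0.1, 0.99, 100
w = np.zeros(4)        # online weights of a linear value function
w_target = w.copy()    # frozen copy used only to form bootstrapped targets

def td_update(phi, r, phi_next, step):
    """One TD(0) step whose target is computed with the target-network weights."""
    global w, w_target
    target = r + gamma * float(w_target @ phi_next)    # target from the frozen copy
    w = w + alpha * (target - float(w @ phi)) * phi    # semi-gradient step on online weights
    if step % sync_every == 0:
        w_target = w.copy()                            # periodic synchronization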

Besides the theoretical contributions to breaking the deadly triad, we also make empirical contributions by introducing a bi-directional target network that scales up residual algorithms, a family of RL algorithms that break the deadly triad in some restricted settings.
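For context, a classic residual-gradient update (Baird, 1995) looks roughly as follows. This sketch only illustrates the residual family referred to above; it is not the bi-directional target network proposed in the thesis.

import numpy as np

alpha, gamma = 0.1, 0.99
w = np.zeros(4)   # weights of a linear value function

def residual_gradient_update(phi, r, phi_next):
    """Descend the squared Bellman residual, differentiating through both
    the prediction (w @ phi) and the bootstrapped term (w @ phi_next)."""
    global w
    delta = r + gamma * float(w @ phi_next) - float(w @ phi)  # Bellman residual
    w = w - alpha * delta * (gamma * phi_next - phi)          # full-gradient step

Differentiating through the bootstrapped term is what distinguishes residual algorithms from the semi-gradient update shown earlier and is the source of their stability in restricted settings.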

Authors


Division: MPLS
Department: Computer Science
Role: Author

Contributors

Institution: University of Oxford
Division: MPLS
Department: Computer Science
Role: Supervisor

Institution: University of Oxford
Role: Examiner

Institution: Stanford University
Role: Examiner


Funder identifier: http://dx.doi.org/10.13039/501100000266
Funding agency for: Zhang, S


DOI:
Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford


Language: English
Keywords:
Subjects:
Deposit date: 2022-07-18
