
Conference item

About time: model-free reinforcement learning with timed reward machines

Abstract:

Reward specification plays a central role in reinforcement learning (RL), guiding the agent’s behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata and employ counterfactual-imagining heuristics that exploit the TRM’s structure to improve search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks.
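As a rough illustration of the idea of a reward machine with timing constraints, the following minimal sketch models a single-clock machine under integer (digital) time. It is not the paper's construction: the names Edge, TimedRewardMachine, delay_cost, run and the labels "coffee" and "office" are all hypothetical, and the guard format (an inclusive clock interval per edge) is an assumption made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str    # machine state the edge leaves
    label: str     # event observed in the environment
    lo: int        # inclusive lower bound on the clock value
    hi: float      # inclusive upper bound on the clock value (may be math.inf)
    reward: float  # reward emitted when the edge fires
    reset: bool    # whether the clock returns to zero after firing
    target: str    # machine state the edge enters


class TimedRewardMachine:
    """Hypothetical single-clock machine; one call to step() is one time unit."""

    def __init__(self, initial, edges, delay_cost=0.0):
        self.initial = initial
        self.edges = list(edges)
        self.delay_cost = delay_cost  # assumed per-tick cost for elapsed time
        self.state = initial
        self.clock = 0

    def reset(self):
        self.state, self.clock = self.initial, 0

    def step(self, label):
        # Let one time unit pass, charge the delay cost, then fire the
        # first edge whose label and clock guard are both satisfied.
        self.clock += 1
        reward = -self.delay_cost
        for e in self.edges:
            if (e.source == self.state and e.label == label
                    and e.lo <= self.clock <= e.hi):
                reward += e.reward
                self.state = e.target
                if e.reset:
                    self.clock = 0
                break
        return reward


def run(machine, labels):
    """Reset the machine and return the reward emitted at each step."""
    machine.reset()
    return [machine.step(label) for label in labels]


# Toy task: pick up coffee, then reach the office within 3 ticks for the
# full reward; arriving later still finishes the task but pays much less.
edges = [
    Edge("start", "coffee", 0, math.inf, 0.0, True, "carrying"),
    Edge("carrying", "office", 0, 3, 10.0, True, "done"),
    Edge("carrying", "office", 4, math.inf, 1.0, True, "done"),
]
trm = TimedRewardMachine("start", edges, delay_cost=0.5)

timely = run(trm, ["coffee", "none", "office"])
late = run(trm, ["coffee", "none", "none", "none", "none", "office"])
```

In a tabular Q-learning setting, one plausible way to use such a machine would be to index the Q-table by the triple (environment state, machine state, clock value), so that the otherwise history-dependent reward becomes Markovian over the product; whether this matches the abstractions used in the paper is not stated in the abstract.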


Publication status:
Accepted
Peer review status:
Peer reviewed


Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Author
Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Oxford college:
Trinity College
Role:
Author
ORCID:
0000-0003-4137-8862
Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Author


Acceptance date:
2026-04-30
Event title:
35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
Event location:
Bremen, Germany
Event website:
https://2026.ijcai.org/
Event start date:
2026-08-15
Event end date:
2026-08-21


Language:
English
Pubs id:
2421110
Local pid:
pubs:2421110
Deposit date:
2026-05-18
