Conference item
About time: model-free reinforcement learning with timed reward machines
- Abstract:
-
Reward specification plays a central role in reinforcement learning (RL), guiding the agent’s behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata and employ counterfactual-imagining heuristics that exploit the TRM’s structure to improve search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks.
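As a rough illustration of the idea in the abstract, the following is a minimal sketch of a reward machine with a single integer-valued clock, loosely in the spirit of the digital-time reading. It is not the paper's definition or algorithm: the class names (`Edge`, `TimedRewardMachine`), the helper `run`, the labels, the time bounds, and the reward values are all hypothetical and chosen only to show a reward for a timely event and a cost for each tick of delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical TRM transition: fires on `label` when lo <= clock <= hi."""
    target: str
    label: str
    lo: int
    hi: int
    reward: float
    reset: bool = True  # reset the clock to 0 when the edge fires


class TimedRewardMachine:
    """Toy reward machine with one integer clock (assumed digital-time view)."""

    def __init__(self, edges, initial, delay_cost=0.0):
        self.edges = edges            # dict: machine state -> list of Edge
        self.initial = initial
        self.delay_cost = delay_cost  # cost charged for a tick with no edge

    def step(self, state, clock, label):
        """Let one time unit pass, then read `label`.

        Returns (next_state, next_clock, reward).
        """
        clock += 1
        for edge in self.edges.get(state, []):
            if edge.label == label and edge.lo <= clock <= edge.hi:
                return edge.target, (0 if edge.reset else clock), edge.reward
        # No enabled edge: stay put and pay the per-tick delay cost.
        return state, clock, -self.delay_cost


def build_example():
    # Assumed task: observing "goal" within 3 ticks pays 10, later pays 1.
    edges = {
        "u0": [
            Edge("done", "goal", 0, 3, 10.0),
            Edge("done", "goal", 4, 10**9, 1.0),
        ]
    }
    return TimedRewardMachine(edges, "u0", delay_cost=0.5)


def run(trm, labels):
    """Feed a sequence of labels and return (state, clock, total reward)."""
    state, clock, total = trm.initial, 0, 0.0
    for label in labels:
        state, clock, reward = trm.step(state, clock, label)
        total += reward
    return state, clock, total


if __name__ == "__main__":
    trm = build_example()
    print(run(trm, ["none", "none", "goal"]))   # timely arrival
    print(run(trm, ["none"] * 4 + ["goal"]))    # late arrival
```

Under these assumptions, a tabular learner could index its Q-table by the triple (environment state, machine state, clock value), so that the otherwise history-dependent timed reward becomes a function of the current table key. How the paper actually abstracts clocks, particularly under real-time semantics, is not specified in this record.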
- Publication status:
- Accepted
- Peer review status:
- Peer reviewed
- Acceptance date:
- 2026-04-30
- Event title:
- 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
- Event location:
- Bremen, Germany
- Event website:
- https://2026.ijcai.org/
- Event start date:
- 2026-08-15
- Event end date:
- 2026-08-21
- Language:
-
English
- Pubs id:
-
2421110
- Local pid:
-
pubs:2421110
- Deposit date:
-
2026-05-18
- ARK identifier:
- Notes:
- This conference paper has been accepted for presentation at the 35th International Joint Conference on Artificial Intelligence.