Thesis
Safe and certified reinforcement learning with logical constraints
- Abstract:
- Reinforcement Learning (RL) is a widely employed machine learning paradigm that has been applied to a variety of decision-making problems, from resource management to robot locomotion, from recommendation systems to systems biology, and from traffic control to superhuman performance in video games. However, RL has seen limited success beyond rigidly controlled or constrained applications, and its successful deployment in safety-critical scenarios is yet to be achieved. A principal reason for this limitation is the lack of a systematic formal approach to specifying requirements as tasks (or learning constraints), and to providing guarantees with respect to these requirements and constraints, both during and after learning. This work addresses these issues by proposing a general method that leverages the success of RL in learning high-performance controllers, while developing the assumptions and theory required to satisfy formal requirements and to guide the learning process within safe configurations. Given the expressive power of Linear Temporal Logic (LTL) in formulating control specifications, we propose the first model-free RL scheme to synthesise policies for unknown, black-box Markov Decision Processes (MDPs), where the reward function is formally shaped by an LTL specification. We convert the given property into a Limit-Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Further, by exploiting the structure of the LDBA, we shape a synchronous reward function on the fly, and we discuss the assumptions under which RL is guaranteed to synthesise control policies whose traces satisfy the LTL specification with the maximum possible probability. When the state space of the MDP is finite, this probability (a certificate) can be computed in parallel with policy learning: the RL algorithm thus produces a policy that is certified with respect to the property. We also show that our method produces "best available" control policies when the logical property cannot be satisfied. The performance of the proposed method is evaluated on a set of numerical examples and benchmarks, where we observe an improvement of one order of magnitude in the number of iterations required for policy synthesis, compared to existing approaches where available.
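The core construction sketched in the abstract (learning on the product of an MDP and an LDBA, with a reward issued synchronously when the automaton's accepting condition is met) can be illustrated with a small toy example. The sketch below is a hedged illustration, not the thesis's actual implementation: the automaton `ToyLDBA` is hard-coded for a single property, and the names `labeller`, `r_accept`, and the one-dimensional corridor dynamics are hypothetical stand-ins.

```python
# Minimal sketch of LDBA-shaped Q-learning, under the assumptions noted above.
import random
from collections import defaultdict

class ToyLDBA:
    """One-state Büchi automaton for the LTL property "G F goal":
    the sole state self-loops on every label, and the transition is
    accepting exactly when the label contains 'goal'."""
    initial = 0
    def step(self, q, label):
        return 0  # single state, self-loop on all labels
    def is_accepting(self, q, label):
        return "goal" in label

def q_learning_product(n_cells=6, episodes=3000, horizon=50,
                       gamma=0.99, alpha=0.1, eps=0.1, r_accept=1.0):
    """Q-learning on the product state (MDP cell, LDBA state) for a
    1-D corridor whose last cell is labelled 'goal'. The agent is
    rewarded only through the automaton's accepting condition, a
    simplified stand-in for the on-the-fly shaped reward."""
    actions = (-1, +1)
    ldba = ToyLDBA()
    labeller = lambda s: {"goal"} if s == n_cells - 1 else set()
    Q = defaultdict(float)
    for _ in range(episodes):
        s, q = 0, ldba.initial              # initial product state
        for _ in range(horizon):
            if random.random() < eps:       # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, q, b)])
            s2 = min(max(s + a, 0), n_cells - 1)  # deterministic corridor move
            label = labeller(s2)
            r = r_accept if ldba.is_accepting(q, label) else 0.0
            q2 = ldba.step(q, label)
            best = max(Q[(s2, q2, b)] for b in actions)
            Q[(s, q, a)] += alpha * (r + gamma * best - Q[(s, q, a)])
            s, q = s2, q2
    return Q

if __name__ == "__main__":
    Q = q_learning_product()
    # The greedy action in the initial product state should point right,
    # towards the 'goal'-labelled cell.
    print(max((-1, +1), key=lambda a: Q[(0, 0, a)]))
```

In the full method described in the abstract, the automaton is derived automatically from an arbitrary LTL specification and the reward is shaped over its accepting components; the toy automaton here fixes a single property purely for brevity.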
Authors
+ Hasanbeig, MH
Contributors
+ Abate, A
- Division:
- MPLS
- Department:
- Computer Science
- Sub department:
- Computer Science
- Research group:
- OxCAV
- Oxford college:
- St Hugh's College
- Role:
- Supervisor
+ Kroening, D
- Division:
- MPLS
- Department:
- Computer Science
- Sub department:
- Computer Science
- Oxford college:
- Magdalen College
- Role:
- Supervisor
- ORCID:
- 0000-0002-6681-5283
+ HICLASS
- Funding agency for:
- Kroening, D
- Grant:
- 113213
- Programme:
- Aerospace Technology Institute (ATI), Department for Business, Energy & Industrial Strategy (BEIS) and Innovate UK
+ UK National Cyber Security Centre
- Funding agency for:
- Kroening, D
- Programme:
- UK NCSC
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2021-07-12
Terms of use
- Copyright holder:
- Hasanbeig, MH
- Copyright date:
- 2020