CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy.
This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman’s optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.
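To illustrate the convergence argument the paper formalizes, here is a minimal numeric sketch of value iteration on a tiny Markov decision process. The MDP (two states, two actions, the particular transition probabilities and rewards) is invented for illustration and does not come from the paper; the point is that the Bellman optimality operator is a γ-contraction in the sup norm, so iterating it converges to the unique optimal value function.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (hypothetical data, not from CertRL).
# P[a][s][s'] = probability of moving to s' when taking action a in state s.
# R[s][a]    = expected immediate reward for taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor; makes the Bellman operator a gamma-contraction

def bellman_opt(V):
    """(T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s') ]."""
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q-values, shape (states, actions)
    return Q.max(axis=1)

# Value iteration: repeatedly apply T; the contraction property guarantees
# convergence to the unique fixed point V* (the optimal value function).
V = np.zeros(2)
for _ in range(1000):
    V_next = bellman_opt(V)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

# A greedy policy with respect to V* is optimal.
policy = np.argmax(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
```

The contraction property itself can be checked directly: for any two value functions `V1`, `V2`, the sup-norm distance between `T V1` and `T V2` is at most `gamma` times the distance between `V1` and `V2`, which is exactly the fact the Coq proof exploits via metric coinduction.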
Tue 19 Jan (times shown in Central European Time)
18:00 - 18:30
A Formal Proof of PAC Learnability for Decision Stumps
Joseph Tassarotti (Boston College), Koundinya Vajjha (University of Pittsburgh), Anindya Banerjee (IMDEA Software Institute), Jean-Baptiste Tristan (Boston College)
CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Koundinya Vajjha (University of Pittsburgh), Avraham Shinnar (IBM Research), Barry Trager (IBM Research), Vasily Pestun (IBM Research; IHES), Nathan Fulton (IBM Research)