Bias Optimality for Multichain Markov Decision Processes
Abstract
In recent research, we found that the policy iteration algorithm for Markov decision processes (MDPs) is a natural consequence of the performance difference formula, which quantifies the difference between the performances of any two policies. In this paper, we extend this idea to the bias-optimal policy of MDPs. We first derive a formula that compares the biases of any two policies with the same gain, and then we show that a policy iteration algorithm leading to a bias-optimal policy follows naturally from this bias difference formula. Our results extend those in (Lewis & Puterman, 2001) to the multichain case and provide a simple and intuitive explanation for the mathematics in (Veinott, 1966; Veinott, 1969). The results also confirm the idea that solutions to performance (including bias) optimality problems can be obtained from performance sensitivity formulas.
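For illustration, the performance difference formula referred to above can be sketched in the ergodic (single recurrent class) case; the notation here ($P$, $f$, $\pi$, $\eta$, $g$ for a policy's transition matrix, reward vector, steady-state distribution, gain, and potential, with primes for a second policy) is assumed for this sketch, and the multichain versions developed in the paper are more involved:

\[
  \eta' - \eta \;=\; \pi'\bigl[(f' + P'g) - (f + Pg)\bigr],
\]

where $g$ solves the Poisson equation $(I - P)g = f - \eta e$, with $e$ the all-ones vector. Policy iteration follows by choosing, in each state, an action that maximizes $f + Pg$, since any positive component of the bracketed term raises the gain. Likewise, if two policies have the same gain ($\eta' = \eta$), subtracting their Poisson equations yields

\[
  (I - P')(g' - g) \;=\; (f' + P'g) - (f + Pg),
\]

the single-chain ancestor of the bias difference formula from which the bias-optimal policy iteration in the paper is obtained.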