Bias Optimality for Multichain Markov Decision Processes
Abstract
In recent research, we found that the policy iteration algorithm for Markov decision processes (MDPs) is a natural consequence of the performance difference formula, which quantifies the difference between the performances of any two policies. In this paper, we extend this idea to the bias-optimal policy of MDPs. We first derive a formula that compares the biases of any two policies with the same gain, and then we show that a policy iteration algorithm leading to a bias-optimal policy follows naturally from this bias difference formula. Our results extend those in (Lewis & Puterman, 2001) to the multichain case and provide a simple and intuitive explanation for the mathematics in (Veinott, 1966; Veinott, 1969). The results also confirm the idea that solutions to performance (including bias) optimality problems can be obtained from performance sensitivity formulas.
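For illustration, the performance difference formula referred to above can be sketched in the ergodic (single recurrent class) case; the notation here ($P$, $f$, $\pi$, $\eta$, $g$ for a policy's transition matrix, reward vector, steady-state distribution, gain, and potential, with primes for a second policy) is assumed for this sketch, and the multichain versions developed in the paper are more involved:

\[
  \eta' - \eta \;=\; \pi'\bigl[(f' + P'g) - (f + Pg)\bigr],
\]

where $g$ solves the Poisson equation $(I - P)g = f - \eta e$, with $e$ the all-ones vector. Policy iteration follows by choosing, in each state, an action that maximizes $f + Pg$, since any positive component of the bracketed term raises the gain. Likewise, if two policies have the same gain ($\eta' = \eta$), subtracting their Poisson equations yields

\[
  (I - P')(g' - g) \;=\; (f' + P'g) - (f + Pg),
\]

the single-chain ancestor of the bias difference formula from which the bias-optimal policy iteration in the paper is obtained.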