The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

Yinyu Ye
Stanford University

We prove that the classic policy-iteration method (Howard 1960), including the
Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is
a strongly polynomial-time algorithm for solving the Markov decision problem (MDP)
with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration
method (including the Simplex method) is superior to that of the only known
strongly polynomial-time interior-point algorithm (Ye 2005) for solving this problem.
The result is surprising, since the Simplex method with the same pivoting rule was
shown to be exponential for solving a general linear programming (LP) problem;
the Simplex (or simple policy-iteration) method with the smallest-index pivoting
rule was shown to be exponential for solving an MDP regardless of the discount rate;
and the policy-iteration method was recently shown to be exponential for solving an
undiscounted MDP. We also extend the result to solving MDPs with sub-stochastic
and transient state transition probability matrices.
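
To make the setting concrete, the following is a minimal sketch of Howard's policy-iteration method for a discounted MDP, the algorithm the result concerns. The tabular arrays P and r, the function name policy_iteration, and the tie-breaking by argmax are illustrative assumptions, not anything specified in the paper; the all-states greedy update shown is Howard's rule, whereas the Simplex method with the most-negative-reduced-cost pivot would instead switch the action in a single state per iteration.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9, max_iter=10_000):
    """Howard's policy iteration for a discounted MDP (illustrative sketch).

    P: array of shape (A, S, S); P[a, s, t] = probability of moving
       from state s to state t under action a (rows sum to 1).
    r: array of shape (A, S); r[a, s] = immediate reward of action a in s.
    gamma: fixed discount rate in [0, 1).
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)          # start from an arbitrary policy
    v = np.zeros(S)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[pi, np.arange(S), :]    # (S, S) transition matrix of pi
        r_pi = r[pi, np.arange(S)]       # (S,) reward vector of pi
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy update in every state at once.
        q = r + gamma * (P @ v)          # (A, S) action values
        new_pi = q.argmax(axis=0)
        if np.array_equal(new_pi, pi):   # no improving action: pi is optimal
            return pi, v
        pi = new_pi
    return pi, v

if __name__ == "__main__":
    # Tiny two-state, two-action example with made-up numbers.
    P = np.array([[[0.9, 0.1],
                   [0.2, 0.8]],
                  [[0.5, 0.5],
                   [0.7, 0.3]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    pi, v = policy_iteration(P, r, gamma=0.9)
    print("optimal policy:", pi, "values:", v)
```

Each iteration of this loop is one policy-evaluation solve plus one improvement step; the paper's result is that, for any fixed discount rate gamma < 1, the number of such iterations is bounded by a polynomial in the numbers of states and actions alone.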

