To compensate for the estimation error of the optimal Q-values and thus discover better policies,
we propose a provably efficient and parallelizable planning algorithm
and derive the condition under which the search-based optimal Q-values enjoy a tighter error upper bound than the TD-learning-based optimal Q-values.
Planning facilitates optimal inference during evaluation and enables sample-efficient transfer to novel games.
We first model the process of finding optimal actions within the imagined Markov Decision Process as a tree search problem,
and then extend beam search into a practical and parallelizable planning algorithm.
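A minimal Python sketch of such a beam-search planner over a learned (imagined) MDP is given below; the model interface (reward, next_state, value, num_actions), the leaf-value bootstrapping, and the greedy trajectory selection are illustrative assumptions rather than the paper's exact formulation.

```python
def beam_search_plan(model, root_state, beam_width=2, horizon=2, gamma=0.99):
    """Sketch of beam-search planning in an imagined MDP (hypothetical interface).

    Assumed model methods (for illustration only):
      model.reward(s, a)     -> predicted immediate reward
      model.next_state(s, a) -> predicted next (latent) state
      model.value(s)         -> predicted state value used to bootstrap at leaves
      model.num_actions      -> size of the discrete action space
    """
    # Each beam entry: (discounted return so far, current state, first action taken)
    beams = [(0.0, root_state, None)]

    for depth in range(horizon):
        candidates = []
        for ret, state, first_action in beams:
            # All beam_width * num_actions expansions at this depth are independent,
            # so they can be batched through the model and evaluated in parallel.
            for a in range(model.num_actions):
                r = model.reward(state, a)
                s_next = model.next_state(state, a)
                candidates.append((ret + (gamma ** depth) * r,
                                   s_next,
                                   a if first_action is None else first_action))
        # Rank partial trajectories by return-so-far plus a bootstrapped leaf value,
        # and keep only the best K of them (the beam width).
        candidates.sort(
            key=lambda c: c[0] + (gamma ** (depth + 1)) * model.value(c[1]),
            reverse=True,
        )
        beams = candidates[:beam_width]

    # The search-based estimate of the optimal Q-value at the root is the best
    # leaf score; acting greedily executes that trajectory's first action.
    best_return, best_leaf, best_action = beams[0]
    q_search = best_return + (gamma ** horizon) * model.value(best_leaf)
    return best_action, q_search
```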
When both the beam width K and the horizon H are equal to 2, the planning process is illustrated in the figure on the right: