Ensemble methods for reinforcement learning have gained attention in recent years, due to their ability to represent model uncertainty and use it to guide exploration and to reduce value estimation bias. We present MeanQ, a very simple ensemble method with record-setting performance, and show how it reduces estimation variance enough to operate without a stabilizing target network. Curiously, MeanQ is theoretically *almost* equivalent to a non-ensemble state-of-the-art method that it significantly outperforms, raising questions about the interaction between uncertainty estimation, representation, and resampling.
In adversarial environments, where a second agent attempts to minimize the first’s rewards, double-oracle (DO) methods grow a population of policies for both agents by iteratively adding the best response to the current population. DO algorithms are guaranteed to converge when they exhaust all policies, but are only efficient when they find a small population that induces a good agent. We present XDO, a DO algorithm that exploits the game’s sequential structure to exponentially reduce the worst-case population size. Curiously, the small population size more than compensates for the algorithm’s increased complexity per iteration.
Roy Fox is an Assistant Professor and director of the Intelligent Dynamics Lab at the Department of Computer Science at UCI. His research interests include theory and applications of reinforcement learning, algorithmic game theory, information theory, and robotics. His current research focuses on structure, exploration, and optimization in deep reinforcement learning and imitation learning of virtual and physical agents and multi-agent systems. He was previously a postdoc at UC Berkeley, where he developed algorithms and systems that interact with humans to learn structured control policies for robotics and program synthesis.