Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Friday, November 2, 2018, 11:11 am - 12:00 pm PDT
10th floor conference room (1016)
This event is open to the public.
AI Seminar
Quanquan Gu, UCLA

Adaptive gradient methods such as Adam, which use historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks. How to close this generalization gap of adaptive gradient methods remains an open problem. In this talk, I will introduce a new algorithm, the Partially adaptive momentum estimation method (Padam), which unifies Adam/AMSGrad with SGD to achieve the best of both worlds. We prove that, for smooth nonconvex functions, Padam is guaranteed to converge to a stationary point. Our theoretical result suggests that, in order to achieve a faster convergence rate, it is necessary to use Padam instead of Adam/AMSGrad. Experiments on image classification benchmarks show that Padam maintains a convergence rate as fast as Adam/AMSGrad while generalizing as well as SGD with momentum when training deep neural networks. This suggests that practitioners can pick up adaptive gradient methods once again for faster training of deep neural networks.
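As a rough illustration of the idea described in the abstract, the sketch below implements an AMSGrad-style update in which the second-moment term is raised to a partial power p in (0, 1/2]: p = 1/2 recovers AMSGrad, while p approaching 0 approaches SGD with momentum. This is an assumption-laden sketch based only on the abstract, not the authors' reference implementation; the function name `padam_step` and all hyperparameter values here are illustrative.

```python
import numpy as np

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One partially adaptive update (illustrative sketch, not the
    authors' code): AMSGrad-style moment estimates, but the second
    moment enters the denominator raised to a partial power p."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    v_hat = np.maximum(v_hat, v)              # AMSGrad max correction
    theta = theta - lr * m / (v_hat ** p + eps)  # partially adaptive step
    return theta, (m, v, v_hat)

# Usage: minimize the toy objective f(x) = x^2 from x = 5.
theta = np.array([5.0])
state = (np.zeros(1), np.zeros(1), np.zeros(1))
for _ in range(200):
    grad = 2 * theta
    theta, state = padam_step(theta, grad, state)
```

With p = 1/2 this reduces to the familiar Adam/AMSGrad denominator sqrt(v_hat); the talk's claim is that intermediate values of p keep the fast early progress of the adaptive denominator while behaving more like SGD with momentum, which tends to generalize better.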

Bio: Quanquan Gu is an Assistant Professor of Computer Science at UCLA. His current research is in the area of artificial intelligence and machine learning, with a focus on large-scale nonconvex optimization for machine learning and high-dimensional statistical inference. Dr. Gu is a recipient of the NSF CAREER Award (2017) and the Yahoo! Academic Career Enhancement Award (2015). He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014.
