The next milestone for machine learning is the ability to train on massively large datasets. However, the de facto method used for training neural networks is stochastic gradient descent, which is not amenable to scaling without expensive hyper-parameter tuning. One approach to address the challenge of large scale training is to use large mini-batch sizes, which allows parallel training. However, large batch size training with SGD often results in models with poor generalization performance and poor robustness. The methods proposed so far to address this only work for special cases, and often require hyper-parameter tuning themselves.
Here, we will introduce a novel Hessian-based method which, in combination with robust optimization, avoids many of the aforementioned issues. Extensive testing of the method on different neural networks and datasets shows significant improvements over the state of the art.
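The abstract does not spell out the method's details, but Hessian-based approaches of this kind typically rely on matrix-free curvature estimates, most commonly the largest Hessian eigenvalue obtained by power iteration on Hessian-vector products. The sketch below is an illustrative example only, not the method described in the talk: it runs power iteration on the (known) Hessian of a toy quadratic loss, where the Hessian-vector product is exact.

```python
import numpy as np

# Illustrative sketch (not the talk's method): estimate the largest
# Hessian eigenvalue of a toy quadratic loss f(w) = 0.5 * w^T A w
# via power iteration on Hessian-vector products.

def hessian_vector_product(A, v):
    # For the quadratic loss the Hessian is A itself, so the product
    # is simply A @ v. For a neural network this would instead be
    # computed with automatic differentiation (Pearlmutter's trick),
    # without ever forming the Hessian explicitly.
    return A @ v

def top_eigenvalue(A, num_iters=100, seed=0):
    # Power iteration: repeatedly apply the Hessian to a random unit
    # vector; the iterate converges to the dominant eigenvector, and
    # the Rayleigh quotient gives the corresponding eigenvalue.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        hv = hessian_vector_product(A, v)
        v = hv / np.linalg.norm(hv)
    return v @ hessian_vector_product(A, v)

# Toy Hessian with known eigenvalues 1, 2, and 5.
A = np.diag([1.0, 2.0, 5.0])
print(round(top_eigenvalue(A), 4))  # ≈ 5.0
```

Such a spectral estimate can then drive training decisions, e.g. choosing how aggressively to grow the batch size in flatter regions of the loss landscape.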