Optimization is a key component of a large class of machine learning techniques. Much of the state of practice in optimization for machine learning relies on first order (gradient-based) methods. However, recent theoretical results by others and us, have shown that second order methods (that use the Hessian, in addition to the gradient) provide a number of desirable characteristics, and that their computational cost can be significantly reduced by suitably approximating the Hessian. Building on these theoretical results, we present robust high-performance implementations that yield important benefits in the context of training commonly used deep network models on parallel and distributed computing platforms. Our approximate Newton methods rely on sampling to approximate the Hessian-vector product for computing the inverse of the Hessian, ADMM for parallel/ distributed implementation, and a number of lower-level algorithmic optimizations for efficient use of GPUs. We show that our solvers retain desirable convergence properties of the full Newton method, low per-iteration cost of gradient-based methods, and robustness to problem ill conditioning and hyper-parameter tuning. We demonstrate that our algorithms and implementations significantly outperform existing solutions in terms of robustness, computational cost, and parallel performance.
Back to Workshop III: HPC for Computationally and Data-Intensive Problems