Consider the distributed stochastic optimization problem where each of several machines has access to samples from the same source distribution, and the goal is to jointly optimize the expected objective w.r.t. the source distribution while minimizing: (1) overall runtime; (2) communication costs; (3) the number of samples used. Stochastic Gradient Descent (SGD) might seem ill-suited to the distributed setting due to its sequential nature. I'll discuss the use of mini-batches for parallelizing SGD and its variants, and what this implies about limited-communication distributed stochastic optimization. I'll contrast the resulting communication, runtime, and sample costs with those of other distributed optimization algorithms, including a novel Newton-like approach.
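The mini-batch parallelization idea mentioned above can be sketched as follows. This is a minimal illustration, not the speaker's algorithm: it assumes a synchronous setup in which each of `num_machines` machines averages stochastic gradients over a local mini-batch, the machine averages are combined (one communication round), and a single gradient step is taken. The names `grad`, `sample`, and all parameters are hypothetical placeholders.

```python
import numpy as np

def minibatch_sgd(grad, w0, sample, num_machines=4, local_batch=8,
                  num_rounds=100, lr=0.1):
    """Synchronous mini-batch SGD: one communication round per update.

    grad(w, z)  -- stochastic gradient of the loss at w on sample z
    sample()    -- draws one sample from the source distribution
    Effective batch size per step is num_machines * local_batch.
    """
    w = w0.copy()
    for _ in range(num_rounds):
        machine_grads = []
        for _ in range(num_machines):  # in practice, these run in parallel
            batch = [sample() for _ in range(local_batch)]
            machine_grads.append(np.mean([grad(w, z) for z in batch], axis=0))
        # Average across machines (the only communication), then step.
        w -= lr * np.mean(machine_grads, axis=0)
    return w
```

For example, minimizing E[(w - z)^2 / 2] over z ~ N(3, 1) with `grad(w, z) = w - z` drives `w` toward the distribution mean, 3, while each step uses a large averaged batch whose reduced gradient variance is what makes the parallelization pay off.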
Based on joint work with Avleen Bijral, Andy Cotter, Peter Richtarik, Ohad Shamir, Karthik Sridharan, Martin Takac, and Tong Zhang.
Back to Stochastic Gradient Methods