How much data is sufficient to learn high-performing algorithms?

Ellen Vitercik
Carnegie Mellon University

Algorithms often have tunable parameters that have a considerable impact on their runtime and solution quality. A growing body of research has demonstrated that data-driven algorithm design can lead to significant gains in runtime and solution quality. Data-driven algorithm design uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set. We provide a broadly applicable theory for deriving generalization guarantees for data-driven algorithm design, which bound the difference between the algorithm's expected performance and its average performance over the training set. The challenge is that for many combinatorial algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause a cascade of changes in the algorithm’s behavior. Prior research has proved generalization bounds by employing case-by-case analyses of parameterized greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms. We uncover a unifying structure which we use to prove extremely general guarantees, yet we recover the bounds from prior research. Our guarantees apply whenever an algorithm's performance is a piecewise-constant, -linear, or—more generally—piecewise-structured function of its parameters. As we demonstrate, our theory also implies novel bounds for dynamic programming algorithms used in computational biology.

Presentation (PDF File)

Back to Long Programs