Scalable algorithms for optimal training data for Bayesian inference of large scale models

Omar Ghattas
University of Texas at Austin

Bayesian inference provides a systematic and rational framework for
learning complex physical models from complex observational data. In
the resulting Bayesian inverse problem (BIP), we employ observational
data and a model mapping parameters to observations to infer unknown
parameters and quantify their uncertainty, in the form of a posterior
probability distribution. But this begs the question: how do we choose
the data in the first place? The problem of choosing the optimal
training data for a BIP can be formulated as the following optimal
experimental design (OED) problem: optimize the design of a data
acquisition system (sensor locations, what is measured and how often,
etc.) to minimize some measure of the uncertainty in the inferred
parameters.

This constitutes a challenge of the highest order. First, the
parameters to be inferred in the inverse problem are often infinite
dimensional spatially correlated fields, whose discretization yields
high dimensional parameter spaces that can number in the
millions. Second, the physical models to be learned are often complex
PDE systems that are computationally expensive to solve. Third, the
BIP---difficult as it is---is merely an inner problem within the outer
OED problem, to be solved at each optimization iteration. And fourth,
efficient optimization algorithms for the OED problem require
gradients of the OED objective function with respect to the design
parameters, meaning that we must differentiate through the BIP
solution.

An attractive objective function for OED is the expected information
gain (EIG), which can be understood as the expected Kullback--Leibler
divergence between prior and posterior. Naive evaluation of EIG is
intractable for large-scale problems due to the large number of
samples required in double-loop Monte Carlo. Here we invoke an
approximation of the EIG based on the Laplace approximation of the
posterior. Each evaluation of the objective function then requires
computing a sample average approximation (over possible realizations
of the data) of the information gain (IG) between the Laplace
approximation and the Gaussian prior distribution. Since the IG
involves the KL divergence between two Gaussians, it can be expressed
in closed form, in particular in terms of the log-determinant and
trace of the prior-preconditioned posterior covariance operator. A
randomized eigensolver allows us to estimate these invariants at a
cost---measured in number of forward PDE solves---that is independent
of the dimension of the parameter space, resulting in scalable
evaluation of the OED objective. Variational adjoint methods are then
used to efficiently compute the gradient of the OED objective function
with respect to the experimental design variables, thus leading to an
efficient and scalable Laplace-approximation based OED
method. Moreover, the resulting method inherits the fine-grained
parallel efficiency of the forward PDE solver, despite being tightly
coupled. We present numerical results demonstrating the scalability
and efficiency of this approach. In particular, the results
demonstrate that the number of required PDE solves is independent of
the parameter dimension, and only weakly dependent on the data
dimension.

This work is joint with Umberto Villa.


Back to Workshop I: Big Data Meets Large-Scale Computing