Modeling Molecular and Macromolecular Acidity with Quantitative Property-Activity Relations and Machine Learning

Paul Ayers
McMaster University

Thermodynamic and kinetic quantities such as equilibrium and rate constants are
exquisitely sensitive to errors in the relative free energies of the chemical species involved.
For example, experimental measurements of acidity in molecules and proteins are
often accurate to 0.1 pKa unit; in order to compute pKa’s to this accuracy, one must
compute the Gibbs free energy of deprotonation to within 0.5 kJ/mol (2 × 10⁻⁴ a.u.).
This level of computational accuracy is inaccessible for small molecules and
unfathomable for large ones.
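
The arithmetic behind this tolerance can be checked with a short sketch. It uses the standard relation ΔG = RT ln(10) · pKa and assumes a temperature of 298.15 K, which the abstract does not state; the function and constant names are illustrative only.

```python
import math

R = 8.314462618e-3                  # gas constant in kJ/(mol K)
T_ASSUMED = 298.15                  # assumed temperature in K (not stated in the abstract)
KJ_PER_MOL_PER_HARTREE = 2625.4996  # 1 a.u. of energy expressed in kJ/mol


def pka_error_to_free_energy(delta_pka, temperature=T_ASSUMED):
    """Free-energy error in kJ/mol that corresponds to a given pKa error."""
    return R * temperature * math.log(10) * delta_pka


dg_kj = pka_error_to_free_energy(0.1)    # tolerance for 0.1 pKa unit
dg_au = dg_kj / KJ_PER_MOL_PER_HARTREE   # same tolerance in atomic units
print(f"{dg_kj:.2f} kJ/mol = {dg_au:.1e} a.u.")
```

At this temperature one pKa unit corresponds to about 5.7 kJ/mol, so a 0.1 unit target leaves a free-energy budget of slightly more than half a kJ/mol, in line with the figures quoted above.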
Fortunately, the errors in computational models tend to be systematic. This opens
the possibility of correcting computational errors using statistical methods. This talk
will show how a hybrid approach, wherein computational models are
reparameterized to agree with experimental data, can provide computational
models for pKa’s that approach experimental accuracy. The primary tool is
multiple regression (with great care taken to avoid overfitting); the residual errors
can then be removed, at least in part, using a machine learning method called Gaussian
process regression (kriging). These methods are applied to a diverse set of acids,
including molecular acids (carboxylic acids, alcohols, and amines) and proteins.
Experimental accuracy remains out of reach, but root-mean-square errors
significantly below 1 pKa unit are attainable.
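
The two-stage idea described above (a regression fit followed by Gaussian process regression on the residuals) can be sketched as follows. This is a minimal illustration and not the speaker's actual model: the data are synthetic, a single hypothetical descriptor stands in for the several a multiple regression would use, and the kernel hyperparameters are fixed by hand rather than optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT from the talk): one computed descriptor per acid
# and "experimental" pKa values carrying a smooth systematic deviation plus noise.
x = np.linspace(0.0, 10.0, 40)
pka_exp = 1.5 + 0.8 * x + 0.5 * np.sin(x) + rng.normal(0.0, 0.05, x.size)

# Hold out every other point so the error is measured on unseen data.
train = np.arange(x.size) % 2 == 0
test = ~train

# Stage 1: least-squares linear regression of pKa on the descriptor.
design = np.column_stack([np.ones(train.sum()), x[train]])
coef, *_ = np.linalg.lstsq(design, pka_exp[train], rcond=None)


def linear(xq):
    """Regression prediction for query descriptors xq."""
    return coef[0] + coef[1] * xq


residual = pka_exp[train] - linear(x[train])


# Stage 2: Gaussian process regression on the residuals (squared-exponential kernel).
def rbf(a, b, length=1.0, amplitude=0.5):
    """Squared-exponential kernel; hyperparameters are hand-picked assumptions."""
    diff = a[:, None] - b[None, :]
    return amplitude**2 * np.exp(-0.5 * (diff / length) ** 2)


noise = 0.05  # assumed noise standard deviation
gram = rbf(x[train], x[train]) + noise**2 * np.eye(train.sum())
alpha = np.linalg.solve(gram, residual)


def gp_correction(xq):
    """Posterior mean of the residual model at query descriptors xq."""
    return rbf(xq, x[train]) @ alpha


def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


rmse_lin = rmse(linear(x[test]), pka_exp[test])
rmse_gp = rmse(linear(x[test]) + gp_correction(x[test]), pka_exp[test])
print(f"regression only: {rmse_lin:.3f}  regression + GP: {rmse_gp:.3f}")
```

Evaluating on held-out points is one simple guard against overfitting; in practice the kernel hyperparameters would usually be chosen by maximizing the marginal likelihood or by cross-validation rather than set by hand.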
At the end of the talk, I will discuss our future plans, including our early attempts to
select a drug molecule that binds to an important carcinogenic protein receptor.
Once again, the accuracy that is required for this application seems to exceed what
we can achieve with straightforward computational methods, and statistical and/or
machine-learning approaches are needed.

Workshop IV: Physical Frameworks for Sampling Chemical Compound Space