Subgroup Discovery for Assessing the Domain of Applicability of Machine Learning Models

Chris Sutton
Fritz-Haber-Institut der Max-Planck-Gesellschaft

Advances in artificial intelligence (AI) are having a large impact on materials science, chemical engineering, and computational chemistry. One promising application of AI is the rapid screening of large chemical spaces using statistical estimates of a given property (typically computed using electronic-structure calculations). This application requires accurate machine learning (ML) models, and accuracy is usually improved by increasing the size of the training set. Under this strategy, the prediction accuracy of an ML model is generally assessed using only a single summary statistic (e.g., root mean squared error). However, applying ML to materials or molecular discovery requires high accuracy for a relatively small set of top performers, which an average error does not guarantee. Here I will discuss a procedure that uses subgroup discovery for differentiated assessment and uncertainty quantification of ML models. This approach identifies a simple set of interpretable conditions under which the ML model performs substantially better than the average error indicates (i.e., a domain of applicability). I anticipate that this kind of analysis will be useful not only for advancing the development of ML representations in materials science but also for other scientific disciplines that have begun to leverage ML. This work was carried out in collaboration with Mario Boley, Luca M. Ghiringhelli, Matthias Rupp, Jilles Vreeken, and Matthias Scheffler.
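
As a rough illustration of the idea of a domain of applicability, the following is a minimal sketch and not the procedure presented in the talk. It assumes per-sample absolute errors of some already-trained model are available, and it searches exhaustively over conjunctions of simple threshold conditions on the input features. The function name `find_applicability_domain`, the quantile-based thresholds, the synthetic data, and the quality function (coverage times relative error reduction) are all assumptions made for this example.

```python
import itertools
import numpy as np


def find_applicability_domain(X, abs_err, names, n_thresholds=5, max_terms=2):
    """Search conjunctions of threshold conditions for a low-error subgroup.

    Quality = coverage * relative reduction of the mean absolute error.
    This objective is an assumption chosen for illustration.
    """
    n, d = X.shape
    global_err = abs_err.mean()

    # Candidate conditions: "feature <= q" and "feature > q" at a few quantiles.
    conditions = []
    for j in range(d):
        for q in np.quantile(X[:, j], np.linspace(0.2, 0.8, n_thresholds)):
            conditions.append((f"{names[j]} <= {q:.2f}", X[:, j] <= q))
            conditions.append((f"{names[j]} > {q:.2f}", X[:, j] > q))

    # Start from the trivial subgroup (all data), which has quality 0.
    best = ("(all data)", np.ones(n, dtype=bool), 0.0)
    for k in range(1, max_terms + 1):
        for combo in itertools.combinations(conditions, k):
            mask = np.logical_and.reduce([m for _, m in combo])
            if not mask.any():
                continue  # empty subgroup, nothing to evaluate
            coverage = mask.mean()
            reduction = (global_err - abs_err[mask].mean()) / global_err
            quality = coverage * reduction
            if quality > best[2]:
                best = (" AND ".join(s for s, _ in combo), mask, quality)
    return best


# Synthetic stand-in for model errors: small when x0 <= 0, large otherwise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
abs_err = np.where(X[:, 0] <= 0, 0.1, 1.0) * np.abs(rng.normal(size=400))

rule, mask, quality = find_applicability_domain(X, abs_err, ["x0", "x1"])
print("rule:", rule)
print("coverage:", mask.mean())
print("mean error inside / overall:", abs_err[mask].mean(), abs_err.mean())
```

The product of coverage and error reduction is used here so that the search does not simply return a tiny subgroup with near-zero error; other trade-offs between subgroup size and error are possible.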

Workshop IV: Using Physical Insights for Machine Learning