Learning descriptors from materials-science (big) data

Matthias Scheffler
Fritz-Haber-Institut der Max-Planck-Gesellschaft

Scientific discoveries often proceed from the accumulation of consistent data to the identification of functional dependencies among the data, i.e., a model that is able to predict yet unseen phenomena. Ultimately, a theory may be constructed to explain the model with a few simple principles. Classical examples are (i) Kepler's three laws, which were found empirically from the known data on the solar system and later justified by Newton's theory of gravitation, and (ii) Mendeleev's periodic table, which was constructed empirically from data on the chemistry of the known elements and later justified by the atomic theory within quantum mechanics.

In recent decades, statistical learning has been developed to find optimal and stable functional dependencies among data, in particular when some ancillary knowledge can be formalized and included in the search for optimal solutions.

We present a recently introduced compressed-sensing-based methodology, and its latest extension, for the identification of functional dependencies where the descriptor (the set of input variables of the functional dependence) is selected from a dictionary of "well-formed" candidate analytical expressions. These candidates are constructed as non-linear functions of a set of basic, "physically meaningful" features, called primary features.
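
As a rough illustration of the descriptor-selection idea, the sketch below builds a small dictionary of candidate expressions from two primary features and picks the single candidate whose linear fit to the target has the smallest squared error. It is a toy under stated assumptions: the feature names `a` and `b`, the operator set, the synthetic data, and all helper names are hypothetical, and the exhaustive one-term search merely stands in for the compressed-sensing solver, which is not reproduced here.

```python
import itertools


def build_candidates(primary):
    """Combine primary features into candidate expressions.

    primary: dict mapping a feature name to a list of values.
    Returns a dict mapping an expression string to its values.
    The operator set (+, |-|, *, ^2) is an illustrative assumption.
    """
    candidates = dict(primary)
    for (na, a), (nb, b) in itertools.combinations(primary.items(), 2):
        candidates[f"({na}+{nb})"] = [x + y for x, y in zip(a, b)]
        candidates[f"|{na}-{nb}|"] = [abs(x - y) for x, y in zip(a, b)]
        candidates[f"({na}*{nb})"] = [x * y for x, y in zip(a, b)]
    for name, values in primary.items():
        candidates[f"{name}^2"] = [x * x for x in values]
    return candidates


def fit_residual(x, y):
    """Sum of squared residuals of the least-squares fit y ~ c0 + c1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:  # constant candidate: only the intercept can be fitted
        return sum((yi - mean_y) ** 2 for yi in y)
    c1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    c0 = mean_y - c1 * mean_x
    return sum((yi - c0 - c1 * xi) ** 2 for xi, yi in zip(x, y))


def select_descriptor(candidates, target):
    """Return the name of the single candidate with the smallest residual."""
    return min(candidates, key=lambda name: fit_residual(candidates[name], target))


# Synthetic example: the target depends on the product of the two features.
primary = {"a": [1, 2, 3, 4, 5], "b": [2, 1, 5, 3, 7]}
target = [3 * x * y + 1 for x, y in zip(primary["a"], primary["b"])]
best = select_descriptor(build_candidates(primary), target)
print(best)
```

A search over single candidates is only feasible for tiny dictionaries; the point of a sparsity-promoting method is to make the selection tractable when the dictionary is very large and the descriptor has several components.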

Furthermore, we present a complementary method, called subgroup discovery (SGD), designed for constructing statements, in the form of Boolean (true/false) expressions, about an optimal subset of candidate functions of primary features.
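
To make the notion of a Boolean statement selecting a subgroup concrete, here is a minimal sketch. It assumes threshold propositions of the form `feature <= t` or `feature > t`, conjunctions of at most two propositions, and a quality function equal to the subgroup's coverage times the shift of its mean target value from the global mean. These choices, the feature names `x` and `y`, and the data are illustrative assumptions, not the formulation used in the work presented.

```python
import itertools


def propositions(rows, features):
    """Candidate true/false statements, with thresholds at midpoints of observed values."""
    props = []
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            props.append((f"{f} <= {t}", lambda r, f=f, t=t: r[f] <= t))
            props.append((f"{f} > {t}", lambda r, f=f, t=t: r[f] > t))
    return props


def quality(members, rows, target):
    """Coverage times the mean-target shift relative to the whole data set."""
    if not members:
        return float("-inf")
    global_mean = sum(r[target] for r in rows) / len(rows)
    sub_mean = sum(r[target] for r in members) / len(members)
    return (len(members) / len(rows)) * (sub_mean - global_mean)


def best_subgroup(rows, features, target, max_terms=2):
    """Exhaustively search conjunctions of up to max_terms propositions."""
    props = propositions(rows, features)
    best = (float("-inf"), None, [])
    for k in range(1, max_terms + 1):
        for combo in itertools.combinations(props, k):
            members = [r for r in rows if all(test(r) for _, test in combo)]
            q = quality(members, rows, target)
            if q > best[0]:
                label = " AND ".join(name for name, _ in combo)
                best = (q, label, members)
    return best


# Hypothetical data: the target is 1 only when both x and y are large.
rows = [
    {"x": 1, "y": 1, "t": 0},
    {"x": 1, "y": 3, "t": 0},
    {"x": 3, "y": 1, "t": 0},
    {"x": 3, "y": 3, "t": 1},
    {"x": 3, "y": 3, "t": 1},
    {"x": 1, "y": 1, "t": 0},
]
score, statement, members = best_subgroup(rows, ["x", "y"], "t")
print(statement, len(members))
```

The coverage factor in the quality function keeps the search from favoring tiny subgroups with extreme target values; how strongly to weight it is a design choice.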

Results from the application of both methods are presented for crystal-structure prediction of binary materials and (for SGD only) for the identification of relationships between electronic- and atomic-structure properties of metal nanoclusters.

This is joint work with Luca Ghiringelli.

