When More Data Do Not Provide A Better Description

Matthias Scheffler
Fritz-Haber-Institut der Max-Planck-Gesellschaft

The discovery of improved and novel -- not just new -- materials or unknown properties of known materials to meet specific scientific or industrial requirements is one of the most exciting and economically important applications of high-performance computing (HPC). Ab initio computational materials science enables the modeling of materials, both existing materials and those that can be created in the future, at the electronic and atomic levels. This also allows for the accurate prediction of how these materials will behave at the microscopic and macroscopic levels, and an understanding of their suitability for specific research and commercial applications.

The number of possible materials, considering structure and chemical composition, is practically infinite. However, the number of materials that exhibit a given function is rather small, i.e., the space of chemical and structural compounds is sparsely populated. Thus, addressing the urgent problems in materials science, e.g., finding a better catalyst for CO2 chemistry (turning a greenhouse gas into valuable chemicals and fuels), is like searching for needles in a haystack. Improving the description of the hay, or adding even more hay, will obviously not help to find the needles.

This problem has been addressed in recent years by developing compressed-sensing [1-3] and subgroup-discovery [4] approaches for materials science, in which the big-data challenge lies less in the hay than in the identification of appropriate descriptors of the needles. The SISSO approach (sure independence screening and sparsifying operator) [3] addresses this challenge by identifying the best low-dimensional descriptor in an immensity of offered candidates (trillions of possible descriptors). The talk describes this methodology and the remaining challenges.
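The two ingredients named in the acronym can be illustrated with a toy sketch. The code below is not the SISSO implementation of ref. 3; it is a minimal illustration under stated assumptions. The function names (build_candidates, sis, fit, sisso_sketch), the small operator set, the synthetic data, and the subspace size k are all hypothetical choices made for this example. Candidate descriptors are built from three primary features, a screening step keeps the candidates most correlated with the target (or with the current residual), and an exhaustive search over small subsets of the kept candidates picks the best low-dimensional linear model.

```python
import itertools
import numpy as np


def build_candidates(primary, names):
    """Combine primary features with a few operators (illustrative set)."""
    feats = {n: primary[:, i] for i, n in enumerate(names)}
    base = list(feats.items())
    for (a, xa), (b, xb) in itertools.combinations(base, 2):
        feats[f"({a}*{b})"] = xa * xb
        feats[f"({a}/{b})"] = xa / xb
        feats[f"|{a}-{b}|"] = np.abs(xa - xb)
    for a, xa in base:
        feats[f"({a})^2"] = xa ** 2
    return list(feats), np.column_stack(list(feats.values()))


def sis(X, target, k):
    """Screening: indices of the k features most correlated with target."""
    Xc = X - X.mean(axis=0)
    tc = target - target.mean()
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = np.inf  # constant features get score 0
    scores = np.abs(Xc.T @ tc) / norms
    return [int(j) for j in np.argsort(scores)[::-1][:k]]


def fit(X, y, combo):
    """Least-squares fit of y on the chosen columns plus an intercept."""
    A = np.column_stack([X[:, list(combo)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef


def sisso_sketch(X, y, max_dim=2, k=8):
    """Alternate screening and exhaustive subset search up to max_dim."""
    selected, residual = [], y.copy()
    for dim in range(1, max_dim + 1):
        # Screen against the residual of the previous, lower-dimensional model.
        ranked = sis(X, residual, k + len(selected))
        selected += [j for j in ranked if j not in selected][:k]
        # Sparsifying step: try every dim-sized subset of the kept features.
        best_rmse, best_combo = np.inf, None
        for combo in itertools.combinations(selected, dim):
            _, pred = fit(X, y, combo)
            rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
            if rmse < best_rmse:
                best_rmse, best_combo = rmse, combo
        _, pred = fit(X, y, best_combo)
        residual = y - pred
    return best_combo, best_rmse


# Synthetic data (assumed for illustration): the target depends on two
# derived features, a*b and |a-c|, which the search should recover.
rng = np.random.default_rng(0)
primary = rng.uniform(1.0, 2.0, size=(40, 3))
names, X = build_candidates(primary, ["a", "b", "c"])
a, b, c = primary.T
y = 3.0 * a * b - 2.0 * np.abs(a - c) + 1.0

combo, rmse = sisso_sketch(X, y, max_dim=2, k=8)
descriptor = [names[j] for j in combo]
print(descriptor, rmse)
```

In this toy the candidate pool has only 15 entries, so with k=8 the two screening rounds end up keeping all of them. In a realistic setting the pool is vastly larger than the kept subspace, and the screening step is what makes the exhaustive subset search affordable.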

1) L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, and M. Scheffler, Big Data of Materials Science: Critical Role of the Descriptor. Phys. Rev. Lett. 114, 105503 (2015).
2) L.M. Ghiringhelli, J. Vybiral, E. Ahmetcik, R. Ouyang, S.V. Levchenko, C. Draxl, and M. Scheffler, Learning physical descriptors for materials science by compressed sensing. New J. Phys. 19, 023017 (2017).
3) R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, and L.M. Ghiringhelli, SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Materials 2, 083802 (2018).
4) B.R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, and L.M. Ghiringhelli, Uncovering structure-property relationships of materials by subgroup discovery. New J. Phys. 19, 013031 (2017).

Workshop II: HPC and Data Science for Scientific Discovery