Big data in materials science: New tools for getting insight into materials properties and functions

Claudia Draxl

Computational high-throughput-screening initiatives are producing materials data on workstations, compute clusters, and high-performance facilities at an exponentially growing rate. Typically, only a small fraction of what we compute is finally published, while most of the results, which contain all the information on the quantum-mechanical many-body problem, are thrown away. Keeping the data, though, could be considered a big-data problem. It could, however, also be considered a chance – the chance to learn about physical properties and processes.
How can we exploit the wealth of information inherent in the materials data, which promises unprecedented insight? On the one hand, new tools need to be developed to explore similarities among materials and their properties and to identify trends and anomalies. These tools comprise data-analytics approaches such as machine learning, compressed sensing, and the like. On the other hand, further factors are relevant for this new branch of materials research to be successful: the accuracy, error bars, and comparability of the data.
I will introduce the NOMAD (Novel Materials Discovery) Laboratory – a European Center of Excellence [2] that tackles all these questions. It starts from the NoMaD Repository [1], which was established to promote the idea of open access and sharing of materials data. At present, this repository contains input and output files of more than 3 million calculations that were produced by 10 different electronic-structure codes. This large collection of materials data indeed opens an avenue towards novel materials discovery. To pursue it, the first critical step is to create an archive of data that are unbiased with respect to the underlying code and whose error bars are known. Based on this, the NOMAD Laboratory creates a Materials Encyclopedia and develops big-data analytics tools for materials science. I will demonstrate examples of how data analytics combined with domain-specific knowledge can lead to new scientific insight and discuss the question of whether we can find new equations based on big data.

[1] The Novel Materials Discovery (NoMaD) Repository:
[2] NOMAD Center of Excellence, funded by the EU within HORIZON2020:

Back to Workshop I: Machine Learning Meets Many-Particle Problems