Scalable Data Science and Apache Flink: Key Challenges and (Some) Solutions

Volker Markl
Technische Universität Berlin

Big data holds great promise. However, today’s job market has too few qualified data scientists, and this shortage keeps big data from realizing its full potential to deliver insight and value for scientists, business analysts, and society as a whole. We therefore believe that novel technologies drawing on declarative languages, query optimization, automatic parallelization, and hardware adaptation are necessary to resolve this human resource bottleneck. In this talk, we discuss several aspects of our research in this area, including results on optimizing iterative data flow programs, optimistic fault tolerance, and steps toward a deep language embedding of advanced data analysis programs. We also discuss how our research activities led to Apache Flink, an open-source big data analytics system that is today a major data processing engine in the Apache Big Data Stack, used in a variety of applications by academia and industry.

