Graduate-Level Research in Industrial Projects for Students (GRIPS)-Berlin 2018

June 25 - August 17, 2018

Sponsors and Projects

The sponsors and projects for 2018 include:

Project 1: 1000shapes GmbH – Biotechnology

Project 2: Deutsche Bahn – Public Transport

Project 3: Open Grid Europe – Gas Networks

Project 1: 1000shapes GmbH – Biotechnology

Hosting Lab

The members of the MedLab develop new mathematical methods for identifying disease-specific signatures within modern large-scale bio-medical datasets, such as genomics or proteomics sources. Such signatures (e.g. changing concentrations of a blood protein during a viral infection) make it possible to build new diagnostic tests and to gain insights into disease mechanisms. The approach rests on the observation that changes in cells, as they transform from a “normal” to a malignant state (e.g. during infections), happen on many biological levels, including genes, proteins and metabolites. Integrative analysis of all these levels yields more detailed and informative models of a disease than analyzing the effect of single biomarkers, such as blood values or protein levels.

Sponsor

The project is run in close collaboration with 1000shapes GmbH, a ZIB spin-off that transfers research into industrial applications. 1000shapes provides advanced solutions in image and geometry processing for 2D and 3D product design, covering the full spectrum from measurement, analysis and planning through to manufacturing. In the medical field, 1000shapes is interested in analyzing medical image-based data, such as X-ray, CT or MRI data.

Project

Building on state-of-the-art database technology, students will develop new machine-learning techniques to analyze massive medical data sets. First, students will learn the biological foundations needed to successfully complete the project. They will then use data from a large clinical trial to model medical phenomena based on ideas from the areas of compressed sensing, machine learning, and network-of-networks theory.

Background:

Tumor diseases rank among the most frequent causes of death in Western countries. At the same time, the underlying pathogenic mechanisms are incompletely understood and individual treatment options are lacking. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) the identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual, which we call fingerprints of a disease, and (2) the development of new classification methods that allow robust identification of these fingerprints in an individual’s biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to just about 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for reliably and reproducibly analyzing these huge data sets and finding robust fingerprints that could be used for the ultimate task: (early) diagnostics of cancer.

Problems and (some) hope:

Most of the data coming from available bio-medical data sources, such as images or proteomics data, is ultra high-dimensional and very noisy. At the same time, this data exhibits a very particular structure: it is highly sparse. Thus the information content of this data is much lower than its nominal dimension suggests, which is a prerequisite for the next step in this project: reducing the dimension of the data with as little loss of information as possible.
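As a rough illustration of what "highly sparse" means, the following minimal sketch builds a synthetic vector in which only a few entries carry signal and shows that keeping the largest entries preserves almost all of the energy. Everything here is an assumption made for illustration: the data is made up, the function names are hypothetical, Gaussian noise is used purely for simplicity, and plain hard thresholding stands in for the far more elaborate methods the project is about.

```python
import random

def top_k_support(signal, k):
    """Return the (sorted) indices of the k entries with the largest magnitude."""
    ranked = sorted(range(len(signal)), key=lambda i: abs(signal[i]), reverse=True)
    return sorted(ranked[:k])

def energy_fraction(signal, support):
    """Fraction of the squared norm carried by the entries in `support`."""
    total = sum(x * x for x in signal)
    kept = sum(signal[i] ** 2 for i in support)
    return kept / total

# Hypothetical data: 1000 measured values, only 5 of which carry real signal.
rng = random.Random(0)
true_support = [17, 230, 411, 650, 902]
signal = [rng.gauss(0.0, 0.05) for _ in range(1000)]  # idealized background noise
for i in true_support:
    signal[i] += 5.0  # strong peak (assumed amplitude)

support = top_k_support(signal, k=5)
print(support)
print(energy_fraction(signal, support))
```

In this idealized setting, 5 out of 1000 values are enough to describe the data almost completely, which is the kind of structure that dimension reduction exploits.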

Unfortunately, the sparsity structure of this data is complex, (in most cases) not known a priori, and usually does not coincide with commonly assumed patterns such as joint sparsity or Gaussian noise. This means that, although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. Moreover, specifically adapted dimension reduction strategies such as compressed sensing do not readily exist, e.g. for proteomics data.

However, methods exist that can identify the sparsity structure of the contained information from very high-dimensional, noisy -omics and imaging data. Once this has been achieved, the next step is the integration of the (low-dimensional) information into one unified model. We will use a network-based approach, modelling the various biological levels through a multiplex network derived from existing databases of known interactions, such as protein/protein or gene/protein interactions. The hope is that this model can shed some light on the mechanisms of osteoarthritis and maybe even allow new ways of early diagnosis of this disease.
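One simple way to picture such a layered network is a mapping from layer names to edge lists. The sketch below is only an illustration under stated assumptions: the node names (P1, G1, ...) and the edges are placeholders rather than real database entries, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical layers; node names are placeholders, not real database entries.
layers = {
    "protein-protein": [("P1", "P2"), ("P2", "P3")],
    "gene-protein": [("G1", "P1"), ("G2", "P2"), ("G2", "P3")],
}

def layer_degrees(layers):
    """Map each node to its degree in every layer it appears in."""
    degrees = defaultdict(dict)
    for name, edges in layers.items():
        for u, v in edges:
            for node in (u, v):
                degrees[node][name] = degrees[node].get(name, 0) + 1
    return dict(degrees)

def aggregated_degree(layers):
    """Number of edges touching each node, summed over all layers."""
    return {
        node: sum(per_layer.values())
        for node, per_layer in layer_degrees(layers).items()
    }

print(layer_degrees(layers)["P2"])
print(aggregated_degree(layers))
```

Keeping the layers separate, rather than merging them into a single graph, preserves the information about which biological level an interaction comes from.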

The Goal:

In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.

Requirements

The prospective participant should:

have a background in mathematics, bioinformatics or computer science,
have experience in network analysis,
have experience with a high-level programming language (e.g. C/C++, Java or Python) and a statistical software package such as R,
have attended classes in the area of data mining or acquired the foundations of this field by some other means,
be prepared to work with very large datasets from industry partners (which involves preprocessing, e.g. to overcome inconsistencies and incompleteness).

Ideally he or she

is familiar with the biological background and has already worked with biological datasets,
has experience in working in a Linux/Unix environment.

**************************************

Project 2: Deutsche Bahn – Public Transport

Hosting Lab

The RailLab, located at the Zuse Institute, cooperates with DB Fernverkehr to develop an optimization core that helps to operate the Intercity-Express (ICE), Germany’s fastest and most prestigious train, as efficiently as possible. This is achieved by determining how the ICEs should rotate within Germany, thereby reducing the number of empty trips. The software has been in production use at DB Fernverkehr for several years.

Sponsor

Deutsche Bahn (DB) is Germany’s major railway company. It transports on average 5.4 million customers every day over a rail network that consists of 33,500 km of track and 5,645 train stations. DB operates in over 130 countries world-wide. It provides its customers with mobility and logistical services, and operates and controls the related rail, road, ocean and air traffic networks.

Project

You will learn to think about railway networks from a planner’s perspective. Making up ICE rotations sounds easy at first, but you will soon find out that a lot of constraints have to be taken into account, not to mention the size of Germany’s rail network. This makes finding and understanding suitable mathematical programming models a challenge of its own. Dealing with huge data sets will be your daily business: you will write scripts to process the data and extract useful information. Past project assignments included investigating how robust optimization methodology can be incorporated into the optimization process, and developing a rotation plan for situations in which only a restricted number of train conductors is available, e.g. in a strike scenario.
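To get a feeling for what a rotation and an empty trip are, consider the following toy sketch. It is a deliberately simplified assumption-laden model, not the RailLab's method: trips are just (origin, destination) pairs, timetables and all operational constraints are ignored, the trip list is invented, and the brute-force search only works for a handful of trips.

```python
from itertools import permutations

def empty_trips(rotation):
    """Count repositioning moves in a cyclic rotation of (origin, destination) trips."""
    count = 0
    for i, (_, destination) in enumerate(rotation):
        next_origin = rotation[(i + 1) % len(rotation)][0]
        if destination != next_origin:
            count += 1  # the train must run empty to reach its next trip
    return count

def best_rotation(trips):
    """Brute-force search over all cyclic orders; only feasible for very few trips."""
    first, rest = trips[0], trips[1:]
    candidates = ([first] + list(p) for p in permutations(rest))
    return min(candidates, key=empty_trips)

# Hypothetical trips served by a single train.
trips = [
    ("Berlin", "Hamburg"),
    ("Berlin", "Munich"),
    ("Hamburg", "Berlin"),
    ("Munich", "Frankfurt"),
]

print(empty_trips(trips))
rotation = best_rotation(trips)
print(rotation, empty_trips(rotation))
```

Even in this tiny example the order of the trips decides whether the train needs four empty trips or just one; with realistic numbers of trips, enumeration is hopeless, which is where mathematical programming models come in.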

Requirements

The prospective participant should:

have a good command of a high-level programming language (preferably C++) and experience in writing scripts, e.g. in Python or Shell,
have attended classes in the area of combinatorial optimization, linear and integer programming or acquired the foundations of this field by some other means,
be prepared to work with huge datasets from industry partners (which involves cleaning and preprocessing to overcome inconsistencies and incompleteness).

Ideally he or she

is familiar with procedures in the area of rail traffic and/or logistics,
has experience in working in a Linux/Unix environment and
has experience in collaborative work on source code (e.g. working with revision control systems).

**************************************

Project 3: Open Grid Europe – Gas Networks

Hosting Lab

The GasLab aims to develop new methods for gas-grid planning and operation, combining the most modern mathematical algorithms with up-to-date information technology. The tools developed in the GasLab will assist gas dispatchers and planners in their work and put them in a position to make better decisions based on forward-looking and comprehensive information. The GasLab brings together the main areas of expertise of the scientific partners, that is, modeling, simulation, and optimization, to advance the state of the art in gas-grid management and facilitate innovations.

Sponsor

Open Grid Europe (OGE) operates Germany’s largest gas transmission pipeline system with a gas network spanning more than 12,000 kilometres. All over the country, more than 1,450 staff ensure safe, environmentally-friendly and customer-oriented gas transmission. OGE supports the European gas market and works together with the European distribution network operators to create the prerequisites for transnational gas transportation and trading.

Project

Real-world networks from various domains have been shown to be small-world (large local clustering coefficient and small diameter) and scale-free (node degrees follow a power law). In addition, they often show a hierarchical organisation, since they reflect the modularity of the underlying system. An important step in understanding these complex systems is to identify sub-networks and their hierarchical structure. This knowledge makes it possible, for example, to derive strategies for optimal transportation through these types of networks. However, most existing methods are designed to find non-overlapping sub-networks and do not allow nodes to be shared by different modules. This limitation needs to be overcome to analyse complex networks such as the German gas network, because a main purpose of the network is to distribute gas to regional sub-structures such as cities, while many cities share large pipelines coming e.g. from storage systems. To make things even more complicated, most real-world networks, including the gas network, change over time. A simple example is the downtime of parts of the network for maintenance, e.g. shutting down a pipe connecting two sub-networks.
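The two small-world quantities mentioned above, local clustering coefficient and diameter, can be computed on a toy graph as follows. This is a minimal sketch with an invented four-node graph and hypothetical function names; it assumes an undirected, connected graph stored as an adjacency dictionary.

```python
from collections import deque
from itertools import combinations

def local_clustering(graph, node):
    """Share of neighbor pairs of `node` that are themselves connected."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for u, v in pairs if v in graph[u])
    return linked / len(pairs)

def eccentricity(graph, source):
    """Largest shortest-path distance from `source`, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Diameter of a connected graph: the largest eccentricity of any node."""
    return max(eccentricity(graph, n) for n in graph)

# Hypothetical network: a triangle a-b-c with an extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(local_clustering(graph, "a"))
print(local_clustering(graph, "c"))
print(diameter(graph))
```

Node a sits inside a fully connected neighborhood, while only one of the three neighbor pairs of node c is linked; on real networks these values are typically averaged over all nodes.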

The aim of this project is to analyse time-evolving hierarchical networks, such as the German gas network, in order to understand their inner structure. Based on this structural understanding, processes on these networks will be modelled, simulated and compared to real-world phenomena. An example of such a process is the gas flow within such a system, including its physical properties. Once structural understanding and process understanding are achieved, the ultimate goal will be to use this knowledge to understand the inner logic of such a complex system with respect to flow prediction between the sub-systems over time. A typical question would be: given the demand of a particular sub-system (e.g. a large consumer) over time, what will be the demand tomorrow? You will learn in this project that answering this question is fairly easy if particular smoothness conditions can be assumed (e.g. about the demand, as is often done in modelling courses at university), but that standard approaches fail badly when real-world scenarios are targeted. And you will learn what can be done in these cases.
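The contrast between smooth and real-world demand can be illustrated with a very simple forecaster. The sketch below is an assumption-based toy: both demand series are invented, and a moving average merely stands in for "standard approaches" in general.

```python
def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def mean_absolute_error(series, window=3):
    """Average one-step-ahead error of the moving-average forecast."""
    errors = [
        abs(series[t] - moving_average_forecast(series[:t], window))
        for t in range(window, len(series))
    ]
    return sum(errors) / len(errors)

# Hypothetical daily demand of one consumer (arbitrary units).
smooth = [100, 101, 102, 103, 104, 105]   # slowly varying demand
spiky = [100, 100, 100, 100, 180, 100]    # a sudden one-day peak

print(mean_absolute_error(smooth))
print(mean_absolute_error(spiky))
```

On the smooth series the forecast is off by a small, constant amount; on the series with a sudden peak it misses the peak entirely and then overshoots on the following day.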

Requirements

The prospective participant should:

have programming skills in some higher-level programming language such as C/C++, Java, or Python,
be familiar with algorithms in the area of network science,
be prepared to work with real-world datasets from industry partners (which involves cleaning and preprocessing to overcome inconsistencies and incompleteness).

Ideally he or she

has attended classes in the area of combinatorial optimization, linear and integer programming or acquired the foundations of this field by some other means,
has experience in working in a Linux/Unix environment and
has experience in collaborative work on source code (e.g. working with revision control systems).