Graduate-Level Research in Industrial Projects for Students (GRIPS)-Berlin 2016

June 27 - August 19, 2016

Sponsors and Projects

The sponsors and projects for 2016 include:

Project 1: Train Planning – Deutsche Bahn (DB)

Project 2: Large Medical Data Analysis – MODAL AG (MAG)

Project 3: Therapy Planning – 1000shapes GmbH

 

 

Project 1:  Train Planning – Deutsche Bahn (DB)

 

Hosting Lab

The MODAL:RailLab cooperates with DB Fernverkehr to develop an optimization core that helps to operate the Intercity-Express (ICE), Germany’s fastest and most prestigious train, in the most efficient way. This is achieved by determining how the ICEs should rotate within Germany, thereby reducing the number of empty trips. The software has now been deployed in production at DB Fernverkehr for two years.

Sponsor

Deutsche Bahn (DB) is Germany’s major railway company. It transports on average 5.4 million customers every day over a rail network that consists of 33,500 km of track and 5,645 train stations. DB operates in over 130 countries worldwide. It provides its customers with mobility and logistical services, and operates and controls the related rail, road, ocean and air traffic networks.

Project

You will learn to think about the railway network at DB from a planner’s perspective. Making up ICE rotations sounds easy at first, but you will soon find out that many constraints have to be taken into account, and do not forget the size of Germany’s rail network! This makes finding and understanding suitable mathematical programming models a challenge in its own right. It will be your daily business to deal with huge data sets. You will write scripts to process the data and extract useful information. You are also welcome to come up with your own ideas and to propose and implement extensions to our optimization core. Past project assignments included investigating how robust optimization methodology can be incorporated into the optimization process and developing a rotation plan for the situation that only a limited number of train conductors is available, e.g. in a strike scenario.
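To give a flavor of the kind of model involved, the following Python sketch links arriving ICE trips to subsequent departures so that every departure is covered and the total empty-run distance is minimized. This is an illustrative toy only, not DB's actual optimization core: the PuLP library, the trip names and all distances are assumptions chosen for the example.

    # Illustrative toy only (hypothetical data, not DB's model): link each arriving
    # ICE trip to a following departure so that every departure is covered exactly
    # once and the total empty-run ("deadhead") distance is minimized.
    import pulp

    arrivals   = ["A1", "A2", "A3"]      # trips that end somewhere in the network
    departures = ["D1", "D2", "D3"]      # trips that need a train next
    deadhead = {                         # hypothetical empty-run km between end and start stations
        ("A1", "D1"): 0,   ("A1", "D2"): 120, ("A1", "D3"): 300,
        ("A2", "D1"): 90,  ("A2", "D2"): 0,   ("A2", "D3"): 210,
        ("A3", "D1"): 250, ("A3", "D2"): 80,  ("A3", "D3"): 0,
    }

    prob = pulp.LpProblem("toy_rotation_linking", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("link", deadhead, cat="Binary")  # x[a, d] = 1 if trip a continues as trip d

    prob += pulp.lpSum(deadhead[k] * x[k] for k in deadhead)   # minimize empty kilometres
    for a in arrivals:                                         # every arriving train gets exactly one successor
        prob += pulp.lpSum(x[(a, d)] for d in departures) == 1
    for d in departures:                                       # every departure is covered by exactly one train
        prob += pulp.lpSum(x[(a, d)] for a in arrivals) == 1

    prob.solve()
    for k in sorted(deadhead):
        if x[k].value() > 0.5:
            print(k, "linked, deadhead km:", deadhead[k])

The real rotation planning problem has to respect many further operational constraints and works on the full national network, which is exactly what makes suitable models hard to find and to solve.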

Requirements

The prospective participant should:

  • have a good command of a high-level programming language (preferably C++) and experience in writing scripts, e.g. in Python or Shell,
  • have attended classes in the area of combinatorial optimization, linear and integer programming, or have acquired the foundations of this field by some other means, and
  • be prepared to work with huge datasets from industry partners (which involves cleaning and preprocessing to overcome inconsistencies and incompleteness).

 

Ideally he or she:

  • is familiar with procedures in the area of rail traffic and/or logistics,
  • has experience in working in a Linux/Unix environment, and
  • has experience with collaborative work on source code (e.g. with revision control systems).

 

**************************************

Project 2:  Large Medical Data Analysis – MODAL AG (MAG)

 

Hosting Lab

Changes in cells while they are undergoing transformation from “normal” to malignant cells (e.g. during infections) happen on many biological levels, such as genome, transcriptome, proteome and metabolome. Following the central dogma of molecular biology and its extensions, these levels are highly interconnected and depend on each other. Within the MODAL:MedLab, we develop new mathematical methods that allow (1) the identification of multivariate disease signatures that describe changes across multiple data sources and (2) the development of multi-level models that embed these findings into the actual biological context. Both parts combined will eventually lead to a thorough understanding of the modeled process and open up the opportunity to use the respective model for diagnostic purposes for individuals, thus allowing high-throughput classification of biological samples. These techniques can then be adjusted to an individual by using his or her -omics data, and thus allow information about the individual’s state to be derived, for example as a diagnostic tool for a certain disease that is captured by the data and the model. All algorithms will be implemented using state-of-the-art software frameworks that can cope with the very large data volumes.

Sponsor

The MODAL AG (MAG) is a ZIB spin-off that works as a bridge between research and industry. MAG offers the students in this project access to real-world data and expertise from leading hospitals and companies working in this field. Within the MAG infrastructure, students will have the opportunity to experience the creation of industry-strength technology and software solutions.

Project

Building on state-of-the-art database technology, students will develop new machine-learning techniques to analyze massive medical data sets. First, students will learn the biological foundations needed to successfully complete the project. They will then use data from a large clinical trial to model medical phenomena based on ideas from the areas of compressed sensing, machine learning, and network-of-networks theory.

Background: Tumor diseases rank among the most frequent causes of death in Western countries, coinciding with an incomplete understanding of the underlying pathogenic mechanisms and a lack of individual treatment options. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual; we call these sets of disease-specific signals fingerprints of a disease. And (2), development of new classification methods allowing for robust identification of these fingerprints in an individual’s biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to the roughly 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for how to reliably and reproducibly analyze these huge data sets and find robust fingerprints that could be used for the ultimate task: (early) diagnosis of cancer.

Problems and (some) hope: -omics data is ultra high-dimensional and very noisy – but only sparsely filled with information: biological -omics data (e.g. proteomics or genomics data) is typically very large (millions of dimensions), which significantly increases the complexity of algorithms for analyzing the parameter space or even makes them infeasible. At the same time, this data exhibits a very particular structure, in the sense that it is highly sparse. Thus the information content of this data is much lower than its actual dimension seems to suggest, which is the prerequisite for any dimension reduction with small loss of information.

However, the sparsity structure of this data is highly complex: not only do the large entries exhibit a particular clustering, with the amplitudes forming Gaussian-like shapes, but the noise affecting the signal is also by no means Gaussian noise, a customarily assumed property. In addition, considering different sample sets, those clusters also differ slightly in their locations from sample set to sample set, and hence do not coincide with standard patterns such as joint sparsity. This means that, although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. However, specifically adapted automatic dimension reduction strategies such as compressed sensing, which avoid cumbersome by-hand identification of significant values, have never actually been developed for proteomics data, for instance. In our project, such a dimension reduction step will be a crucial ingredient and shall precede the analysis of the parameter space, thereby enabling low-complexity algorithms.
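As a minimal illustration of the compressed-sensing idea (a sketch under toy assumptions, not the method to be developed in the project), the following Python snippet recovers a sparse vector from far fewer noisy linear measurements via Orthogonal Matching Pursuit from scikit-learn. The dimensions, the Gaussian measurement matrix, and the noise level are made-up example values.

    # Minimal compressed-sensing illustration (not the project's method): recover a
    # sparse signal x from far fewer noisy linear measurements y = A x + e using
    # Orthogonal Matching Pursuit from scikit-learn. All sizes are toy values.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    n_features, n_measurements, sparsity = 2000, 200, 10

    x_true = np.zeros(n_features)                 # high-dimensional but sparse "signal"
    support = rng.choice(n_features, sparsity, replace=False)
    x_true[support] = rng.normal(size=sparsity)

    A = rng.normal(size=(n_measurements, n_features)) / np.sqrt(n_measurements)
    y = A @ x_true + 0.01 * rng.normal(size=n_measurements)   # noisy compressed measurements

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity)
    omp.fit(A, y)
    print("true support :", sorted(support))
    print("found support:", sorted(np.flatnonzero(omp.coef_)))

Real -omics data violates the textbook assumptions made here (Gaussian noise, standard sparsity patterns), which is precisely why adapted strategies need to be developed in the project.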

The major challenge in these applications is to extract a set of features, as small as possible, that accurately classifies the learning examples.

The goal: In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.
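To make the goal concrete, here is a hedged Python sketch of the simplest possible variant: an L1-regularized (and therefore sparsity-promoting) logistic regression that selects a small set of discriminative features from high-dimensional synthetic data. The scikit-learn estimator, the synthetic data generator, and the regularization strength are assumptions for illustration only; they do not prescribe the method the project will develop.

    # Hedged sketch of the fingerprint idea: L1-regularized (sparsity-promoting)
    # logistic regression selects a small set of features that separates two
    # groups. Data is synthetic; real -omics data would replace X and y after
    # preprocessing, and all parameter values here are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=5000, n_informative=15,
                               n_redundant=0, random_state=0)

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # L1 drives most weights to zero
    clf.fit(X, y)

    fingerprint = np.flatnonzero(clf.coef_)        # indices of the selected features
    print("fingerprint size:", fingerprint.size)
    print("cross-validated accuracy: %.2f" % cross_val_score(clf, X, y, cv=5).mean())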

Requirements

The prospective participant should:

  • have a background in mathematics, bioinformatics or computer science,
  • have experience with a high-level programming language (e.g. C++ or Java) and a statistical software package such as SPSS or R,
  • have attended classes in the area of data mining or have acquired the foundations of this field by some other means, and
  • be prepared to work with very large datasets from industry partners (which involves preprocessing, e.g. to overcome inconsistencies and incompleteness).

 

Ideally he or she:

  • is familiar with the biological background and has already worked with biological data sets,
  • has experience in working in a Linux/Unix environment, and
  • has experience with collaborative work on source code (e.g. with revision control systems).

 

**************************************

Project 3:  Therapy Planning – 1000shapes GmbH

 

Hosting Lab

Within the therapy planning group at ZIB we are dealing with a variety of medical data. To tackle the challenge of analyzing an ever-increasing amount of data, and to provide software tools that automatically extract the relevant information from it, we are investigating model-based approaches (statistical shape and appearance models) as well as machine learning techniques (regression, classification, and semi-supervised learning). These can then be used in a number of applications such as scene recognition from photographs, object recognition in images, or automatic diagnosis from medical image data.

Sponsor

The project is in close collaboration with 1000shapes GmbH, a ZIB spin-off that transfers research in the life sciences into products for clinical applications. Within the project, algorithms are to be developed within an existing software framework and tested on clinical image data. The successful applicant will have the opportunity to perform research in medical image computing within the ZIB therapy planning research group while obtaining professional support from 1000shapes in software development and in implementing algorithms within existing software frameworks. Students will thus have the opportunity to experience medical research in combination with industry-strength software development.

Project

Building on a large medical image database, students will investigate new machine-learning techniques, in particular the application of regression forests, to analyze and classify features or disease patterns in medical image data.

Background: Osteoarthritis (OA) is the most common form of arthritis and the major cause of activity limitation and physical disability in older people. Today, 35 million people (13 percent of the U.S. population) are 65 and older, and more than half of them have radiological evidence of osteoarthritis in at least one joint. By 2030, 20 percent of Americans (about 70 million people) will have passed their 65th birthday and will be at risk for OA. The Osteoarthritis Initiative (OAI) is a multi-center, longitudinal, prospective observational study of knee osteoarthritis (KOA), providing a database for osteoarthritis that includes clinical evaluation data, radiological (x-ray and magnetic resonance) images, and a bio-specimen repository from over 5,000 patients. Radiological images are a rich source of information on both anatomical and physiological properties of the human body. This information has great potential both for developing a better understanding of disease onset and progression and for improving future therapeutic concepts. However, such vast amounts of data require automated image processing approaches.

Challenges: The overall goal is to automatically extract anatomical shape and appearance (image intensity) information, such as bone, cartilage, tendons and muscle tissues, from the radiological images. To this end, machine learning combined with model-based approaches shall be adopted. In particular, the random forest regression voting method has been shown to be very successful in localizing landmarks in 2D x-ray images. However, this approach is computationally quite expensive in terms of both CPU time and, especially, memory consumption. Furthermore, 3D magnetic resonance images have a significantly larger range of variation in intensity due to their acquisition process.

Aim: The goal of this project is to evaluate existing implementations of the random forest regression voting method and to develop and evaluate new implementations of it for 3D shape detection in magnetic resonance tomography data. Results of this work may become a basis for improving existing segmentation, classification, and diagnosis algorithms that are currently under development or even in use by 1000shapes.
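For orientation only, the following Python sketch reduces random-forest regression voting to a one-dimensional toy problem using scikit-learn's RandomForestRegressor and synthetic signals. Every name and parameter in it is an assumption for illustration; the project itself targets 3D magnetic resonance data and existing software frameworks, which this toy does not reflect.

    # Simplified sketch of random-forest regression voting on a 1D toy "image"
    # (not 1000shapes' implementation, and not 3D MR data): each local patch votes
    # for the offset from its centre to a landmark, and the densest accumulation
    # of votes is taken as the detected landmark position.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    patch_radius, size = 5, 200

    def make_signal(landmark):
        signal = rng.normal(0, 0.2, size)                                   # noise
        signal += np.exp(-0.5 * ((np.arange(size) - landmark) / 4.0) ** 2)  # a bump marks the "anatomy"
        return signal

    def patches_and_offsets(signal, landmark):
        centres = np.arange(patch_radius, size - patch_radius)
        patches = np.stack([signal[c - patch_radius:c + patch_radius + 1] for c in centres])
        return patches, landmark - centres, centres

    # training: many signals with known landmark positions
    X_train, y_train = [], []
    for _ in range(50):
        lm = rng.integers(40, size - 40)
        patches, offsets, _ = patches_and_offsets(make_signal(lm), lm)
        X_train.append(patches)
        y_train.append(offsets)
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(np.vstack(X_train), np.concatenate(y_train))

    # testing: every patch casts a vote; the fullest histogram bin wins
    lm_true = 120
    patches, _, centres = patches_and_offsets(make_signal(lm_true), lm_true)
    votes = centres + forest.predict(patches)
    hist, edges = np.histogram(votes, bins=size, range=(0, size))
    print("true landmark:", lm_true, " detected near:", int(edges[np.argmax(hist)]))

Even in this toy setting one can see where the cost comes from: every patch becomes a training sample, so the number of samples, and with it memory consumption, grows quickly with image size.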

Requirements

The prospective participant should:

  • have a background in computer science, bio-engineering, mathematics or physics,
  • have experience with collaborative code development in C++ on Windows or Linux,
  • have attended classes in the area of machine learning, computer vision, image processing or statistics,
  • be acquainted with software for processing large medical image datasets,
  • have fun working in teams, with colleagues from both academia and industry, and
  • be able to communicate fluently in English.