Mathematical Challenges in Scientific Data Mining
January 14 - 18, 2002
Advances in technology have enabled us to collect data from observations, simulations, and experiments at an ever-increasing pace. For the scientist to benefit from these enhanced data-collecting capabilities, it is becoming clear that semi-automated techniques, such as the ones in data mining, must be applied to find the useful information in the data. Data mining is the discovery of patterns, associations, anomalies, and statistically significant structures in data. It is a multi-disciplinary field, borrowing and enhancing ideas from diverse areas such as statistics, signal and image processing, image understanding, mathematical optimization, computer vision, and pattern recognition.
Mining scientific data sets is an area rich in challenging mathematical problems, where the complexity and size of the data are matched only by the diversity of applications. Several recent workshops on the subject have indicated that this is a field of active research with potential beneficiaries in areas such as astronomy, remote sensing, physics, bio-informatics, medical imaging, non-destructive evaluation, and combinatorial chemistry.
Building on this growing interest in the topic, we are organizing an IPAM short program on the Mathematical Challenges in Scientific Data Mining in early 2002 (January 14-18). This week-long meeting will bring together mathematicians, data mining practitioners, and domain scientists to share their experiences. We hope to accomplish the following goals:
To bridge the cultural and knowledge gaps between the mathematicians, data mining practitioners, and domain scientists, we envision the three broad themes outlined below. The first two themes will help us to identify the common threads across diverse applications and data sets. This would set the stage for the third theme, namely, identifying and addressing the mathematical challenges in scientific data mining.
1. Understanding the nature and types of data
Focusing on data from astronomical surveys, remote sensing, medical imaging, bio-informatics, computer simulations, etc., we would first identify the common types of data in science and engineering problems. These would include spatial data (two- and three-dimensional, with multivariate fields), spatio-temporal data, data in the form of hierarchical structures, simulated vs. observed data, grid vs. "object" data, and multi-spectral and multi-resolution data from multiple sensors.
2. Scientific data mining tasks
This theme would help the short-program participants understand what the scientists want to do with the data. An underlying assumption here is that the available data is typically in a "raw form" (e.g., pixel values or variables at a mesh point). Higher-level representations have to be abstracted from this lower-level data before we can make inferences (i.e., detect patterns or other useful information) at the higher level. The tasks at the lower level would typically include object detection and characterization, tracking of objects, registration and alignment necessary for data fusion, as well as feature measurements. At the higher level, there are the more traditional pattern recognition tasks of classification, clustering, regression, interactive retrieval, novelty detection, verification and validation, etc. One could also envisage a "middle" level, where the high dimensionality of the higher-level representation is reduced to make the process of inference tractable.
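As a rough illustration of the lower-level tasks described above, the following minimal sketch goes from raw pixel values to one feature record per detected object. It is only one possible reading of "object detection" and "feature measurements": the thresholding rule, the 4-connected labeling, the chosen features, and the names `label_objects` and `measure_features` are all assumptions made for this example, not methods prescribed by the program.

```python
from collections import deque


def label_objects(image, threshold):
    """Label 4-connected regions of pixels whose value exceeds `threshold`.

    Returns a grid of integer labels (0 = background) and the object count.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    n_objects = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue  # background pixel, or already part of an object
            n_objects += 1
            labels[r][c] = n_objects
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 4-connected neighbours.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] > threshold and not labels[ny][nx]:
                        labels[ny][nx] = n_objects
                        queue.append((ny, nx))
    return labels, n_objects


def measure_features(image, labels, n_objects):
    """Compute area, centroid, and mean intensity for each labeled object."""
    # Per object: [pixel count, sum of rows, sum of columns, sum of values]
    sums = {k: [0, 0.0, 0.0, 0.0] for k in range(1, n_objects + 1)}
    for r, row in enumerate(labels):
        for c, k in enumerate(row):
            if k:
                acc = sums[k]
                acc[0] += 1
                acc[1] += r
                acc[2] += c
                acc[3] += image[r][c]
    return [
        {"area": a, "centroid": (sy / a, sx / a), "mean_intensity": sv / a}
        for a, sy, sx, sv in sums.values()
    ]


# Hypothetical "raw" data: a tiny grid of pixel values with two bright objects.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0, 0],
    [0, 7, 9, 0, 0, 5],
    [0, 0, 0, 0, 0, 6],
]

labels, n_objects = label_objects(image, threshold=4)
features = measure_features(image, labels, n_objects)
for record in features:
    print(record)
```

The resulting feature records are the kind of higher-level representation on which classification, clustering, or dimension reduction could then operate.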
3. Mathematical algorithms, challenges, and issues
The main focus here would be the role various mathematical techniques can play to enable the tasks in theme 2 above to be applied efficiently and accurately to the data in theme 1. Possible issues that could be addressed include (but are not limited to):
Structure of the Program
The program will include introductory tutorials, several talks on data mining algorithms as well as scientific and engineering applications, and contributed sessions. The goal is both to provide an introduction to the field and to introduce the participants to some of the mathematical challenges and potential solutions.
Related Web Sites
First SIAM Conference on Data Mining (April, 2001, Chicago)
Speakers
Pierre Baldi (University of California at Irvine)
Bir Bhanu (University of California at Riverside)
Amy Braverman (Jet Propulsion Laboratory)
Leo Breiman (University of California at Berkeley)
John Brence (United States Military Academy)
Donald Brown (University of Virginia)
Tony Chan (UCLA)
George Djorgovski (California Institute of Technology)
Imola Fodor (Lawrence Livermore National Laboratory)
Chandrika Kamath (Lawrence Livermore National Laboratory)
George Karypis (University of Minnesota)
Vipin Kumar (University of Minnesota/AHPCRC)
Raghu Machiraju (Ohio State University)
Ivan Marusic (University of Minnesota)
Marco Mazzucco (University of Illinois at Chicago)
Dennis Mock (University of California at San Diego)
Douglas Nychka (National Center for Atmospheric Research)
Zoran Obradovic (Temple University)
Stanley Osher (IPAM)
Rahul Ramachandran (University of Alabama in Huntsville)
Terrence Sejnowski (Salk Institute)
Padhraic Smyth (University of California at Irvine)
Paul Thompson (UCLA)
Shusaku Tsumoto (Shimane Medical University)
Mike Turmon (Jet Propulsion Laboratory)
Mladen Wickerhauser (Washington University)
Mohammed Zaki (Rensselaer Polytechnic Institute)
Institute for Pure and Applied Mathematics (IPAM)