
Mathematical Challenges in Scientific Data Mining

January 14 - 18, 2002

Organizing Committee:

Ananth Grama (Purdue University)
Chandrika Kamath (Lawrence Livermore National Laboratory)
B. S. Manjunath (UCSB)
Padhraic Smyth (University of California at Irvine)

Introduction

Advances in technology have enabled us to collect data from observations, simulations, and experiments at an ever-increasing pace. For scientists to benefit from these enhanced data-collecting capabilities, it is becoming clear that semi-automated techniques, such as those used in data mining, must be applied to find the useful information in the data. Data mining is the discovery of patterns, associations, anomalies, and statistically significant structures in data. It is a multi-disciplinary field, borrowing and enhancing ideas from diverse areas such as statistics, signal and image processing, image understanding, mathematical optimization, computer vision, and pattern recognition.

Mining scientific data sets is an area rich in challenging mathematical problems, where the complexity and size of the data are matched only by the diversity of applications. Several recent workshops on the subject have indicated that this is a field of active research, with potential beneficiaries in areas such as astronomy, remote sensing, physics, bio-informatics, medical imaging, non-destructive evaluation, and combinatorial chemistry.

Building on this growing interest in the topic, we are organizing an IPAM short program on the Mathematical Challenges in Scientific Data Mining in early 2002 (January 14-18). This week-long meeting will bring together mathematicians, data mining practitioners, and domain scientists to share their experiences. We hope to accomplish the following goals: 

  • Help mathematicians and data miners to understand the problems faced by domain scientists in analyzing their data 
  • Enable data miners to identify techniques being used to solve similar problems in different domains 
  • Enable all participants to better understand the mathematics underlying various techniques 
  • Identify open mathematical problems that must be addressed for data mining to be successfully applied to complex data sets in science and engineering applications

To bridge the cultural and knowledge gaps between mathematicians, data mining practitioners, and domain scientists, we envision three broad themes, outlined below. The first two themes will help us identify the common threads across diverse applications and data sets. This will set the stage for the third theme, namely, identifying and addressing the mathematical challenges in scientific data mining.

1. Understanding the nature and types of data

Focusing on data from astronomical surveys, remote sensing, medical imaging, bio-informatics, computer simulations, etc., we would first identify the common types of data in science and engineering problems. These would include spatial data (two- and three-dimensional, with multivariate fields), spatio-temporal data, data in the form of hierarchical structures, simulated vs. observed data, grid vs. "object" data, and multi-spectral and multi-resolution data from multiple sensors.

2. Scientific data mining tasks

This theme would help the short-program participants understand what the scientists want to do with the data. An underlying assumption here is that the available data is typically in a "raw form" (e.g., pixel values or variables at a mesh point). Higher-level representations have to be abstracted from this lower-level data before we can make inferences (i.e., detect patterns or other useful information) at the higher level. The tasks at the lower level would typically include object detection and characterization, tracking of objects, registration and alignment necessary for data fusion, as well as feature measurements. At the higher level, there are the more traditional pattern recognition tasks of classification, clustering, regression, interactive retrieval, novelty detection, verification and validation, etc. One could also envisage a "middle" level, where the high dimensionality of the higher-level representation is reduced to make the process of inference tractable.

3. Mathematical algorithms, challenges, and issues

The main focus here would be the role various mathematical techniques can play to enable the tasks in theme 2 above to be applied efficiently and accurately to the data in theme 1. Possible issues that could be addressed include (but are not limited to): 

  • Data registration and alignment techniques to fuse multi-sensor, multi-resolution, multi-spectral data 
  • Mathematical optimization techniques, evolutionary algorithms, and possible combinations for improving the accuracy and efficiency of the pattern recognition tasks 
  • Accurate techniques for tracking objects in space and time 
  • Algorithms for extracting features that are rotation-, translation-, and scale-invariant 
  • Robust methods for object detection and characterization in two and three dimensions using techniques such as deformable models, level sets, and genetic algorithms 
  • Dimension reduction techniques for high-dimensional data, such as independent component analysis, blind source separation, and non-linear PCA, and the connections among them. The focus would be on non-linear combinations and basis functions that are non-orthogonal. 
  • Techniques for mining data that is hierarchical or is compressed using multi-resolution techniques 
  • The role of graph theory in data mining 
  • Techniques to interpret the effects of uncertainty in data 
  • Statistical issues associated with all aspects of data mining 
  • In the case of simulation data, techniques to exploit the conservation laws that are obeyed by the physical models

Structure of the Program 

The program will include introductory tutorials, several talks on data mining algorithms as well as scientific and engineering applications, and contributed sessions. The goal is both to provide an introduction to the field and to introduce the participants to some of the mathematical challenges and potential solutions.

Related Web Sites 

First Workshop on Mining Scientific Datasets 

Second Workshop on Mining Scientific Datasets 

First SIAM Conference on Data Mining (April, 2001, Chicago) 

Third Workshop on Mining of Scientific Data Sets 

Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (August 2001) 

Fourth Workshop on Mining Scientific Data sets

Speakers

Pierre Baldi (University of California at Irvine)
Bir Bhanu (University of California at Riverside)
Amy Braverman (Jet Propulsion Laboratory)
Leo Breiman (University of California at Berkeley)
John Brence (United States Military Academy)
Donald Brown (University of Virginia)
Tony Chan (UCLA)
George Djorgovski (California Institute of Technology)
Imola Fodor (Lawrence Livermore National Laboratory)
Chandrika Kamath (Lawrence Livermore National Laboratory)
George Karypis (University of Minnesota)
Vipin Kumar (University of Minnesota/AHPCRC)
Raghu Machiraju (Ohio State University)
Ivan Marusic (University of Minnesota)
Marco Mazzucco (University of Illinois at Chicago)
Dennis Mock (University of California at San Diego)
Douglas Nychka (National Center for Atmospheric Research)
Zoran Obradovic (Temple University)
Stanley Osher (IPAM)
Rahul Ramachandran (University of Alabama in Huntsville)
Terrence Sejnowski (Salk Institute)
Padhraic Smyth (University of California at Irvine)
Paul Thompson (UCLA)
Shusaku Tsumoto (Shimane Medical University)
Mike Turmon (Jet Propulsion Laboratory)
Mladen Wickerhauser (Washington University)
Mohammed Zaki (Rensselaer Polytechnic Institute)

Contact Us:

Institute for Pure and Applied Mathematics (IPAM)
Attn: SDM2002
460 Portola Plaza
Los Angeles CA 90095-7121
Phone: 310 825-4755
Fax: 310 825-4756
Email: ipam@ucla.edu
Website: http://www.ipam.ucla.edu/programs/sdm2002/

