Reducing Size and Complexity of Large Geophysical Data Sets

Amy Braverman
Jet Propulsion Laboratory

This talk discusses a procedure for compressing large data sets, particularly
geophysical ones like those obtained from remote sensing satellite instruments.
Data are partitioned by space and time, and a penalized clustering algorithm is
applied to each subset independently. The algorithm is based on the
entropy-constrained vector quantizer (ECVQ) of Chou, Lookabaugh and Gray (1989).
In each subset ECVQ trades off error against data reduction to produce a set of
representative points that stand in for the original observations. Since data
are voluminous, a preliminary set of representatives is determined from a
sample, then the full subset is clustered by assigning each observation to the
nearest representative point. After replacing the initial representatives by the
centroids of these final clusters, the new representatives and their associated
counts constitute a compressed version, or summary, of the raw data.
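
The steps above can be sketched in a few lines of Python. This is a minimal
illustration, not the implementation used for MISR: it assumes squared
Euclidean distance as the error measure and the usual ECVQ penalized assignment
cost, squared distance minus lam times the log of the cluster's relative
frequency. The names ecvq_fit, summarize, k_init and lam are hypothetical,
introduced only for this example.

```python
import numpy as np

def ecvq_fit(sample, k_init, lam, n_iter=50, seed=0):
    """Entropy-penalized clustering of a sample (ECVQ-style sketch).

    Returns preliminary representatives; clusters that end up empty
    are dropped, so fewer than k_init points may be returned.
    """
    rng = np.random.default_rng(seed)
    # Start from k_init distinct sample points with equal cluster weights.
    centers = sample[rng.choice(len(sample), size=k_init, replace=False)]
    probs = np.full(k_init, 1.0 / k_init)
    for _ in range(n_iter):
        # Penalized cost: squared distance plus lam * (-log cluster probability).
        d2 = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = (d2 - lam * np.log(probs)[None, :]).argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centers))
        keep = np.flatnonzero(counts > 0)
        new_centers = np.array([sample[labels == j].mean(axis=0) for j in keep])
        new_probs = counts[keep] / len(sample)
        converged = (new_centers.shape == centers.shape
                     and np.allclose(new_centers, centers))
        centers, probs = new_centers, new_probs
        if converged:
            break
    return centers

def summarize(data, reps):
    """Assign each observation to its nearest representative, then return
    the centroids of the resulting clusters together with their counts."""
    # Builds an (n, k) distance table; a real pipeline would process chunks.
    d2 = ((data[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    counts = np.bincount(labels, minlength=len(reps))
    keep = np.flatnonzero(counts > 0)
    centroids = np.array([data[labels == j].mean(axis=0) for j in keep])
    return centroids, counts[keep]

# Synthetic stand-in for one space-time subset (not real MISR data).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (500, 2)),
                  rng.normal(5.0, 0.1, (500, 2))])
sample = data[rng.choice(len(data), size=100, replace=False)]
reps = ecvq_fit(sample, k_init=8, lam=1.0)
centroids, counts = summarize(data, reps)
```

In this sketch a larger lam penalizes rarely used clusters more heavily, so
fewer representatives tend to survive; that is the trade-off between error and
data reduction described above. Because the final representatives are cluster
centroids, the count-weighted mean of the summary equals the mean of the raw
subset.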


Since the initial representatives are derived from a sample, the final
summary is subject to sampling variation. A statistical model for the
relationship between compressed and raw data provides a framework for assessing
this variability, and other aspects of summary quality. The procedure is being
used to produce low-volume summaries of high-resolution data products obtained
from the Multi-angle Imaging SpectroRadiometer (MISR), an instrument aboard
NASA's Terra satellite. MISR produces approximately 2 TB per month of radiance
and geophysical data. Practical considerations for this application are
discussed, and a sample analysis using compressed MISR data is presented.



Bio:

Amy Braverman is a statistician in the Earth and Space Sciences Division at the
Jet Propulsion Laboratory. Her research focuses on the adaptation of data
compression techniques to produce reduced versions of massive remote sensing
data sets. She is responsible for design and implementation of algorithms for
two JPL instruments in NASA's Earth Observing System: the Multi-angle Imaging
SpectroRadiometer (MISR) and the Atmospheric Infrared Sounder (AIRS). She is
also the cognizant scientist for the MISR Data Visualization Working Group, and
collaborates with JPL's Machine Learning Systems Group on data mining problems.

Back to Mathematical Challenges in Scientific Data Mining