Dataset Search and Augmentation

Juliana Freire
New York University

The growing number of available structured datasets, from Web tables and open-data portals to enterprise data, opens up new opportunities to enrich analytics and improve machine learning models through data augmentation. While dataset search engines for the Web and enterprises provide a first step towards improving dataset findability, their query interfaces are limited, supporting only simple, keyword-based queries and faceted search. In this talk, I will discuss a new class of dataset search queries that uncover relationships between datasets and support data augmentation. Concretely, given as input a dataset D, in the context of an analytics question A or a predictive model M, a data augmentation query returns a ranked list of datasets that are related to D and that answer A or enhance the performance of M. I will present our ongoing research on techniques to support the efficient evaluation of augmentation queries, as well as techniques for presenting search results so that users can make sense of the data and effectively judge the relevance and suitability of each dataset for a given task.

Bio
Juliana Freire is a Professor of Computer Science and Data Science at New York University. She is the elected chair of the ACM Special Interest Group on Management of Data (SIGMOD). She served as a council member of the Computing Research Association’s Computing Community Consortium (CCC), and was the lead investigator and executive director of the NYU Moore-Sloan Data Science Environment. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. Her work spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, as well as different application areas, including urban analytics, predictive modeling, and computational reproducibility. Freire has co-authored over 200 technical papers (including 8 award-winning publications) and several open-source systems, and is an inventor of 12 U.S. patents. According to Google Scholar, her h-index is 59 and her work has received over 14,000 citations. She is an ACM Fellow and a recipient of an NSF CAREER award, two IBM Faculty awards, and a Google Faculty Research award. She was awarded the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She received a B.S. degree in computer science from the Federal University of Ceara (Brazil), and M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook.


Back to Workshop III: Large Scale Autonomy: Connectivity and Mobility Networks