Research Collaboration Workshop, “Women in Data Science and Mathematics”

August 7 - 11, 2023

Projects

Title: Optimizing NLP Embedding Techniques for Embedded Systems

Name: Karolyn Babalola, with Christal Gordon

Project: Text embeddings transform language into numerical representations that can be used in deep learning architectures for a plethora of natural language processing (NLP) use cases, including language translation and generation, text summarization, and sentiment analysis. In NLP, embeddings are typically generated to represent semantic relationships between words or phrases; however, the size of the embedding is usually limited by the temporal and computational constraints imposed by model training and inferencing requirements.  Embedded systems, i.e. programmable devices used to perform specific tasks in computationally limited remote environments, typically impose the most stringent computational constraints, with the goal of optimizing output for a specific task.  More and more, embedded systems applications call for online NLP tasks to build a common operating picture in tactical environments. The goal of this project is to generate a representative set of embedded use cases that require on-board NLP and to outline prescriptive methods for optimal text embedding generation that fulfill the requirements of the embedded system while operating within the computational constraints of each use case.
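As a minimal sketch of the kind of optimization the project contemplates, the Python/NumPy example below projects embeddings to a lower dimension and quantizes them to 8-bit integers before deployment on a memory-constrained device. The embedding dimension, target rank, and random data are placeholder assumptions, not project specifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of 384-dimensional sentence embeddings
# (in practice these would come from a transformer encoder).
embeddings = rng.standard_normal((1000, 384)).astype(np.float32)

# 1) Reduce dimension with a truncated SVD (PCA-style projection).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 64                                  # target dimension for the device
reduced = centered @ vt[:k].T           # shape (1000, 64)

# 2) Quantize to int8, cutting memory a further 4x versus float32.
scale = np.abs(reduced).max() / 127.0
quantized = np.round(reduced / scale).astype(np.int8)

# On-device similarity search can dequantize on the fly.
dequantized = quantized.astype(np.float32) * scale
```

Here the combined reduction (384 float32 values down to 64 int8 values per embedding) shrinks storage 24-fold, at the cost of a bounded rounding error per coordinate.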

Prerequisites: Programming (Python preferred), some knowledge of or exposure to deep learning, specifically transformers (e.g. Hugging Face Transformers, OpenAI's GPT-2/GPT-3), Linear Algebra

 

Title: Geometric signatures of (hierarchical) data

Name: Anna Gilbert

Project: Building trees to represent or to fit distances is a critical component of phylogenetic analysis, metric embeddings, approximation algorithms, and computational biology. It is, however, a challenging problem; indeed, many of the tree fitting problem formulations are hard (in a formal sense). Much of the previous algorithmic work has focused on generic metric spaces (i.e., those with no a priori constraints). These spaces do not capture the nature of data sets, especially those data sets that capture some sense of hierarchy. This project will explore two types of geometric signatures of (hierarchical) data and graphs, delta-hyperbolicity and average delta-hyperbolicity. We will compute these quantities for a variety of important test data sets and devise faster, approximate algorithms along the way.
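The central quantity can be computed directly from Gromov's four-point condition. The brute-force sketch below, in Python with NumPy, enumerates all quadruples and so runs in O(n^4) time, which illustrates why the faster approximate algorithms the project targets are needed. The example metrics (a path graph, which is a tree, and a 4-cycle) are illustrative only.

```python
import numpy as np
from itertools import combinations

def four_point_delta(D):
    """Gromov four-point delta-hyperbolicity of a finite metric space,
    given its distance matrix D, by brute force over all quadruples."""
    n = len(D)
    delta = 0.0
    for x, y, z, w in combinations(range(n), 4):
        # The three ways of pairing the quadruple into two pairs.
        sums = sorted([D[x][y] + D[z][w],
                       D[x][z] + D[y][w],
                       D[x][w] + D[y][z]])
        # Half the difference of the two largest pair-sums.
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta

# Path graph 0-1-2-3 (a tree): delta is 0, as for any tree metric.
path = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :])
print(four_point_delta(path))     # 0.0

# 4-cycle with unit edges: strictly positive delta.
cycle = np.array([[0, 1, 2, 1],
                  [1, 0, 1, 2],
                  [2, 1, 0, 1],
                  [1, 2, 1, 0]])
print(four_point_delta(cycle))    # 1.0
```

The average delta-hyperbolicity mentioned in the project replaces the maximum over quadruples with a mean, and can be estimated by sampling quadruples rather than enumerating them.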

Prerequisites: Linear algebra, probability, programming (Python or Julia)

 

Title: Dimension reduction and machine learning for tensors

Name: Deanna Needell, with Jamie Haddock

Project: Data is now not only everywhere, but it arrives in such vast quantities that computing becomes quite challenging and often impossible. Moreover, the structure of data is often complicated and multi-modal. For this reason, the algebraic tensor structure has become important in data science and computational methods. There are several tensor dimension reduction techniques that do not require the tensor to be transformed to a vector or matrix, and these can be used for machine learning and reconstruction tasks. In this project, we will study these techniques and develop new methods that work in the dimension reduced space directly. Applications range from imaging to medicine, and we will apply our approaches to both real and synthetic problems.
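A minimal sketch of one such technique, taking the truncated higher-order SVD (HOSVD) as a representative tensor dimension reduction that never flattens the tensor to a single vector or matrix; the tensor sizes and ranks below are illustrative assumptions.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: move axis `mode` first, flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, M, mode):
    """Multiply tensor X by matrix M along axis `mode`."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def hosvd(X, ranks):
    """Truncated HOSVD: per-mode factor matrices plus a small core."""
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
               for n, r in enumerate(ranks)]
    core = X
    for n, U in enumerate(factors):
        core = mode_product(core, U.T, n)
    return core, factors

def reconstruct(core, factors):
    X = core
    for n, U in enumerate(factors):
        X = mode_product(X, U, n)
    return X

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 9, 10))

# Full ranks: reconstruction is exact (up to floating point).
core, factors = hosvd(X, X.shape)
print(np.allclose(reconstruct(core, factors), X))    # True

# Truncated ranks: a compressed core on which one can compute directly.
core_small, factors_small = hosvd(X, (4, 4, 4))
print(core_small.shape)                              # (4, 4, 4)
```

Working "in the dimension reduced space directly", as the project proposes, would mean running learning or reconstruction algorithms on the small core (and factors) without ever re-forming the full tensor.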

Prerequisites: Linear Algebra, Probability, Programming (Python preferred)

 

Title: Graph-based active learning

Name: Andrea Bertozzi, with Harlin Lee

Project: This project will be about semi-supervised active learning using a graph approach. Graph-based machine learning algorithms use pairwise comparisons between pieces of data to construct a similarity graph. This project will focus on the “active learning” problem, in which specific data points are selected for labels as part of the training data. The algorithm selects the points and a “human in the loop” labels the data. We will consider a variety of high-dimensional remote sensing data such as hyperspectral images, LIDAR, and SAR data.
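The loop described above (build a similarity graph, propagate the few known labels, then query the most uncertain point for a human to label) can be sketched as follows. This is a toy stand-in: Gaussian clusters replace remote sensing features, and a Laplace-regularized least-squares propagator replaces the project's actual semi-supervised classifiers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian clusters in the plane stand in for
# high-dimensional remote sensing features.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y_true = np.array([0] * 20 + [1] * 20)
n = len(X)

# Gaussian similarity graph and its (unnormalized) Laplacian.
sq_dists = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / 2.0)
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

labeled = [0, 20]                       # one seed label per cluster
for _ in range(3):                      # tiny active-learning loop
    # Laplace-regularized least-squares fit to the known labels.
    mask = np.zeros(n)
    mask[labeled] = 1.0
    b = np.zeros(n)
    b[labeled] = y_true[labeled]
    f = np.linalg.solve(np.diag(mask) + 0.1 * L, b)

    # Active step: query the point whose score is most uncertain,
    # i.e. closest to the 0/1 midpoint; a human would then label it.
    unlabeled = [i for i in range(n) if i not in labeled]
    query = min(unlabeled, key=lambda i: abs(f[i] - 0.5))
    labeled.append(query)

pred = (f > 0.5).astype(int)
print("accuracy:", (pred == y_true).mean())
```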

Prerequisites: Linear Algebra, Programming (Python preferred); some experience with deep learning networks and/or autoencoders is helpful.

 

Title: Feature learning and optimization techniques for machine learning tasks

Name: Yifei Lou, with Cristina Garcia-Cardona 

Project: Graph-based techniques, which embed data sets into a weighted similarity graph with vertices and edges, are powerful and popular for their ability to capture the structure of the data and pairwise information. However, the success of graph-based approaches depends greatly on the quality of the features used to construct the graph and on the computational cost of dealing with high-dimensional data. This project aims to integrate quality feature learning into graph-based methods to facilitate data classification tasks. One example of a procedure that can be used to obtain high-quality features is the autoencoder, which can have various structures and levels of complexity; such approaches are unsupervised and thus do not require any labeled data. The research will also develop advanced optimization-based models, such as auction dynamics learning methods, maximum-flow and spectral approaches, for scalable computational efficiency. This project will be supplemented with applications in data science, such as hyperspectral and medical imaging.
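As an illustration of the graph-based classification step, the sketch below runs a spectral bipartition on a similarity graph built from features. Raw 2-D coordinates stand in here for the autoencoder-learned features the project envisions; the data and bandwidth are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "features": two well-separated groups of 2-D points; in the
# project these would instead be features learned by an autoencoder.
F = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])

# Weighted similarity graph on the features.
sq_dists = ((F[:, None] - F[None, :]) ** 2).sum(-1)
W = np.exp(-sq_dists)
np.fill_diagonal(W, 0)

# Symmetric normalized Laplacian  L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(F)) - D_inv_sqrt @ W @ D_inv_sqrt

# The sign pattern of the eigenvector for the second-smallest
# eigenvalue (the Fiedler vector) yields a two-way partition.
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
print(labels)
```

The quality of this partition hinges entirely on the features used to build W, which is the coupling the project aims to exploit by learning the features and classifying on the graph jointly.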

Prerequisites: Linear algebra, optimization, programming (Matlab or Python)

 

Title: Geometric Supervised Dimension Reduction with Path Metrics

Name: Anna Little, with Rongrong Wang

Project: This project will explore new approaches for constructing kernel matrices, a critical task for manifold learning and neural style transfer (i.e. using deep neural networks to learn and transfer the style of an image/audio to another). In supervised learning, a major challenge is how to efficiently incorporate the information carried by the response variables into the kernel matrix. Existing methods based on the Euclidean dissimilarity between pairwise responses are well-explored but unsuitable for nonlinear data. This project will explore the benefits of using geometric measures on both the input features and the response variables. In particular, we will focus on the power weighted shortest path metric, which is a data-driven metric enjoying a rich geometric framework and desirable properties for clustering and dimension reduction. The resulting kernel matrix will be built by combining local gradient information of the labels with power-weighted shortest path distances to stretch the data in directions useful for prediction. These kernels will be evaluated in the context of data visualization and style learning using generative adversarial networks.
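The power weighted shortest path metric admits a compact brute-force computation: raise each pairwise Euclidean distance to the power p, then solve an all-pairs shortest path problem on the resulting complete graph. The sketch below uses Floyd-Warshall on illustrative data to show how, for p > 1, the metric contracts distances along densely sampled directions.

```python
import numpy as np

def power_weighted_path_metric(X, p=2):
    """All-pairs power-weighted shortest path distances: each edge
    costs ||x_i - x_j||^p, minimized over paths (Floyd-Warshall)."""
    diff = X[:, None] - X[None, :]
    D = np.sqrt((diff ** 2).sum(-1)) ** p    # direct edge costs
    n = len(X)
    for k in range(n):                       # relax through each waypoint
        D = np.minimum(D, D[:, k, None] + D[None, k, :])
    return D

# Three collinear points: with p = 2, hopping through the middle point
# (cost 1 + 1 = 2) beats the direct edge (cost 2^2 = 4), so the path
# metric favors routes through densely sampled regions.
X = np.array([[0.0], [1.0], [2.0]])
D = power_weighted_path_metric(X, p=2)
print(D[0, 2])    # 2.0
```

This data-adaptive contraction is what makes the metric attractive for clustering and dimension reduction on nonlinear data, as described above.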

Prerequisites: Linear algebra, programming