Towards Human-Level Recognition via Contextual, Dynamic, and Predictive Representations

Fisher Yu
University of California, Berkeley (UC Berkeley)

Existing state-of-the-art computer vision models usually specialize in a single domain or task. This specialization isolates different vision tasks from one another and hinders the deployment of robust and effective vision systems. In this talk, I will discuss unified image representations suitable for different scales and tasks through the lens of pixel-level prediction. The connections across scales and tasks, established through the study of dilated convolutions and deep layer aggregation, help interpret the behavior of convolutional networks and lead to model frameworks applicable to a wide range of tasks. Beyond scales and tasks, I will argue that a unified representation should also be dynamic and predictive. I will illustrate this with input-dependent dynamic networks, which lead to new insights into the relationship between zero-shot/few-shot learning and network pruning, and with semantic predictive control, which uses prediction to improve driving policy learning. To conclude, I will discuss ongoing system and algorithm investigations that couple representation learning with real-world interaction to build intelligent agents that can continuously learn from and interact with the world.
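
As a rough illustration of why dilated convolutions suit pixel-level prediction across scales, the following minimal NumPy sketch spaces the kernel taps `dilation` samples apart, so the receptive field grows without downsampling and without extra weights. It is a one-dimensional toy, not the architecture discussed in the talk; the function names `dilated_conv1d` and `receptive_field` are hypothetical and chosen only for this example.

```python
import numpy as np


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation with taps spaced `dilation` samples apart."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Number of input samples covered by one output value.
    span = (len(kernel) - 1) * dilation + 1
    n_out = len(signal) - span + 1
    if n_out <= 0:
        raise ValueError("signal is shorter than the dilated kernel span")
    out = np.empty(n_out)
    for i in range(n_out):
        # Pick every `dilation`-th sample inside the window.
        taps = signal[i : i + span : dilation]
        out[i] = np.dot(taps, kernel)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    x = np.arange(8)
    print(dilated_conv1d(x, [1, 1, 1], dilation=2))
    # Same depth and weight count, different coverage of the input.
    print(receptive_field(3, [1, 1, 1]), receptive_field(3, [1, 2, 4]))
```

With three 3-tap layers, doubling the dilation at each layer covers 15 input samples instead of 7, while the output keeps the input's resolution apart from border effects.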