Image analysis is the problem of parsing an image into an abstract, relational description of its components. Applications abound in industrial automation, target recognition, security, data mining, and content-based search. Yet forty years of well-funded efforts by engineers, cognitive scientists, computer scientists, and mathematicians have done little to close the “ROC gap”: biological vision remains superior, especially at high detection rates, where artificial systems suffer extremely low specificities.
In the cognitive sciences, compositionality refers to the apparent ability of humans to represent perceptions and thoughts in terms of a structured hierarchy of “parts,” meaning sub-representations that are themselves hierarchical and reusable as components of other perceptions and thoughts. Language representation is the prototypical example, but the dual principles of hierarchy and reusability may prove to be central to all of cognition. I will discuss possible roles of hierarchy and reusability in computer vision, and I will develop a prototype compositional vision system.
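To make the two principles concrete, here is a minimal sketch (my own illustration, not the representation used in the prototype system): a part is defined once and then reused as a component of several higher-level compositions, which are themselves reusable.

```python
# Toy sketch of hierarchy and reusability: parts are defined once and
# reused as children of many higher-level compositions. All names here
# (Part, v_stroke, letter_L, ...) are illustrative placeholders.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Part:
    name: str
    children: Tuple["Part", ...] = ()  # empty tuple marks a primitive part

# Primitive parts are defined once...
v_stroke = Part("vertical-stroke")
h_stroke = Part("horizontal-stroke")

# ...and reused inside several compositions, each of which is reusable.
letter_L = Part("L", (v_stroke, h_stroke))
letter_T = Part("T", (h_stroke, v_stroke))
plate_fragment = Part("LT", (letter_L, letter_T))
```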
By way of background and foundational material, I will (i) attempt to identify the main sources of the poor performance of computer vision as compared with biological vision; (ii) discuss the probabilistic generative approach to complex inference problems, emphasizing the importance of proper normalization; and (iii) review the basic relationships connecting probability distributions, graphical structures, and computational complexity, emphasizing tree-structured graphs and their relationship to grammars and parsing.
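To recall why tree-structured graphs are computationally attractive, the sketch below (my own toy example, with random placeholder potentials) performs exact inference on a tiny tree by leaf-to-root dynamic programming; the cost grows linearly with the number of edges rather than exponentially with the number of variables.

```python
# Exact inference on a tiny tree (root r with leaf children a, b) via
# leaf-to-root message passing: O(edges * K^2) instead of O(K^#variables).
import numpy as np

rng = np.random.default_rng(0)
K = 3  # states per variable

phi = {v: rng.random(K) for v in ("r", "a", "b")}  # unary potentials
psi = rng.random((K, K))                           # psi[parent, child]

def message(child):
    # m_{child -> parent}[j] = sum_k psi[j, k] * phi_child[k]
    return psi @ phi[child]

Z = np.sum(phi["r"] * message("a") * message("b"))  # partition function

# Brute-force check over all K**3 joint configurations.
Z_brute = sum(
    phi["r"][r] * psi[r, a] * phi["a"][a] * psi[r, b] * phi["b"][b]
    for r in range(K) for a in range(K) for b in range(K)
)
assert np.isclose(Z, Z_brute)
```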
Probabilistic generative models are Bayesian: a probability distribution (the “prior”) is placed on a suitably defined space of interpretations, and, given an interpretation, an observation model defines a conditional data likelihood, preferably at the level of pixel intensities. In principle, and to a degree in practice, images sampled from the model can be used to validate assumptions or to suggest improvements.
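The recipe can be written down in a few lines. The sketch below is a deliberately crude stand-in, not the talk's model: the “interpretation” is a single bright square, the observation model is Gaussian pixel noise, and every name and parameter is a placeholder of mine.

```python
# Generative recipe: sample an interpretation from the prior, render it
# to pixel intensities, and score data with the conditional likelihood.
import numpy as np

rng = np.random.default_rng(0)

def sample_interpretation():
    # Toy prior: an 8x8 bright square at a uniformly random location.
    x, y = rng.integers(0, 24, size=2)
    return {"x": int(x), "y": int(y)}

def render(interp, size=32):
    # Deterministic template image for a given interpretation.
    img = np.zeros((size, size))
    img[interp["y"]:interp["y"] + 8, interp["x"]:interp["x"] + 8] = 1.0
    return img

def log_likelihood(image, interp, sigma=0.1):
    # Gaussian pixel-noise observation model: log p(image | interpretation),
    # with the interpretation-independent additive constant omitted.
    resid = image - render(interp)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

# Sampling from the full model (prior plus pixel noise) produces images
# that can be inspected to validate assumptions or suggest improvements.
interp = sample_interpretation()
sampled_image = render(interp) + rng.normal(0.0, 0.1, (32, 32))
print(log_likelihood(sampled_image, interp))
```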
I will develop a demonstration system, probabilistic and generative, built upon a “Bayesian net” (Markov random field) backbone. I will argue, and attempt to demonstrate, that the Markov property, although computationally convenient, is untenable for image and language parsing: Markov models are too weak, in the sense that their coverage is too broad. I will then introduce a non-Markovian perturbation of the Markov backbone that dramatically improves the fit while preserving some of the computational advantages of Bayesian nets. Experiments in license-plate reading and face detection illustrate the importance of hierarchy, reusability, and non-Markovian dependencies.
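One way to picture such a perturbation (a hedged toy of my own, not the model developed in the talk) is to tilt a chain-structured distribution by a long-range factor and renormalize; the endpoint coupling below is exactly the kind of dependency that no chain-structured Markov model can encode on its own.

```python
# Non-Markovian perturbation of a Markov backbone: multiply the chain's
# unnormalized probability by a long-range endpoint-agreement factor,
# then renormalize. Potentials and the tilt strength lam are made up.
import itertools
import numpy as np

K, N = 2, 5                                # binary variables x_1..x_5
psi = np.array([[2.0, 1.0], [1.0, 2.0]])   # neighbor "agreement" potential

def markov_score(x):
    # Unnormalized Markov backbone: product of nearest-neighbor potentials.
    return np.prod([psi[x[i], x[i + 1]] for i in range(N - 1)])

def perturbed_score(x, lam=2.0):
    # Tilt by a factor coupling x_1 and x_N directly.
    return markov_score(x) * np.exp(lam * (x[0] == x[-1]))

configs = list(itertools.product(range(K), repeat=N))
Z = sum(perturbed_score(x) for x in configs)           # renormalization
probs = {x: perturbed_score(x) / Z for x in configs}

# The tilted model concentrates mass on configurations whose endpoints agree.
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```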