Towards Closer Reading of Very large-scale Text Corpus

Vwani Roychowdhury
University of California, Los Angeles (UCLA)
Professor, Electrical Engineering

A number of topic modeling techniques, including the LDA framework, have found widespread use for distant, coarse-grained reading of medium- to large-scale corpora. These methods can effectively summarize a corpus into hundreds of topics, each a distinct distribution over words or terms, that together best explain the underlying set of documents. The organization of topics and concepts in the human brain is, however, very different: the memory models we naturally use are much more granular, contextual, and clustered than the statistical independence-based graphical models behind most topic modeling techniques. Moreover, our brains use perceptual and contextual learning techniques that allow us to scale, whereas extant topic models rely on global optimization techniques that make them computationally prohibitive. A second necessary ingredient missing from distant reading is a representation of the relationships, sentiments, and opinions among the various subjects, objects, and concepts that underlie real-world events. If we are to understand culture in more depth, we need close-reading computational tools that scale. In the first part of the talk, I will review some of the work my group has been doing on developing computational tools for scalable close reading. In the second part, I will outline my vision of a platform for creating a perpetual learning machine for close reading, and explain why I am optimistic that, with sufficient resources and collective effort, one can build a computing machine that learns the way humans do and can come close to becoming a close reader. While my vision has intersections with deep learning, it is distinct and, I believe, follows the true architecture of our brain much more closely than the deep learning model, which was first conceived in the 1940s.


Workshop IV: Mathematical Analysis of Cultural Expressive Forms: Text Data