Unsupervised Learning of Natural Languages

David Horn
Tel Aviv University
Mathematics

We address the problem, fundamental to linguistics, bioinformatics and certain other disciplines, of using corpora of raw symbolic sequential data to infer the underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, or sheet music), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (Automatic DIstillation of Structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This is the first time an unsupervised algorithm has been shown capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.


This lecture is based on joint work with Zach Solan, Eytan Ruppin and Shimon Edelman, published in PNAS in August 2005.
