Current advances in high-throughput biology are accompanied by a tremendous increase in the number of related publications.
The ability to rapidly and effectively survey the literature can support the design and interpretation of large-scale experiments, as well as the curation of structured knowledge in public biomedical databases.
In an effort to meet these goals, a variety of text-mining methods are being applied in the biomedical domain.
The talk will briefly survey such methods and present two applications in which we use information retrieval in non-traditional ways to directly support biomedical discovery.
The first (joint work with Edwards, Wilbur, and Boguski, 2000) employs a probabilistic model of themes in scientific text and uses thematic analysis of biomedical literature to establish functional relations among genes.
The second (joint work with Hoeglund, Brady, Blum, Doennes, and Kohlbacher, 2007) is a new system, SherLoc, which uses text classification and integrates text with protein sequence data to significantly improve prediction of protein subcellular localization.