Molecular biology, genetics and protein engineering have been slowly morphing into large-scale, data-driven sciences that can leverage machine learning and applied statistics. My talk will be a quick tour of several projects at this intersection. I will start by explaining some modelling challenges in finding the genetic underpinnings of disease. Genome- and epigenome-wide associations, wherein individual (epi)genetic markers or sets of markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found with a simple statistical test. However, a wide variety of structure and confounding factors, such as cell type heterogeneity and population structure, lie hidden in the data and lead to both spurious and missed associations if not properly addressed. Once we uncover genetic causes, genome editing may one day let us fix the genome in a bespoke manner. I will describe how we developed state-of-the-art machine learning approaches for CRISPR guide design. Finally, I will close with a teaser of some of our new work in machine-learning-based protein optimization, wherein we seek to find, for example, the protein sequence that will give us desired fluorescence properties.
Back to Workshop III: HPC for Computationally and Data-Intensive Problems