In the last decade, as the scale of available genome and protein sequence data has increased dramatically, a number of mathematical and statistical models and algorithms have been developed to address some of the questions raised by the analysis of these biological sequences. For example, we now have a much larger bag of tools for the identification of genes, exon boundaries, and regulatory elements in DNA sequences, and of domains in amino acid sequences.
The present availability of complete genome sequences for a considerable number of species, together with measures of expression and a growing collection of protein structures, opens further challenges and opportunities. For example, evolution and speciation can be studied using entire genome sequences, rather than a few genes. The extent of functional variation within the same species can start to be explored in a more systematic fashion. Cross-genome comparisons appear to be an important tool in unravelling the more subtle aspects of gene regulation and splicing. Given the growing amount of information on protein abundance, it is possible to explore post-transcriptional mechanisms of gene regulation. The coordinated efforts invested in the gathering of protein structures may enable a successful investigation of the connection between sequence and structure.
Taken together, these possibilities suggest that the larger and more comprehensive datasets now available, and the knowledge we have acquired so far, begin to enable us to bridge the gap between the static nature of sequence information and the dynamics of biological systems. Coded in DNA is the information behind thousands of biological pathways, only a small fraction of which are actually activated in a cell at any given time. Gene identification is perhaps the first step in decoding DNA, but it provides us with only a very abstract picture of all the proteins that a cell could possibly synthesize. The identification of binding sites for regulatory proteins starts to provide us with information on possible network connections between genes, but only a fraction of such connections are actually active and meaningful in any specific condition. DNA contains further, more subtle information that determines the cell's dynamic behavior and that we can begin to explore. For example, there are elements in DNA sequence that determine alternative splicing, control the decay of mRNA transcripts, and govern the cooperation or competition of different regulatory proteins in controlling the expression of genes; the presence of alternate pathways can be detected at the sequence level using intraspecies comparison.
The 5-day workshop will be devoted to exploring how sequence analysis can take advantage of the recently acquired datasets and contribute to a mechanistic understanding of the cell as a system. As outlined above, tackling the opportunities presented by the contemporary abundance of data will require combining expertise in sequence evolution, gene finding, motif recognition, and alignment.
(Harvard University, Department of Statistics)
Chiara Sabatti (UCLA, Department of Human Genetics and Statistics)