Proteomics: Sequence, Structure, Function

March 8 - June 11, 2004

Overview

An organism’s proteome is the collection of all proteins that the organism makes. It differs from the genome in several ways. Pre-translational events such as alternative splicing can join several coding regions of DNA (“exons”) in more than one way, and the choice of splicing pattern is one mechanism by which protein expression is regulated. Post-translational events such as phosphorylation may determine whether a protein is active. Which proteins are present in a given cell at a given time, and in what quantities, is highly variable, whereas the genome is static.
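As a rough illustration of how alternative splicing multiplies the number of possible products, the sketch below enumerates transcripts under a deliberately simplified model: the first and last exons are always kept, and any subset of the internal exons may be skipped while order is preserved. The function name and the exon sequences are made up for the example, and real splicing is far more constrained than this.

```python
from itertools import combinations

def splice_isoforms(exons):
    """Enumerate transcripts under a simple exon-skipping model.

    Assumption: the first and last exons are always retained, and any
    order-preserving subset of the internal exons may be included.
    Requires at least two exons.
    """
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        # combinations() preserves the original exon order
        for kept in combinations(internal, k):
            isoforms.append("".join((first, *kept, last)))
    return isoforms

# Hypothetical exon sequences, chosen only for illustration
exons = ["ATGGCC", "GATTTC", "AAGGGT", "CTGTAA"]
for transcript in splice_isoforms(exons):
    print(transcript)
```

Under this model a gene with n internal exons yields 2^n candidate transcripts, which gives some intuition for why the proteome can be so much larger than the gene count.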

Proteomics necessarily involves a greater level of complexity than genomics. The human proteome, the collection of all proteins generated by the human genome, is estimated to contain 10 million to 20 million proteins, about 2-3 orders of magnitude more than the number of human genes.

The practical distinction between genomics and proteomics is one of focus: whether DNA or proteins are the central object of study. Genomics took off ahead of proteomics chiefly because techniques for high-throughput sequencing became available, through a convergence of innovative biotechnologies and ingenious fast algorithms. Because DNA is a linear sequence over 4 symbols, high-throughput comparison of DNA sequences with one another added a further dimension to the subject. This has generated many interesting problems, and progress continues in many directions as mathematical abstraction, primarily in the form of new algorithms and statistical methods, meets biological reality. New technologies, such as microarrays, the subject of IPAM’s first long program, continue to stir things up. This area of research is so active that many proteomic strategies incorporate a genomic component.

Proteins direct almost all biological functions, but how any given protein acts is rarely transparent. Proteins do not function independently, but interact in highly complex networks, which in turn influence the intricate regulatory mechanisms that control how much of each protein is produced. These regulatory mechanisms, despite many tantalizing clues, remain mysterious and are difficult to model. Determining which proteins interact, and how they fit into a network of interactions, is a fundamental problem, and it becomes even more challenging if one wants to do it in an automated way. Many current approaches attempt to integrate information from several sources, for example microarray data showing which proteins are up- and down-regulated together, and cross-genome analysis of whether the genes for two proteins have evolved together.
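One simple way to turn "regulated together" into a number is to correlate expression profiles across conditions. The following is a minimal sketch, not a method endorsed by the program: it computes the Pearson correlation for every pair of genes and keeps pairs above a threshold. The gene names, the expression values, and the 0.9 cutoff are all invented for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene, gene, r) for every pair whose correlation is >= threshold."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical expression levels measured under four conditions
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
for a, b, r in coexpressed_pairs(profiles):
    print(a, b, round(r, 3))
```

Correlated expression is only one line of evidence and does not by itself show that two proteins interact, which is why approaches of this kind combine it with other data sources.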

The sequence of amino acids in each protein is determined by genomic data, plus some variation coming from alternative splicing. The secondary structure of the protein (alpha-helices, etc.) and its 3-D configuration are determined by the amino acid sequence, but we are very far from being able to predict the structure from the sequence. Approaches vary widely, from ab initio computation based on quantum mechanical forces to homology modeling, which uses data mining to compare a sequence with those of proteins whose secondary or 3-D structure is known.

The 3-D configuration of a protein is important in determining its function. Certain regions of the protein mainly serve to hold certain active regions in the right position, so that they can interact correctly with active regions of other proteins. Finding the right way to determine the active region of a protein, even once its shape is known, is a topic of active research.

There are two basic ways to learn of a protein’s existence: by finding the nucleotides that code for it in the organism’s DNA, or by finding it already assembled in the organism’s cells. The first method is much less direct, and less certain, because identifying exons within the DNA is still an art, and reconstructing how the exons are spliced is uncertain. A second disadvantage is that one does not have the protein itself in hand, to study its chemistry and structure. The big advantage of this approach is that it is high-throughput, so that one can find the sequences of vast numbers of hitherto unknown proteins. A second advantage of this genetic approach is that one can study the effect of the protein indirectly by producing “knockout” organisms, which differ from the wild type in having exactly one gene disabled. New technologies now hold out the possibility of a high-throughput approach that starts with the protein rather than with the DNA that codes for it.
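The easy part of the DNA-first route, reading a protein sequence off an already identified coding sequence, can be sketched with the standard genetic code. The example below is illustrative only: it assumes an intron-free sequence, translates from the first ATG to the first stop codon, and ignores exon finding and splicing entirely, which the paragraph above identifies as the hard and uncertain steps. The function name and the input sequence are made up.

```python
# Standard genetic code, with codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_first_orf(dna):
    """Translate from the first ATG up to (not including) the first stop codon.

    Assumes the input contains only A, C, G, T and has no introns.
    Returns an empty string if no start codon is found.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)

# Hypothetical sequence: two leading bases, then ATG GCC AAG TAA
print(translate_first_orf("CCATGGCCAAGTAAGG"))
```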

Proteomics is important to basic science, for elucidating the fundamental mechanisms of biology. It is also of great interest to the biotech industry, since proteins whose function and active regions are known provide potential targets for drugs.

The goal of this long program is to bring together mathematicians, biologists, and researchers from industry to study these issues.

Organizing Committee

Tim Ting Chen (University of Southern California)
David Eisenberg (UCLA)
Scott Fraser (California Institute of Technology)
Jing Huang (UCLA)
Simon Tavaré (University of Southern California)
David Wild (Keck Graduate Institute)