Protein Folding -- Predicting Structure And Function

Protein Folding

Proteins are both the engines and the building blocks of all living things, so understanding their structure and behavior is essential to understanding how living things operate. My thesis project is a computer program designed to predict the three-dimensional structure of proteins given only their amino acid sequence. This system is also the first of a family of computer programs whose purpose is to assist analysts exploring protein structure and function. The analysis is performed entirely in the digital domain, using only existing DNA and RNA sequence data and databases of protein homology and characteristics.

How Proteins Fold

All proteins in nature are made up of chains of molecules called "amino acids". Cells create proteins by "translating" them from RNA sequences (which are themselves "transcribed" from DNA sequences). When proteins are translated from RNA they start out as linear sequences of amino acids. Because the amino acids that make up a protein have various electrostatic and mechanical properties, the protein doesn't stay in this unfolded form for long and begins to fold up into a three-dimensional structure. It is this three-dimensional structure (as well as the mechanical and electrostatic properties of the amino acid sequence) that gives the protein its functionality.

For example, two proteins might fold themselves in such a way that one protein presents a "lock" binding site to the other protein's corresponding "key". Fitting the key into the lock produces an electrochemical reaction that performs some essential cellular function.

The translated sequence of amino acids that forms a protein is called the protein's "primary structure". The local shapes that stretches of the chain adopt in three-space, such as helices and sheets, are called the protein's "secondary structure". The secondary structure of a protein is determined in large part by the mechanical and electrostatic effects of neighboring amino acids. Proteins also have "tertiary" and "quaternary" structures. The tertiary structure refers to the overall folding path of a protein. For example, a protein might have a helical secondary structure whereas its tertiary structure might fold the overall protein into a "supercoil" where the helical protein coils around itself. The mechanics of how a protein can fold determine its structure. Tertiary structure prediction is the hard part and the focus of my thesis project, although to predict an overall fold, all constraints from local to global folding must be considered.

The quaternary structure of a protein refers to an assemblage of multiple protein chains along with the so-called "post-translational modifications" to those chains. "Post-translational modification" means an alteration made to the protein chain after translation, beyond what its amino acid sequence alone specifies. A good example is the addition of a "heme group" to hemoglobin molecules. Without this heme group, red corpuscles would be unable to carry oxygen.

One feature of proteins in nature that seems to be very consistent is that when they fold, they fold into the lowest-energy structure available to them. That is to say, the amino acids come to rest and the protein expends no energy to maintain its structure. This fact provides us with a key to reliably predicting a protein's structure. Theoretically, all we have to do is find the optimal (lowest-energy) conformation among all the possible conformations a protein can take.

In practice, however, this is an impractical solution. The amount of time required to test all possible conformations of a reasonably sized protein is far greater than the age of the universe, even on the fastest computers.
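
The scale of the problem can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the number of conformations per amino acid, the chain length, and the evaluation rate are all assumed values chosen for the example, not figures from this project.

```python
# Rough estimate of how long an exhaustive conformational search would take.
# All three constants are illustrative assumptions, not measurements.
STATES_PER_RESIDUE = 3            # assumed conformations available to each amino acid
CHAIN_LENGTH = 100                # assumed length of a modest protein
CONFORMATIONS_PER_SECOND = 1e12   # assumed (generous) evaluation rate

# Roughly 13.8 billion years expressed in seconds.
AGE_OF_UNIVERSE_SECONDS = 13.8e9 * 365.25 * 24 * 3600


def exhaustive_search_seconds(chain_length,
                              states=STATES_PER_RESIDUE,
                              rate=CONFORMATIONS_PER_SECOND):
    """Seconds needed to evaluate every conformation once."""
    return states ** chain_length / rate


seconds = exhaustive_search_seconds(CHAIN_LENGTH)
ratio = seconds / AGE_OF_UNIVERSE_SECONDS
print(f"{seconds:.2e} seconds, about {ratio:.1e} times the age of the universe")
```

Under these assumptions the search space grows as a power of the chain length, so faster hardware only shifts the limit by a few amino acids.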

Hydrophobic Packing Models

One of the properties of amino acids which is thought to determine most of a protein's resulting structure is the amino acid's "hydrophobicity", or its aversion to water. This makes sense, because all proteins fold within a cytoplasmic medium which consists mostly of water. If one labels each amino acid as "hydrophilic" or "hydrophobic" and then considers this property as the only mechanism of folding (while retaining the protein's sequential structure), then one has a macroscopic model for folding abstractions of proteins: hydrophobic amino acids move towards each other and the protein's "center", away from the cell's cytoplasm. To further simplify the problem (but not remove its essential computational complexity) we can perform this folding within a discrete Cartesian lattice space.

Such abstract models of proteins are termed "Hydrophobic Packing" models or "HP models" for short (the letters are more commonly read as "hydrophobic-polar") and have been investigated by many researchers, most notably K. A. Dill.
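
The lattice abstraction described above can be sketched in a few lines of code. This is a minimal illustration, not the program developed in this project: it assumes a two-dimensional square lattice, a hypothetical set of amino acids treated as hydrophobic, and the common convention of scoring -1 for every pair of hydrophobic amino acids that touch on the lattice without being neighbors in the chain. The function names are invented for the example.

```python
# Hypothetical set of one-letter codes treated as hydrophobic;
# real hydrophobicity scales differ in detail.
HYDROPHOBIC = set("AVLIMFWC")


def to_hp(sequence):
    """Label each amino acid as 'H' (hydrophobic) or 'P' (hydrophilic)."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())


def is_valid_fold(coords):
    """True if consecutive residues are lattice neighbors and no site is reused."""
    if len(set(coords)) != len(coords):
        return False
    return all(abs(x1 - x2) + abs(y1 - y2) == 1
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def hp_energy(hp, coords):
    """Score -1 per pair of H residues adjacent on the lattice but not in the chain."""
    if len(hp) != len(coords) or not is_valid_fold(coords):
        raise ValueError("coords must be a self-avoiding lattice walk matching the sequence")
    index_at = {site: i for i, site in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if hp[i] != "H":
            continue
        # Look only right and up so that each touching pair is counted once.
        for site in ((x + 1, y), (x, y + 1)):
            j = index_at.get(site)
            if j is not None and hp[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


chain = to_hp("MKVL")                      # labels the sequence as H/P
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # chain folded around a unit square
line = [(0, 0), (1, 0), (2, 0), (3, 0)]    # fully extended chain
print(chain, hp_energy("HPPH", square), hp_energy("HPPH", line))
```

In this scoring, the folded square brings the two hydrophobic ends of "HPPH" into contact and scores lower than the extended chain, which is the behavior the model is meant to capture.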

These abstracted protein models are no less difficult to solve computationally. Abstracting the problem just removes the noise and allows us to focus on the core difficulties of predicting protein structure. To this day, both the protein folding problem and the prediction of abstracted protein folds remain unsolved.

Even after the protein has been abstracted, the folding problem appears to retain its NP-complete characteristics. This is good, because we want to remove the background noise without removing the constraints that make the problem difficult to solve, so that a solution remains scalable to folding real-world proteins. Since the protein folding problem is generally regarded as NP-complete, I have discarded from consideration any conventional problem-solving techniques (such as exhaustive search of the solution space). Any solution to this problem must fold real-world size proteins within polynomial time.
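
To see why exhaustive search is ruled out even for the abstracted problem, one can count the candidate folds directly. The sketch below is an illustration under stated assumptions, not part of this project's method: it assumes a two-dimensional square lattice and counts every self-avoiding path a chain could take, including rotations and reflections of the same shape. The function name is invented for the example.

```python
def count_self_avoiding_walks(steps):
    """Count self-avoiding walks with the given number of steps on a 2D square lattice."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(position, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        x, y = position
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in visited:
                # Try this lattice site, recurse, then backtrack.
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, steps)


# A chain of n amino acids has n - 1 bonds, hence n - 1 lattice steps.
for residues in range(2, 12):
    print(residues, count_self_avoiding_walks(residues - 1))
```

The count grows by a roughly constant factor with every added amino acid, so enumeration that is instant for ten residues is hopeless for chains of realistic length.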