Protein Folding -- Predicting Structure And Function

Exhaustive Search Techniques

Now, you may be saying to yourself "since proteins fold themselves into an optimal conformation in nature, finding the optimal fold for a protein should be a simple matter of using a computer to try all possible conformations and then pick out the optimal fold".

Simple in concept and simple to implement as a computer program, but not simple for even the fastest computers on earth to solve because there are just too many combinations of conformations to test. The runtime complexity to fold even a simple abstract protein only 10 amino acids long is 5**10 comparisons. Given that each comparison requires a microsecond to complete, 30 years will be required to find an admisible optimal protein fold.

And that's just a simple abstract protein only 10 amino acids in length!

Using exhaustive search, the time required to find the optimal conformation for a decent sized human protein would require more time than in the current age of the universe. More intelligent methods must be discovered to break down the complexity of these folding problems.

Enter The Genetic Algorithm

Since we can't expect to use exhaustive search techniques or admissible heuristics to solve this problem, many research efforts have turned to the genetic algorithm to find optimal solutions. In this role, genetic algorithms serve as smart search procedures while also incorporating a learning mechanism for enriching the search space. And these smart search techniques have been shown to converge on solutions in far less time than canonical search techniques.

The problem is that there is no free lunch. Unlike exhaustive search techniques, genetic algorithms utilize non-admissible heuristics and thus cannot gaurentee polynomial time solutions... and even if a GA arrived at the optimal solution, it wouldn't know that the solution is optimal because it hasn't examined all of the conformations possible. Gaurenteeing a solution is tantemount to saying that you know the optimal solution's fitness value ahead of time. But since fitness relies on structure, you would have to know the structure ahead of time as well. Clearly this is contradictory.

This is the so-called "termination problem" in GAs, and is also an unsolved problem.

Project Goals

The goals of my thesis project are thus twofold:

Reduce the search space so as to find more optimal or near-optimal solutions. A major difficiency of conventional GA approaches is that they spend a lot of time searching the erroneous parts of the solution space. My GA approach eliminates invalid folds so that more energy is applied to searching spaces that have a greater likelyhood of containing solutions. The solution space is reduced in size until an optimal, or set of optimal solutions, is found.
Solve the termination problem. Clearly we need to know the one optimal solution (if it exists) when we see it. This is not an easy problem to solve and is addressed by accepting results that are within a certain percentage of the theoretical optimal.

Performance Goals

The performance goal for this system is to fold a 10,000 amino-acid chain in low-order polynomial time, making the program a useful starting point for predicting realistic protein structures. Experiments conducted in "blocks world domains" where the proteins are only a few dozen amino acids long are totally unacceptable.

To verify that this scheme works in polynomial time, we will compare the theoretical worst-case performance of a theoretical exhaustive search approach against the experimental results obtained by the project. We will assume that the theoretical model uses atomic operations that require a microsecond to complete (which is quite generous). If we assume that one operation requires one microsecond then an exhaustive search approach should require 5**n search comparisons where n is the length of the protein in amino acids.

Now we choose a protein string length that would make the folding experiment intractible for practical purposes. A protein with a mere 25 amino acids requires 5**25 microseconds or 298023223876.9 seconds. That's about 94,502 years. If the GA approach to protein folding manages to terminate with the optimal solution (question - how do we know the optimal solution if we can't use an exhaustive search technique?) then we know that the system is running within polynomial time.