A TEXT BOOK ON COMPUTATIONAL MOLECULAR BIOLOGY- BY ZAHOORULLAH S MD: 2011

UNIT IV: PROTEOMICS I

Introduction to proteins

Proteins are polypeptides, which are made up of many amino acids linked together as a linear chain. The structure of an amino acid contains a amino group, a carboxyl group, and a R group which is usually carbon based and gives the amino acid it's specific properties.These properties determine the interactions between atoms and molecules, which are: van der Waals force between temporary dipoles, ionic interactions between charged groups, and attractions between polar groups.

Proteins form the very basis of life. They regulate a variety of activities in all known organisms, from replication of the genetic code to transporting oxygen, and are generally responsible for regulating the cellular machinery and determining the phenotype of an organism. Proteins accomplish their tasks in the body by three-dimensional tertiary and quaternary interactions between various substrates. The functional properties depend upon the proteins three-dimensional structure. The (3D) structures arise because particular sequences of amino acids in a polypeptide chain fold to generate, from linear chains, compact domains with specific structures. The folded domains either serve as modules for larger assemblies or they provide specific catalytic or binding sites.

Protein Stability

As odd as it may seem native (folded) proteins are only marginally stable under physiological conditions. Other forces such as hydrophobic effects, electrostatic interactions, and hydrogen bonding act more as stabilizing factors and are the main factors in driving the protein folding process.

Hydrophobic effects cause the nonpolar substances to minimize their contact with water, which is the major determinant of native protein structure. The aggregation of the nonpolar side chains in the interior of a protein is favored by the increase in entropy of the water molecules that would otherwise form ordered cages around the hydrophobic groups.

In the interior of the protein where the molecules are closely packed, van der Waals forces are relatively weak, but only act for a short time because these forces are lost when the protein is unfolded.

Hydrogen bonding is important because proteins fold in such a way to prevent hydrogen bonds, because the stabilizing energy of the hydrogen bond would be lost when folding and unfolding occurred.

Largely the residues that occupy the interior of the protein direct protein folding. Returning to its native conformation occurs within a few seconds, which supports the idea that proteins have direct pathways to arrive at the native state. As the protein folds its free energy decrease which makes it a one-way process. The last stages of protein folding depend on the specific sequence of the amino acids.

Thermodynamics

Entropy - The disorder of matter. *Entropies of more complex molecules are larger then those of simpler molecules, especially in a series of closely related compounds.

Second Law of Thermodynamics - The total entropy of the universe is continually increasing.

Protein stability depends basically in the free energy change between the folded and unfolded states, which is expressed by,

-RT ln K = DG = DH - TDS

Where R represents the Avogadro number, K, the equilibrium constant, G, the free energy change between folded and unfolded, H, the enthalpy change and S, the entropy change from folded to unfolded. The enthalpy change, H, corresponds to the binding energy (dispersion forces, electrostatic interactions, van der Waals potentials and hydrogen bonding) while hydrophobic interactions are described by the entropy term, S. Proteins become more stable with increasing negative values of, G. As the binding energy increases or the entropy difference between the two states decreases, the folded protein becomes more stable.

The thermodynamics of protein stability can best be model by the Energy Landscape Theory. This describes where the energy of a protein is a function of the topological arrangement of the atoms. A spatial surface with a very large number of different co-ordinates and energy values separated by mountains and ridges. Each value in this surface describes the protein in a specific conformation, and there is an energy landscape for each state of the protein.

Protein identification

Recent advances in protein methods have led to the application of mass spectrometry to the identification of proteins by Peptide Mass Fingerprinting (PMF). In this process, target proteins are isolated by SDS gel electrophoresis and are digested directly in the gel slice with trypsin, other proteases or cleaving chemicals. The resulting peptides are extracted and the unfractionated mixture is analyzed by MALDI-TOF mass spectrometry. The masses of the resulting peptides are used to query a database of protein and DNA sequences for likely candidate proteins. This is accomplished by matching the measured masses with masses predicted from the databases after 'virtual' digestion with the proteinase or chemical cleavage agent. The more peptide masses that match the predicted masses, the more certain one is of the likelihood that the protein is identified. Clearly, high-mass accuracy is required for this method to be of use. Sometimes no matches are found or the level of certainty is too low. Working with species whose genome is complete is a benefit in experiments of this nature.

Amounts of material required for the mass fingerprinting approach are dependent upon the sample and its preparation. In our laboratory, with our validated procedures, we are able to perform mass mapping/protein identification experiments with as little as 2 pmol (loaded on the gel) of sample. However, more sample increases the chances of a successful outcome. The success of any experiment will depend upon reasonable sample estimates and very careful sample handling. We recommend that you contact us before planning any mass mapping experiments to help you estimate the amount of protein and describe preferred handling techniques.

Several drawbacks to PMF need to be remembered when planning your experiment. One is the inability to specifically isolate the amino terminus of the protein. Therefore, amino acid sequence information about this region of the molecule is often difficult to obtain using mass spectrometry. Second, due to mass similarities of amino acids it is often difficult to correctly identify amino acids (especially during MS/MS or de novo experiments. For example, leucine and isoleucine have isobaric masses and cannot be differentiated by PMFor MS/MS. Other amino acids (for example, Glutamine and Lysine) have relatively small mass differences and can be mis-identified by these methods.

Correct identification of a recombinant protein (and sequence) can often be accomplished using mass spec and PMF alone, but sometimes direct sequencing by Edman techniques is required. Therefore, it is our preferred approach to combine the complementary techniques of mass spectrometric mapping and chemical protein sequencing (Edman degradation) to identify proteins. This is especially true when you are working in 'non-genome' species.

A detailed note can be obtained from this below site

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCQQFjAA&url=http%3A%2F%2Fbrc.se.fju.edu.tw%2Fplans%2Fpdf%2F53identify.pdf&ei=VFKNUK-ZEobjrAfp64A4&usg=AFQjCNGKv9hxbqqdrwdqO6DPuuJtgy0h1g&sig2=b-sRiKJT5Bbf7PyfRNhaag

structure and function determination

refer the following site

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CDMQFjAC&url=http%3A%2F%2Fwww.eolss.net%2FSample-Chapters%2FC03%2FE6-54-02-02.pdf&ei=8lKNUOvhFsSPrgfTm4GIAg&usg=AFQjCNFUyVI0htXRNMa7X4W429-M4_48HA&sig2=6YaTPy83x_G32OZ5YiwQmw

Structure comparison methods

refer the following site

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDIQFjAB&url=http%3A%2F%2Fwww.cs.duke.edu%2Fcourses%2Ffall08%2Fcps234%2Fprojects%2Fpaper.pdf&ei=oFONUJCZCYbkrAeBlIG4BA&usg=AFQjCNE4f5l89ETryUz7Kc2lGzWvz_vXWQ&sig2=VADu9mJ4e3_V2HNwVTTXBQ

Prediction of secondary structure from sequence

Expression presently supports nine different secondary structure prediction algorithms. All computations are performed via the Network Protein Sequence Analysis server (PBIL, France).

DPM

The DPM (Double Prediction Method) algorithm uses two approaches to produce the final result - first it predicts the protein structural class and then the secondary structure for the sequence (Deleage and Roux, 1987). The DPM method can be divided into four steps:

Prediction of the structural class of a protein from AA composition (Nakashima et al., 1986).
Preliminary secondary structure estimation from a simple algorithm.
Comparison between the two independent predictions.
Optimisation of parameters and determination of secondary structure.

DSC

DSC (Discrimination of protein Secondary structure Class) is based on dividing secondary structure prediction into the basic concepts and then use of simple and linear statistical methods to combine the concepts for prediction (King and Sternberg, 1996). This makes the prediction method comprehensible and allows the relative importance of the different sources of information used to be measured.

At NPS@, a BLASTP search of your sequence is performed against the SWISS-PROT database. These results are filtered and then aligned by CLUSTALW. The resulting alignment is the input for DSC.

GORIV

GOR IV is the fourth version of GOR secondary structure prediction methods based on the information theory (Garnier et al., 1996). There is no defined decision constant. GOR IV uses all possible pair frequencies within the window of 17 amino acid residues. After crossvalidation on a data base of 267 proteins, the version IV of GOR has a mean accuracy of 64.4% for a three state prediction (Q3).

HNN

The HNN (Hierarchical Neural Network) prediction method can be seen as an improvement on the famous classifier developed by Qian and Sejnowski, and derived from the system NETtalk (Guermeur). As its predecessor, it is made up of two networks: a sequence-to-structure network and a structure-to-structure network. The prediction is thus only based on local information. The improvements mainly deal with two points:

Technical tricks (recurrent connections, shared weights etc.) have been used to increase the context on which the prediction is made and concomitantly decrease by two orders of magnitude the number of parameters (weights).
Physico-chemical data have been explicitly incorporated in the predictors used by the structure-to-structure network.

These modifications have significantly improved the error in generalization.

MLRC

MLRC (Multivariate Linear Regression Combination) is a secondary structure prediction method which combines GOR4, SIMPA96 and SOPMA (Guermeur et al., 1999). It post-processes the outputs of protein secondary structure prediction methods and generates class posterior probability estimates. Experimental results establish that it can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are largely different.

Note: The MLRC algorithm may take several minutes to compute larger sequences (>500 amino acids).

PHD

PHD are neural network systems (a sequence-to-structure level and a structure-structure level) to predict secondary structure (PHDsec), relative solvent accessibility (PHDacc) and transmembrane helices (PHDhtm) (Rost and Sander, 1993). The NPS@ server only uses PHDsec. PHDsec focuses on predicting hydrogen bonds. The procedure essentially involves executing a BLASTP search of your sequence, filtering these results and aligning them with CLUSTALW, then using the multiple alignment as the input of the neural network. The PHD prediction done with NPS@ is better than the PHD prediction on the single sequence. But it's not exactly the same and could be a little bit less accurate than the PredictProtein one.

Note: The PHD algorithm may take several minutes to compute larger sequences (>500 amino acids).

Predator

PREDATOR is a secondary structure prediction method based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence (Frishman and Argos, 1996).

SIMPA96

SIMPA96 is a nearest neighbor secondary structure prediction method (Levin, 1997). It's based on the homologue method described by Levin et al. (1986).

SOMPA

SOPMA (Self-Optimized Prediction Method with Alignment) is based on the homologue method of Levin et al. (1986). The improvement takes place in the fact that SOPMA takes into account information from an alignment of sequences belonging to the same family (Geourjon and Deleage, 1995).

UNIT V: PROTEOMICS II

Protein homology modeling

Homology models are useful to get a rough idea where the alpha carbons of key residues sit the folded protein. They can guide mutagenesis experiments, or hypotheses about structure-function relationships.

Homology models are unreliable in predicting the conformations of insertions or deletions, i.e. portions of the sequence that don't align with the sequence of the template, as well as the details of sidechain positions.

Homology models are unlikely to be useful in modeling ligand docking (drug design) unless the sequence identity with the template is >70%, and even then, less reliable than an empirical crystallographic or NMR structure.

SWISS-MODEL makes it quick and easy to submit a target sequence and get back an automatically generated homology model, provided an empirical structure with >30% sequence identity exists to use as a template. (The template will be identified automatically, and the alignment made automatically.) These automated models may be useful, but will sometimes have errors that could be avoided if manual adjustments are made to the sequence alignment by an expert. Learning to optimise your models manually would take some time.

1. What is homology modeling?

Suppose you want to know the 3D structure of a target protein that has not been solved empirically by X-ray crystallography or NMR. You have only the sequence. If an empirically determined 3D structure is available for a sufficiently similar protein (50% or better sequence identity would be good), you can use software that arranges the backbone of your sequence identically to this template. This is called "homology modeling". It is, at best, moderately accurate for the positions of alpha carbons in the 3D structure, in regions where the sequence identity is high. It is inaccurate for the details of sidechain positions, and for inserted loops with no matching sequence in the solved structure.

A homology modeling routine needs three items of input:

The sequence of the protein with unknown 3D structure, the "target sequence".
A 3D template is chosen by virtue of having the highest sequence identity with the target sequence. The 3D structure of the template must be determined by reliable empirical methods such as crystallography or NMR, and is typically a published atomic coordinate "PDB" file from the Protein Data Bank.
An alignment between the target sequence and the template sequence.

First, the homology modeling routine arranges the backbone identically to that of the template. This means that not only the positions of alpha carbons, but also the phi and psi angles and secondary structure, are made identical to the template. Next, the more sophisticated homology modeling packages adjust sidechain positions to minimize collisions, and may offer further energy minimization or molecular dynamics in an attempt to improve the model.

2. How good can homology modeling be?

Two proteins with a high level of sequence identity, and very similar secondary and tertiary structure (identical "folds"), will nevertheless have not exactly identical backbone conformations, even when determined under comparable conditions. A homology model can be expected to differ from the real structure to at least this extent. Overall differences in protein backbone structures are quantitated with the root mean square deviation of the positions of alpha carbons, or rmsd. "A model can be considered 'accurate enough' or as 'accurate as you can get' when its rmsd is within the spread of deviations observed for experimental structures displaying a similar sequence identity level as the target and template sequences" (Schwede et al., 3DCrunch). How big is this spread?

The 3DCrunch project used the SWISS-MODEL routines to homology model all sequences in the Swiss-Prot database for which appropriate templates exist. In the same project, in order to assess the accuracy of homology modeling, 1,200 models were made for previously solved structures (see Reliability of models generated by SWISS-MODEL). This enabled comparisons of homology models with empirical structures for the same sequence, where the homology model was made using a template with the most similar sequence available, other than the target sequence itself.

To provide a frame of reference for rmsd values, note that up to 0.5 Å rmsd of alpha carbons occurs in independent determinations of the same protein (Chothia and Lesk, 1996). Proteins with 50% sequence identityhave on average 1 Å rmsd ( Schwede et al., 3DCrunch). The values given above are for X-ray crystallographic determinations; NMR determinations have rmsd's several fold higher.

If we define a "highly successful homology model" as one having <=2 Å rmsd from the empirical structure, then the template must have >=60% sequence identity with the target for a success rate >70%. Even at high sequence identities (60%-95%), as many as one in ten homology models have an rmsd >5 Å vs. the empirical structure. Below 40% sequence identity, serious errors begin to appear more often. For the complete distribution of results, see Reliability of models generated by SWISS-MODEL

3. The importance of the sequence alignment.

The homology modeling routine will proceed to arrange the backbone of the target sequence according to that of the template, using the sequence alignment to decide where to position each residue. Therefore, the quality of the sequence alignment is of crucial importance. Misplaced indels (gaps representing insertions or deletions) will cause residues to be misplaced in space. Although there are many routines that will do alignments automatically, careful inspection and adjustment by someone with specialized training may improve the quality of the alignment, and hence, of the homology model.

4. Databases of Ready-Made Homology Models.

ModBase is worth checking because if you find a model, it provides a PIR-formatted sequence alignment ready to paste into Protein Explorer's MSA3D. 3DCrunch does not provide this. It might also be worth comparing models of the same sequence from ModBase vs. SWISS-MODEL because they use different algorithms.

It is quicker and easier to submit your sequence to SWISS-MODEL than to try to find a model in 3DCrunch, and you'll get the same "first approach" results either way. 3DCrunch appears not to have been updated since 1998, and only sequences in Swiss-Prot/TrEMBL were modeled, whereas you can submit any sequence to SWISS-MODEL.

ModBase (Andrej Sali et al., Rockefeller U, NY). Over 200,000 models, last updated July 2000. If your search finds models, click on the icon in the "Template-based view" column to get the model. If you find a model here the PIR alignment link will generate the alignment of the template with the target ready to paste into Protein Explorer's MSA3D. This will color the model by identity/similarity/difference from the template. Inserted loops are colored 'different'
3DCrunch (Manuel Peitsch et al., GlaxoWellcome). 64,000 models made in 1998 from sequences in Swiss-Prot/TrEMBL using the SWISS-MODEL routines. Particularly interesting are the control data,Reliability of models generated by SWISS-MODEL.

5. Introductions to the Principles of Homology Modeling.

Homology Modeling David R. Bevan, Virginia Tech.
Professional Gambling, R. Rodriguez, Gert Vriend, EMBL Heidelberg, Germany (since 2000, Univ. Nijmegen, Netherlands).
How to evaluate the quality of a model, Torsten Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.

6. How To Do Homology Modeling.

Molecular Modeling for Beginners by Gale Rhodes, Univ. Southern Maine, includes an introduction to DeepView, and a superb tutorial on homology modeling (look through the left index frame for the link to Homology Modeling).

This is the best starting place for beginners who want to learn about homology modeling. It guides you through the use of NCBI Entrez to find a sequence in the human genome, using SWISS-MODEL to get a homology model, and most importantly, using DeepView to visualize and evaluate the model.

DeepView (also known as SwissPDBViewer) is an excellent free modeling program by Nicolas Guex, Alexandre Diemand, Torsten Schwede & Manuel C. Peitsch at GlaxoWellcome. DeepView resources are indexed at molvisindex.org. DeepView comes with a built-in tutorial on homology modeling. This tutorial walks you through the steps but does not explain in detail what the program is doing. The SWISS-MODEL homology modeling server returns a DeepView-ready PDB file, with the model and each template in a different layer. DeepView has automated routines to display the sequence alignment, adjust gap positions, show energetically unfavorable regions of the alignment, find and fix sidechain clashes. It is very powerful but the many keyboard shortcuts and hard-to-find options make it a challenge to use effectively on an occasional basis.

7. Homology Modeling Servers and Software.

SWISS-MODEL, An Automated Comparative Protein Modelling Server, Torsten Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.

You just submit your sequence! It finds the best template (if one exists), aligns the sequences, and returns the PDB file to you automatically. You can choose whether to get back a 3D alignment of the model with the template(s), or just the model.

DeepView: Integrated with SWISS-MODEL. See above under Tutorials.
WHAT IF Web Interface (click on Build/check/repair model). Roland Krause, Gert Vriend, Univ. Nijmegen (in USA, say "Nigh-maygen"), Netherlands.

To use the WHAT IF model builder, you must choose your template and prepare your alignment first.

The following opinion was sent to the Protein Data Bank Discussion Forum in November, 1999 by Gert Vriend:

One of the goals of the WHAT IF homology modelling module is to produce models that are as good as possible. Another goal is to make errors as obvious as possible when they are unavoidable. Todays modelling technology (which includes MD programs) cannot yet predict where a loop will find its new position if it is disturbed by for example mutations or by binding a ligand or sugar. In WHAT IF we therefore decided not to make a random motion (and without insult meant to my friends in the MD world, optimising a mutated loop by MD invariably looks like a random motion) but just to leave the backbone as 'untouched' as possible. The results in the biannual CASP competition are every round making more clear that this is (still) the best strategy. However, not moving the backbone accounts for about 2/3-rd of the total modelling error in WHAT IF's models.

There are several other homology modeling servers, but they appear less fully developed than the two above.

Protein threading

Detailed ppt is on

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCwQFjAB&url=http%3A%2F%2Fwww.biostat.wisc.edu%2Fbmi776%2Flectures%2Fthreading.pdf&ei=eUaNUMSBG8XZrQe-oYCIDA&usg=AFQjCNFXue8RQ7xWdaSQPwUGnESvr4d-ig&sig2=uEpFnGyu2MBCBxOuzVk1Xw

Protein ab initio structure prediction

Introduction:

The biological role of a protein is determined by its function, which is in turn largely

determined by its structure. Thus there is enormous benefit in knowing the three

dimensional structures of all the proteins. Although more and more structures are

determined experimentally at an accelerated rate, it is simply not possible to determine all

the protein structures from experiments. As more and more protein sequences are

determined, there is pressing need for predicting protein structures computationally.

Decades of intense research in this area brought about huge progress in our ability to

predict protein structures from sequences only. The protein structure prediction methods

can be broadly divided into three categories:

1) homology modeling,

2) threading or fold

recognition, and

3) Ab Initio. Essentially, the classification reflects the degree to which

different methods utilize the information content available from the known structure

database. In the following, I will briefly discuss each kind of methods and their accuracy,

applicability and shortcomings. Possible improvements to protein structure prediction are

also discussed.

Comparative homology modeling:

So far protein prediction methods based on homology have been the most successful.

Homology modeling is based on the notion that new proteins evolve gradually from

existing ones by amino acid substitution, addition, and/or deletion and that the 3D

structures and functions are often strongly conserved during this process. Many proteins

thus share similar functions and structures and there are usually strong sequence

similarities among the structurally similar proteins. Strong sequence similarity often

indicates strong structure similarity, although the opposite is not necessarily true.

Homology modeling tries to identify structures similar to the target protein through

sequence comparison. The quality of homology modeling depends on whether these

exists one or more protein structures in the protein structure databases that show

significant sequence similarity to the target sequence.

There are usually four steps in homology based protein structure prediction methods:

(1)identify one or more suitable structural templates from the known protein structure

databases;

(2) align the target sequence to the structural template;

(3) build the backbone

from the alignment, including the loop region and any region that is significantly

different from the template; and

(4) place the side-chains. The first two steps, identification of structural templates and alignment of the target sequence onto the parent structures, are usually related. Sequence comparison methods determine sequence similarity by aligning the sequences optimally. The aligned residuals of the structure templates are used to construct the structural model in the second step.

The quality of the sequence comparison thus not only determines whether a suitable structural template can be found but also the quality of the alignment between the target sequence and the parent structure, which in turn determines the accuracy of the structural model. Of critical importance is the ability for the sequence comparison to detect remote homologues and to correctly align the target sequence to and parent structure. In the following discuss the various sequence comparison methods in relation to homology modeling and their range of applicability, accuracy and shortcomings.

For comparative modeling, local sequence comparison methods are usually used since the

sequence similarity is most likely over segments of the two sequences. The local

sequence comparison can either be pair wise or profile based. Pair wise comparisons,

such as the widely used BLAST (Altschul, 1990) in the early days, can detect sequence

similarities better than 30%. A number of tools have also been developed to detect weak

homology relationships. Methods like profile (Gribskov, 1987) and HMM (Krogh, 1996)

use a statistical profile of a protein family. To further increase the chance of detecting

remote homologues, PSI-BLAST (Altschul, 1997) and SAM-T98 (Karplus, 1998) build

the profile or HMM by searching the database iteratively until no new hits are found.

Methods such as PSI-BLAST encode the information about a whole protein family for

the target sequence in a model to increase the chance of detecting remote homologies. To

further increase the detection sensitivity, the sequences in the structure database can also

be encoded in profiles. This forms the basis of the profile-profile based comparison

methods (Koehl, 2002). With low sequence identities (<20%), profile-profile methods

clearly outperform the other two kinds of methods (Sauder, 2000): profile-profile

methods identified more than 90% of homologous pairs, determined from structurestructure

similarity comparison, with sequence identity better than 10% and an impressive 38% even for cases with sequence identities between 5% and 9%.

The structure models are constructed from the residuals of the structure template that are

aligned to the target sequence in the sequence comparison. The quality of this alignment

thus is critical for the accuracy achievable. The aligned residues from sequence

comparison are generally different from that from structure-structure comparison though,

especially when the sequence identity is low. To assess the ability of the sequence

comparison methods to align the sequences correctly, it is instructive to compare the

sequence-sequence alignment to the structure-structure alignment of the same pair of

proteins. To determine how well the different similarity search methods can detect

remote homologies and assess their ability in correctly aligning the sequences, Sauder et

al. (Sauder, 2000) compared various sequence alignment methods to the CE (Shindyalov,

1998) structure alignment of the SCOP (Murzin, 1995) protein structures. For sequence

identities less than 30%, profile-based comparison methods, such as PSI-BLAST and

profile-profile comparison, are all obviously better than the pair wise BLAST method.

For example, at 10-15% sequence identity, BLAST aligns only 20% correctly while PSIBLAST and profile-profile comparison can correctly align 40% and 48% respectively.

This also indicates that there is still large room for improvement in correctly aligning the

target sequence to the target structure.

One indication of the accuracy of comparative modeling is the sequence identity between

the target and the template. It is believed that if two protein sequences have 50% or

higher sequence identity, then the RMSD of the alignable potion between the two

structures will normally be less than 1_ (Gerstein, 1998). In the so-called “twilight zone”

(Doolittle, 1986), with sequence identity between 20%~30%, 95% of the sequences with

this level of identity have different structures though (Rost, 1999). When a structure

template can indeed be found within the known protein structure databases in such cases,

the backbone RMSD can be expected to be no better than 2_ (Chung, 1996). Structurally

similar proteins can have low sequence identities in the 8~10% range (the midnight zone,

Rost, 1997) and can still be identified with sensitive profile-profile based comparison, but

the RMSD can be as large as 3~6 _. The error largely comes form the misalignment from

sequence comparison. At such low sequence identity, comparison method that can detect

the remote homology as well as align the sequences close to the optimal from structurestructure alignment will be desirable.

Threading or fold recognition:

For evolutionally remotely related proteins, even if the sequence similarity is difficult to

detect with sequence comparison methods, there could still be identifiable structural

similarity. Structure alignments has been shown to be able to identify homologous

protein pairs with sequence similarities less than 10%. (Gerstein, 1998; Brenner, 1998;

Rost, 1997). When sequence comparison based methods are no longer sensitive enough

to recognize the correct fold for the target sequence, fold recognition or threading can

still be used to assign the correct fold to the target sequence.

Threading or fold recognition is the method by which a library of unique or

representative structures is searched for structure analogs to the target sequence, and is

based on the theory that there may be only a limited number of distinct protein folds. For

example, in an early paper, Chothia postulated that the number of unique protein folds

would be on the order of only about 1000 unique protein folds (Chothia, 1992). In

another estimation, the number of distinct domains and folds were placed around 7000

(Orengo et. al., 1994). Even though the number of new structures solved has been

increasing at an accelerated rate (close to 3000 structures solved in 2002), the proportion

of new folds, as determined by the CE algorithm (http://cl.sdsc.edu/ce.html), to the total

number of new structures solved in a given year decreased from an average of ca. 30% in

the 80’s steadily down to only ca. 8% in year 2001 (http://www.rcsb.org/pdb/holdings.html). It is reasonable to expect that as more and more protein structures are determined experimentally, we will be able to find close structure analogues in the databases of known structures for almost any protein sequence in the near future.

Threading or fold recognition involves similar steps as comparative modeling. The

difference is in the fold identification step. First of all, a structure library needs to be

defined. The library can include whole chains, domains, or even conserved protein cores.

Once the library is defined, the target sequence will be fitted to each library entry and a

energy function is used to evaluate the fit between the target sequence and the library

entries to determine the best possible templates. Depending on the algorithms to align the

target sequence with the folds and the energy functions to determine the best fits, the

threading methods can roughly be divided into four classes. (Jones, 2001)

(1) The earliest threading methods used the environment of each residue in the structure as the energy function and dynamical programming to evaluate the fit and the alignment

(Bowie, 1991).

(2) Instead of using overly simplified residual environment as the energy function, statistically derived pair wise interaction potentials (Sippl, 1990) between residue pairs or atom pairs can be used to evaluate the best possible fits between the target sequence and library folds (Jones, 1992). In this method, for efficient optimal alignment between the target sequence and the folds, the potential for residual i is obtained by summing over all the pair wise potentials involving i, and then “double dynamical programming” (Taylor, 1989; Jones, 1998) method can be used. (3) The third kind of methods does not use any explicit energy function at all. Instead, secondary structures and accessibility of each residue are predicted first and the target sequence and library folds are encoded into strings for the purpose of sequence-structure alignment.

(4) Finally, sequence similarity and threading can be combined for fold recognition. For

large-scale genome wise protein structure prediction, sequence similarity can be first used

for the initial alignments and the alignments can be evaluated by threading methods

(Jones, 1999).

The threading methods are limited by the high computational cost since each entry in the

whole library of thousands of possible folds needs to be aligned in all possible ways to

select the fold(s). Another major bottleneck is the energy function used for the evaluation

of the alignment. As these functions are drastically simplified for efficient evaluation, it

is not reasonable to expect to be able to find the correct folds in all cases with a single

form of energy function. Nevertheless, with the current functions, it is possible to reduce

the thousands of possible folds to only a few. Similar to the comparative modeling case,

for sequence similarities at protein family level, threading can produce alignments that

are accurate to 1 to 3 _, or in the case with low sequence similarity at the super-family

level, alignment at the range of 3 to 6_ can still be expected. As more protein structures

are determined and sequence comparison methods improve, more and more target

sequences fold assignment can be achieved by comparative modeling though.

Worth mentioning is the threading program PROSPECT (Xu, 2001), which performed

best in its category in the CASP4 competition. What is unique to PROSPECT is that it is

designed to find the globally optimal sequence-structure alignment for the given form of

energy function (Xu, 2000). The divide-and-conquer algorithm is used to speed up the

calculation by explicitly avoiding the conformation search space that is shown not to

contain the optimal alignment (Xu, 1998). In several cases that have sequence identity as

low as 17%, perfect sequence-structure alignment is still achieved for the alignable

potions between the target and template structures. Even in cases that no fold templates

exist for the target sequence, important features of the structure are still recognized

through threading the target sequence to the structures.

Ab Initio methods:

When no suitable structure templates can be found, Ab Initio methods can be used to

predict the protein structure from the sequence information only. Common to all Ab

Initio methods are:

1) Suitably defined protein representation and corresponding protein

conformation space in that representation;

2) Energy functions compatible with the protein representation;

3) Efficient and reliable algorithms to search the conformational space to minimize the energy function. The conformations that minimize the energy function are taken to be the structures that the protein is likely to adopt at native conditions. The folding of the protein sequence is ultimately dictated by the physical forces acting on the atoms of the protein and thus the most accurate way of formulating the protein folding or structure prediction problem is in terms of all-atom model subject to the physical forces. Unfortunately the complexity of such a representation makes the solution simply impossible with today’s computational capacity. For practical reasons, most Ab Initio prediction methods use reduced representations of the protein to limit the conformational space to manageable size and use empirical energy functions that capture the most important interactions that drive the folding of the protein sequence toward the native structures.

Currently, many Ab Initio methods can predict large contiguous segments of the protein to accuracy within 6_ of RMSD and there are several reviews that highlight the success and failure of the current Ab Initio methods. (Hardin, 2002 and references therein). The ROSETTA Ab Initio method performed better than the other Ab Initio methods in the recent CASP4 meeting and there are extensive literature (Bonneau, 2001; Simons, 2001; Bonneau, 2001) covering this method so we concentrate on a brief discussion of method used in ROSETTA. The ROSETTA method also illustrates many features and techniques that are common to the majority of the Ab Initio methods based on reduced representation of the protein and empirical potentials. Discussion of other methods with empirical potentials can be found in Hardin’s review. (Hardin, 2002)

The ROSETTA method, like many others, uses a reduced representation of the protein as

short segments. This representation can be attributed to the observation by Go (Go, 1983)

that local segments of the protein sequence have statistically important preferences for

specific local structures and that the tertiary structure has to be consistent with this

preference. In ROSETTA the protein is represented by short sequence segments and the

local structures they can adopt are assumed to be those found in all the known protein

structures. (Simons, 1997) The energy function is defined as the Bayesian probability of

structure/sequence matches and this forms the basis of the Monte Carlo sampling of the

reduced protein conformational space (Simons, 1997). The non-local potential, which

drives the protein toward compact folded structure, includes terms that favor paired

strands and buried hydrophobic residuals. The solvation effect can also be incorporated in

the energy function.

A problem intrinsic to the reduced representation of the protein and the simplified

empirical potential is that the energy function is not sensitive enough to differentiate the

correct native structures from conformations that are structurally close to the native state.

The energy landscape calculated from such energy functions will not be properly

funneled but flattened and caldera-like around the native structure. In fact, as the native

state is approached, the correlation between the calculated energy and the measure of

similarity between predicted and native structures are no longer valid. The usual practice

is then to produce a large number of decoy structures and then use various filtering and

clustering techniques to pick up the more native like structures. Filters can be used to

eliminate structures with poorly formed secondary structures and low contact orders

compared with that for sequences with compatible length (Bonneau 2001). The other

important technique is to use multiple sequences similar to the target sequence to

generate decoy structures. Structures thus generated usually form dense clusters that are

more compatible to the native structures of protein families of similar sequences than

those obtained from a single sequence only.

Many Ab Initio methods now can predict long segments of the protein sequence with

backbone atom RMSD less than 6 _. The predicted local structures are usually right, with

the correct contacts among residuals. One of the largest sources of errors was identified

to be in the contacts between distant residuals in the sequence as measured by the contact

order (CO). (Bonneau, 2002).

The most accurate and successful method so far has been comparative modeling based on sequence similarity comparison, especially when there exists a structure template with high

sequence identity to the target. One major progress in comparative modeling is the very

sensitive profile based sequence comparison method such as PSI-BLAST and profileprofile sequence comparison. Profile-profile based sequence comparison methods are usually superior in that such methods can pick up possible homologous structure templates even when the sequence identity is very low and that profile-profile comparison can align the sequence to the structure template more accurately, producing more accurate structure models. As more and more novel sequences are produced from the genome projects, the profile-based methods can be expected to become even more sensitive. Fold assignments that are traditionally accomplished from threading methods can be done with comparative modeling instead. On the other hand Ab Initio based methods can still be expected to play an important role in identifying new folds as the accuracy of these methods increase.

There is a wide range of possible applications for protein structure prediction, requiring

different accuracy of the predicted structures (Baker, 2001). For applications like

studying catalytic mechanisms and ligand docking in drug design, high accuracy

structures with RMSD within 1_ of the native structure is required. Low accuracy

structures with RMSD in the range of 1.5~3.5 _ for more than 80% of the sequence can

be used for tasks like fitting X-ray structures. Reliable functional annotation and active

site prediction can usually be achieved with accuracy of 4-8_ for over 80 amino acids,

which is well within the current capability of Ab Initio methods like Rosetta. When

structure templates with sequence identity over 50% can be found, the main chain atoms

can be modeled to 1_ RMSD, with the main error from the loop regions and side chains.

With sequence identities between 30%-50%, main chain accuracy of 1.5_ can be

expected. When the sequence identity is below 30%, the error in the aligned main chain

atoms can be estimated from the sequence difference. Simple linear relation has been

found between the structural difference and sequence difference if the sequence

difference is taken to be the average of that between the sets of sequences compatible to

the structures (Koehl 2002, Koehl 2002). A more serious problem for comparative

modeling in cases with low sequence identity is the false positives. With highly sensitive

profile-profile based methods, even if several structures may be identified to have

sufficient sequence similarity, it could happen that none of these structures is the correct

template for the target sequence. In fact, it has been shown that 95% of the sequence pairs

with 20~30% identity (the twilight zone) are not structurally similar (Rost, 1999). In such

cases the possible structure templates can be subject to further threading test for

validation.

Threading method has been shown to be able to avoid false positives

(Panchenko, 1999). Threading with the limited number of possible structure templates

avoid one of the computational bottlenecks in threading methods.

Further improvements in the predicted protein structures can be expected from several

fronts and I briefly discuss these possibilities.

(1) First and foremost, the largest improvements will come from more experimentally determined structures. As more and more protein structures are determined experimentally, it is conceivable that more and more target sequences will have compatible structures already deposited in the known structure database. This will increase not only the chance that the comparative modeling can assign the fold correctly but also the likelihood that the fold identified is more structurally similar to the target, thus increasing the accuracy of the structural model.

(2)Further improvement in the sequence-structure alignment can also improve the accuracy

of the structure model. The current sequence comparison methods can only align a fraction of the residuals that can be aligned in structure alignments (Sauder, 2000). Better-aligned residuals can undoubtedly improve the accuracy of the structure model. Probably there is a limit to which sequence comparison methods can align sequence to structure when the sequence identity is low. One possible way of improving sequencestructure alignment might be using threading based techniques to align the sequence to structures identified in comparative modeling. With better energy functions for evaluating the fit between sequence and target, this could be very effective.

(3) Refinements to the structure models generated from homology modeling, threading, or

even Ab Initio methods can be accomplished by molecular dynamics (MD) with accurate

all-atom physical potentials. The most severe obstacle of the application in MD in protein

structure prediction has been the long time it takes for the protein to fold from the completely unfolded states. This is probably due to the energy barriers encountered in the course of folding. If the simulation starts from the near native structures generated from the protein structure prediction methods, the MD simulation perhaps can reach the native structures much easily.

Protein design emphasis on structural Bioinformatics

There are many reasons to pursue the goal of protein design. In medicine and industry, the ability to precisely engineer protein hormones and enzymes to perform existing functions under a wider range of conditions, or to perform entirely new functions, has tremendous potential. Furthermore, in the case of rational protein design, the knowledge obtained is likely to be linked to a more complete understanding of the forces underlying protein folding, enabling more rapid interpretation of the wealth of genomic information being amassed. Advances in protein design may also make possible the construction of a range of other self organizing macromolecules. Although some steps have been taken towards the rational design of functional enzymes, such a goal lies some distance away. Currently, attention is focused on redesigning portions of proteins to insert particular motifs, increase stability or modify function. Examples include the engineering of metal-binding centers, reviewed recently by Hellinga, and the introduction of disulfide bonds. Theoretical work in the context of lattice models has also led to important insights. This work has been recently reviewed .

Attempts to design entire proteins de novo have been increasingly successful over the past decade. Early design efforts typically led to poorly characterizable states or molten globules, instead of a single target fold . Other difficulties became apparent when a designed a-helical dimer was shown to actually form a trimer. This and subsequent studies relied on largely qualitative examinations of the target molecule, making generalization to other targets difficult.

Energy expression

Atomistic protein design requires an energy expression or force-field to rank the desirability of each amino acid sequence for a particular backbone structure. Over the past decade, elements of a suitable energy expression for atomistic protein design have been suggested and explored. To avoid over-fitting and to focus on only the most important contributors, the energy expression should contain as few terms as possible while maintaining predictive power. Communication between theory and experiment is required to determine which energy terms to include, and the relative importance of the included terms. In a protein design cycle, an energy expression is used to generate sequences that are subsequently made in the laboratory. Alterations and additions to the energy expression are then considered which improve the correlation between the computed and experimentally determined properties of the sequences. The improved energy expression is then used to generate new sequences, completing the cycle.

Energy minimization

In order to experimentally test the energy expression, the minimum-energy sequence of the target backbone must be determined. In the simplest implementation, the energy of every possible sequence is calculated using the energy expression, and the lowest energy sequence is reported. The size of most problems of interest renders this exhaustive approach impractical. Ignoring the possibility of multiple conformations of each amino acid, allowing the 20 naturally occurring amino acids at every position of a 100 amino acid protein yields 10130 possible sequence solutions. Clearly, ingenious energy minimization techniques are necessary.

Published search algorithms, including self-consistent mean-field approaches, Monte Carlo techniques, neural networks and genetic algorithms, share the advantage of being able to sample large combinatorial space, but the disadvantage of not being guaranteed to find the global optimal solution. By contrast, dead-end elimination and branch-and-terminate (DB Gordon and SLM, unpublished data) are search algorithms that give a final solution that is guaranteed to be the global optimum, but which require the discretization of sidechain conformations into rotamers. Such requirements will be discussed below. Search algorithms have been recently reviewed.

Discretization of sidechain conformations

To place a reasonable limit on the complexity of the computation, the allowed sidechain conformations are typically chosen from a library of discrete possibilities, known as rotamers. This discretization is necessary for some efficient search algorithms to be applicable — in particular, the dead-end elimination theorem. Discretization of the sidechain conformations increases the likelihood of ‘false negative’ results. To be useful, atomistic protein design has only to output a subset of the sequences leading to the target fold, with simulation energies that correlate with their experimental stabilities. The simulation does not need to predict how well externally supplied sequences will fit the target fold. For example, the crystallographic

structure of the Streptococcal protein G B1 domain (GB1) shows Leu7 in an unusual conformation that does not appear in standard rotamer libraries. Therefore, an atomistic algorithm using such a library may not suggest leucine at position 7 in the top ranked sequences. The effect of the size of the rotamer library has also been considered; in general, the larger the library the better. If the library contains too many similar conformations of each amino acid, however, the energy landscape is flattened and energy minimization can be slow.

Residue classification

A reductionist approach to protein design, in which subsets of a protein are designed independently, has proven fruitful. Computational attempts to design protein cores date back many years. More recently, there have been attempts to design surfaces and boundary positions as well. The size of the design problem is reduced if only a subset of amino acid types need be considered in each of these three classes of residue positions. Protein cores are typically composed of hydrophobic amino acids, and protein surfaces are largely composed of hydrophilic amino acids, but the boundary residues must be selected from the full range of amino acids as these positions are observed to be both hydrophobic and hydrophilic. An automated way to classify residue positions is desirable, and a number of approaches have been described. The important components of the energy expression relevant to the core, surface and boundary will be discussed in the following sections.

The core

Early attention on the protein design problem focused on the generally hydrophobic cores of proteins. It is believed that the folding process is driven principally by hydrophobic collapse of the polypeptide, implying that a well designed hydrophobic core is crucial to the structure and stability of the protein. As might be expected, van der Waals forces (i.e., packing constraints) are crucial when designing the protein core. Models in which packing constraints are the only element of the energy expression are able to predict the stabilities of core mutations with high accuracy, when polar substitutions are not allowed. The importance of packing constraints can be determined by scaling the atomic van der Waals radii by a factor a. When a is varied to very high (>105%) or very low (<85%) values, implying too little or too much volume being packed into the available space, respectively, the resulting proteins exhibit unfolded or molten globule-like behavior. This is not surprising. Too much volume clearly requires the backbone to shift to accommodate the excess. Too little volume would either leave cavities in the core, which have been shown to destabilize proteins, or again force the backbone to shift to fill the cavity. When the protein backbone is significantly different from the model backbone, the model can no longer accurately predict the stability of the protein, and there may cease to be a single stable folded state. The optimal value of a was found to be 90%, implying that a slight over-packing of hydrophobic residues in the core can actually stabilize a designed protein. The benefit of using slightly diminished van der Waals radii can also be interpreted in terms of accommodating some backbone and rotamer flexibility. Consistent with the belief that the hydrophobic effect is a dominant cause of protein folding, the protein design cycle has been used to show that solvation effects also have an important role in the design of protein cores.

The hydrophobic effect is usually approximated as an energy benefit proportional to the amount of solventaccessible hydrophobic surface area that is buried upon folding. A penalty for burying polar area may also be included. Calculation of solvation energies is complicated by the need to construct the energy expression as a sum of two-body interactions. An entropic term has been tested, which may improve the correlation between predicted energy and biological activity. Such a term should in particular penalize methionine, as the loss of rotational freedom upon burial of this residue in a protein core can lead to destabilized proteins.

The surface

With the successful redesign of a range of protein cores, it is natural to consider the redesign of protein surfaces. Despite the incontrovertible role of the hydrophobic core in folding, the surface is also crucial to a protein’s structure and stability. The protein design cycle has been utilized to design surface sites, using as a starting point the energy expression determined from studies of protein cores. These studies showed the importance of electrostatics and hybridization-dependent hydrogen bonds . In the case of a-helical surfaces, no further energy terms are necessary to achieve good predictive ability. This is possibly because the sidechains that are better hydrogen-bond formers are also good a-helix formers, as quantified by a-helical propensity.

The above energy terms are not sufficient to design b-sheet surfaces, however. It may be necessary to directly bias the energy expression towards those sidechains with good b-sheet propensities. This is physically justifiable because common energy expressions do not otherwise include sidechain self-energies, which must at some level lead to propensities. It is also possible that a main source of b-sheet stability is to be found elsewhere, for example, in the hydrogen bonds that cause alignment with neighboring b strands. In the case of antiparallel b strands, the turn joining the two strands has an important role. Modifying the component residues of the turn can seriously affect protein stability. In the case

of noncontinuous strands, it has been suggested that small clusters of hydrophobic area on the surface may help to set the register. The hydrophobic effect may drive neighboring

strands to align in such a way as to bury as much of the exposed hydrophobic area as possible, for example, by covering it with long amphiphilic sidechains.

The boundary

Some residues cannot be easily classified as core or surface constituents. Depending on the sidechain orientation they can interact with either the core of the protein or with the solvent. One such example is Trp43 of GB1, which is predicted by modeling to rotate out into the solvent when nearby core residues are replaced with larger sidechains. Such unfavorable behavior can be attenuated by a hydrophobic exposure penalty.Recent work has shown that the design of boundary residues can lead to impressively enhanced stability. Just four boundary-site mutations in the 56-residue GB1 improve the stability from 3.3 kcal/mol to 7.1 kcal/mol at 50°C, converting a mesophilic protein into a hyperthermophilic protein.

Full de novo sequence design

To date there exists only a single example of a complete sequence calculation in which the structure of the designed protein was experimentally shown to achieve the design target. This calculation included one core position, seven boundary positions and 18 surface positions, leading to a total of 1027 possible sequence solutions. The success of this design effort underscores the power of computational approaches.

Backbone

Most atomistic protein design efforts require a fixed backbone. The calculation is performed under the assumption that the target backbone is precisely the backbone that will be achieved by the computed sequence. Fortunately, alterations in the backbone do not necessarily lead to large changes in the accessible sequence space. In one study, a 2 Å root mean square deviation (rmsd) in the backbone led to only a 0.5 Å rmsd in predicted sidechain conformations. Backbone flexibility can be modeled by using a softer van der Waals potential — in other words, giving the modeled atoms a fuzzy edge. This effect can be obtained by using reduced atomic radii, which has been shown to improve the stability of

designed proteins.

Protein backbone movements may be incorporated if the backbone is parameterizable, although to keep the calculation tractable, the number of sidechain rotamer combinations may be limited. A coiled-coil with righthanded superhelical twist, the backbone of which was necessarily designed de novo, has recently been reported, where 216 amino acid sequences were considered.

Negative design

The importance of negative design is the subject of much discussion. Recent work by Hellinga highlights the importance of this issue in computational protein design.

The inverse-folding design method determines the sequence of amino acids with the lowest energy when threaded onto the target backbone. It is conceivable that in some cases the computed sequence may actually prefer to fold to a different target structure, and that a sequence with a slightly higher computed energy would fold to the desired target. Unfortunately, knowledge of which structure will be adopted by the computed sequence requires a solution to the protein folding problem. Lattice models consisting of only two amino acid types can, however, be used to perform both sequence design and fold prediction. In this context, proposals to include non-thermodynamic potential functions aimed at addressing negative design issues have been developed. The hydrophobic exposure penalty is one example of negative design that improves predictive power. Despite the power of lattice model simulations, it has been suggested that the design procedure may be qualitatively different in such binary patterned systems.