Protein homology modeling
Homology models are useful for getting a rough idea of where
the alpha carbons of key residues sit in the folded protein. They can guide
mutagenesis experiments, or hypotheses about structure-function relationships.
Homology models are unreliable in predicting the conformations of insertions or
deletions, i.e. portions of the sequence that don't align with the sequence of
the template, as well as the details of sidechain positions.
Homology models
are unlikely to be useful in modeling ligand docking (drug design) unless the
sequence identity with the template is >70%, and even then they are less reliable
than an empirical crystallographic or NMR structure.
SWISS-MODEL makes it quick and easy to submit a
target sequence and get back an automatically generated homology model, provided
an empirical structure with >30% sequence identity exists to use as a
template. (The template will be identified automatically, and the alignment
made automatically.) These automated models may be useful, but will sometimes
have errors that could be avoided if manual adjustments are made to the
sequence alignment by an expert. Learning to optimise your models manually
would take some time.
Suppose
you want to know the 3D structure of a target protein that has
not been solved
empirically by X-ray crystallography or NMR. You have only the
sequence. If an empirically determined 3D structure is available for a
sufficiently similar protein (50% or better sequence identity would be good),
you can use software that arranges the backbone of your sequence identically to
this template. This is called "homology modeling". It is,
at best, moderately accurate for the positions of alpha carbons in the 3D
structure, in regions where the sequence identity is high. It is inaccurate for
the details of sidechain positions, and for inserted loops with no matching
sequence in the solved structure.
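As a rough illustration of what "sequence identity" means in practice, the sketch below counts identical residues over the aligned (non-gap) columns of a pairwise alignment. The function name, the example sequences, and the choice of denominator (aligned columns only) are assumptions made for illustration; different programs normalize identity differently.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # indel column: not counted in this convention
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


if __name__ == "__main__":
    # Hypothetical alignment, for illustration only.
    print(percent_identity("MKT-AYIAKQR", "MKTLAYLAK-R"))
```

Counting gap columns in the denominator would give a lower value for the same alignment, so it is worth checking which convention a given server reports.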
A
homology modeling routine needs three items of input:
- The sequence of the protein
with unknown 3D structure, the "target sequence".
- A 3D template is
chosen by virtue of having the highest sequence identity with the target
sequence. The 3D structure of the template must be determined by reliable
empirical methods such as crystallography or NMR, and is typically a
published atomic coordinate "PDB" file from the Protein Data
Bank.
- An alignment between
the target sequence and the template sequence.
First, the homology modeling routine arranges the backbone identically
to that of the template. This means that not only the positions of alpha
carbons, but also the phi and psi angles and secondary structure, are made
identical to the template. Next, the more sophisticated homology modeling
packages adjust sidechain positions to minimize collisions, and may offer
further energy minimization or molecular dynamics in an attempt to improve the
model.
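The first step described above, copying the template backbone onto the target through the alignment, can be sketched as follows. This is a minimal illustration, not the procedure of any particular package: the function name, the alpha-carbon-only representation, and the toy coordinates are all assumptions. Target residues that fall in insertions (no template residue in that column) receive no coordinates, which is why such loops are poorly modeled.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def copy_backbone(aligned_target: str, aligned_template: str,
                  template_ca: List[Coord]) -> List[Optional[Coord]]:
    """Return one alpha-carbon coordinate (or None) per target residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    if sum(c != "-" for c in aligned_template) != len(template_ca):
        raise ValueError("need one coordinate per template residue")
    model: List[Optional[Coord]] = []
    t_idx = 0  # index of the next template residue
    for a, b in zip(aligned_target, aligned_template):
        coord = None
        if b != "-":
            coord = template_ca[t_idx]
            t_idx += 1
        if a != "-":
            # Aligned residue inherits the template position;
            # an inserted residue (template gap) gets None.
            model.append(coord)
    return model


if __name__ == "__main__":
    # Hypothetical four-residue template laid out along the x axis.
    ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(copy_backbone("AC-DE", "A-GDE", ca))
```

Moving a gap by one column changes which template position each downstream residue inherits, which is why a misplaced indel displaces residues in space.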
2. How
good can homology modeling be?
Two
proteins with a high level of sequence identity, and very similar secondary and
tertiary structure (identical "folds"), will nevertheless have not
exactly identical backbone conformations, even when determined under comparable
conditions. A homology model can be expected to differ from the real structure
to at least this extent. Overall differences in protein backbone structures are
quantitated with the root mean square deviation of the positions of alpha
carbons, or rmsd. "A model can be considered 'accurate enough'
or as 'accurate as you can get' when its rmsd is within the spread of
deviations observed for experimental structures displaying a similar sequence
identity level as the target and template sequences" (Schwede et
al., 3DCrunch). How big is this spread?
The 3DCrunch project used
the SWISS-MODEL routines to homology model all sequences in the Swiss-Prot database for which
appropriate templates exist. In the same project, in order to assess the
accuracy of homology modeling, 1,200 models were made for previously solved
structures (see Reliability
of models generated by SWISS-MODEL). This enabled comparisons of
homology models with empirical structures for the same sequence, where the
homology model was made using a template with the most similar sequence
available, other than the target sequence itself.
To
provide a frame of reference for rmsd values, note that up to 0.5 Å rmsd
of alpha carbons occurs in independent determinations of the same protein (Chothia and
Lesk, 1996). Proteins with 50% sequence identity have on
average 1 Å rmsd (Schwede et
al., 3DCrunch). The values given above are for X-ray
crystallographic determinations; NMR determinations have rmsd's several fold
higher.
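For reference, the rmsd of alpha carbons quoted above can be computed as in the sketch below. It assumes the two structures have already been optimally superposed and that the coordinate lists pair up equivalent residues; the superposition step itself is not shown, and the function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root mean square deviation between paired, pre-superposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Every atom displaced by exactly 1 Å along z.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(ca_rmsd(a, b))
```

Because the deviations are squared before averaging, a few badly placed residues (for example a misbuilt loop) can dominate the value.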
If we
define a "highly successful homology model" as one having <=2
Å rmsd from the empirical structure, then the template must have >=60%
sequence identity with the target for a success rate >70%.
Even at high sequence identities (60%-95%), as many as one in ten
homology models have an rmsd >5 Å vs. the empirical structure.
Below 40% sequence identity, serious errors begin to appear more often. For the
complete distribution of results, see Reliability
of models generated by SWISS-MODEL.
3. The
importance of the sequence alignment.
The
homology modeling routine will proceed to arrange the backbone of the target
sequence according to that of the template, using the sequence
alignment to decide where to position each residue. Therefore, the quality
of the sequence alignment is of crucial importance. Misplaced indels (gaps
representing insertions or deletions) will cause residues to be misplaced in
space. Although there are many routines that will do alignments automatically,
careful inspection and adjustment by someone with specialized training may
improve the quality of the alignment, and hence, of the homology model.
4.
Databases of Ready-Made Homology Models.
ModBase
is worth checking because if you find a model, it provides a PIR-formatted
sequence alignment ready to paste into Protein Explorer's MSA3D.
3DCrunch does not provide this. It might also be worth comparing models of the
same sequence from ModBase vs. SWISS-MODEL because they use different
algorithms.
It is
quicker and easier to submit your sequence to SWISS-MODEL than to
try to find a model in 3DCrunch, and you'll get the same "first
approach" results either way. 3DCrunch appears not to have been updated
since 1998, and only sequences in Swiss-Prot/TrEMBL were modeled, whereas you
can submit any sequence to SWISS-MODEL.
- ModBase (Andrej Sali et al., Rockefeller U, NY). Over 200,000 models, last updated July 2000. If your search finds models, click on the icon in the "Template-based view" column to get the model. If you find a model here the PIR alignment link will generate the alignment of the template with the target ready to paste into Protein Explorer's MSA3D. This will color the model by identity/similarity/difference from the template. Inserted loops are colored 'different'
- 3DCrunch (Manuel
Peitsch et al., GlaxoWellcome). 64,000 models made in 1998
from sequences in Swiss-Prot/TrEMBL using the SWISS-MODEL routines.
Particularly interesting are the control data, Reliability
of models generated by SWISS-MODEL.
- Homology
Modeling David R. Bevan, Virginia Tech.
- Professional
Gambling, R. Rodriguez, Gert Vriend, EMBL Heidelberg, Germany
(since 2000, Univ. Nijmegen, Netherlands).
- How to
evaluate the quality of a model, Torsten Schwede, Manuel C.
Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.
- Molecular
Modeling for Beginners by Gale Rhodes, Univ. Southern
Maine, includes an introduction to DeepView, and a superb tutorial on
homology modeling (look through the left index frame for the link to Homology
Modeling).
This is the best starting place for beginners who
want to learn about homology modeling. It guides you through the use of NCBI Entrez
to find a sequence in the human genome, using SWISS-MODEL to get a homology
model, and most importantly, using DeepView to visualize and evaluate the
model.
- DeepView (also known as SwissPDBViewer) is an excellent free modeling program by Nicolas Guex, Alexandre Diemand, Torsten Schwede & Manuel C. Peitsch at GlaxoWellcome. DeepView resources are indexed at molvisindex.org. DeepView comes with a built-in tutorial on homology modeling. This tutorial walks you through the steps but does not explain in detail what the program is doing. The SWISS-MODEL homology modeling server returns a DeepView-ready PDB file, with the model and each template in a different layer. DeepView has automated routines to display the sequence alignment, adjust gap positions, show energetically unfavorable regions of the alignment, find and fix sidechain clashes. It is very powerful but the many keyboard shortcuts and hard-to-find options make it a challenge to use effectively on an occasional basis.
- SWISS-MODEL,
An Automated Comparative Protein Modelling Server, Torsten
Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva,
Switzerland.
You just submit your sequence! It finds the best
template (if one exists), aligns the sequences, and returns the PDB file to you
automatically. You can choose whether to get back a 3D alignment of the model
with the template(s), or just the model.
- DeepView: Integrated with
SWISS-MODEL. See above under Tutorials.
- WHAT IF Web Interface (click
on Build/check/repair model). Roland Krause, Gert Vriend, Univ.
Nijmegen (in USA, say "Nigh-maygen"), Netherlands.
To use the WHAT IF model builder, you must choose
your template and prepare your alignment first.
The following opinion was sent to the Protein Data Bank Discussion Forum in
November, 1999 by Gert Vriend:
One of the goals of the WHAT IF homology modelling
module is to produce models that are as good as possible. Another goal is to
make errors as obvious as possible when they are unavoidable. Todays modelling
technology (which includes MD programs) cannot yet predict where a loop will
find its new position if it is disturbed by for example mutations or by binding
a ligand or sugar. In WHAT IF we therefore decided not to make a random motion
(and without insult meant to my friends in the MD world, optimising a mutated
loop by MD invariably looks like a random motion) but just to leave the
backbone as 'untouched' as possible. The results in the biannual CASP
competition are every round making more clear that this is (still) the best
strategy. However, not moving the backbone accounts for about 2/3-rd of the
total modelling error in WHAT IF's models.
- There are several other
homology modeling servers, but they appear less fully developed than the
two above.
Protein threading
Detailed ppt is on
Protein ab initio structure prediction
Introduction:
The biological role
of a protein is determined by its function, which is in turn largely
determined by its
structure. Thus there is enormous benefit in knowing the three
dimensional
structures of all the proteins. Although more and more structures are
determined
experimentally at an accelerated rate, it is simply not possible to determine
all
the protein
structures from experiments. As more and more protein sequences are
determined, there is
pressing need for predicting protein structures computationally.
Decades of intense
research in this area brought about huge progress in our ability to
predict protein
structures from sequences only. The protein structure prediction methods
can be broadly
divided into three categories:
1) homology modeling,
2) threading or fold
recognition, and
3) Ab Initio.
Essentially, the classification reflects the degree to which
different methods
utilize the information content available from the known structure
database. In the
following, I will briefly discuss each kind of methods and their accuracy,
applicability and
shortcomings. Possible improvements to protein structure prediction are
also discussed.
Comparative homology modeling:
So far protein
prediction methods based on homology have been the most successful.
Homology modeling is
based on the notion that new proteins evolve gradually from
existing ones by
amino acid substitution, addition, and/or deletion and that the 3D
structures and functions
are often strongly conserved during this process. Many proteins
thus share similar
functions and structures and there are usually strong sequence
similarities among
the structurally similar proteins. Strong sequence similarity often
indicates strong structure
similarity, although the opposite is not necessarily true.
Homology modeling
tries to identify structures similar to the target protein through
sequence comparison.
The quality of homology modeling depends on whether there
exist one or more
protein structures in the protein structure databases that show
significant sequence
similarity to the target sequence.
There are usually
four steps in homology based protein structure prediction methods:
(1) identify one or
more suitable structural templates from the known protein structure
databases;
(2) align the target
sequence to the structural template;
(3) build the
backbone
from the alignment,
including the loop region and any region that is significantly
different from the
template; and
(4) place the side-chains.
The first two steps, identification of structural templates and alignment of
the target sequence onto the parent structures, are usually related. Sequence
comparison methods determine sequence similarity by aligning the sequences
optimally. The aligned residues of the structure templates are then used to
construct the structural model in the third step.
The quality of the sequence
comparison thus not only determines whether a suitable structural template can be
found but also the quality of the alignment between the target sequence and the
parent structure, which in turn determines the accuracy of the structural
model. Of critical importance is the ability of the sequence comparison to
detect remote homologues and to correctly align the target sequence to the
parent structure. In the following, I discuss the various sequence comparison
methods in relation to homology modeling and their range of applicability,
accuracy and shortcomings.
For comparative
modeling, local sequence comparison methods are usually used since the
sequence similarity
is most likely over segments of the two sequences. The local
sequence comparison
can be either pairwise or profile based. Pairwise comparisons,
such as those made with the early and widely
used BLAST (Altschul, 1990), can detect sequence
similarities better
than 30%. A number of tools have also been developed to detect weak
homology
relationships. Methods like profile (Gribskov, 1987) and HMM (Krogh, 1996)
use a statistical
profile of a protein family. To further increase the chance of detecting
remote homologues,
PSI-BLAST (Altschul, 1997) and SAM-T98 (Karplus, 1998) build
the profile or HMM by
searching the database iteratively until no new hits are found.
Methods such as
PSI-BLAST encode the information about a whole protein family for
the target sequence
in a model to increase the chance of detecting remote homologies. To
further increase the
detection sensitivity, the sequences in the structure database can also
be encoded in
profiles. This forms the basis of the profile-profile based comparison
methods (Koehl,
2002). At low sequence identities (<20%), profile-profile methods
clearly outperform
the other two kinds of methods (Sauder, 2000): profile-profile
methods identified
more than 90% of homologous pairs, as determined from structure-structure
similarity
comparison, with sequence identity better than 10%, and an impressive 38% even
for cases with sequence identities between 5% and 9%.
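To make the idea of a "statistical profile of a protein family" concrete, here is a minimal sketch that turns a gapless multiple alignment into a position-specific log-odds matrix and scores a candidate segment against it. It is not the PSI-BLAST or HMM algorithm: the uniform background, the flat pseudocount, the function names and the toy family are assumptions, and the iterative database search is not shown.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs: List[str],
                  pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2 odds of each amino acid versus a uniform background."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile: List[Dict[str, float]], segment: str) -> float:
    """Sum of column scores for an ungapped segment (standard residues only)."""
    if len(segment) != len(profile):
        raise ValueError("segment length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    family = ["ACD", "ACE", "AGD"]  # hypothetical aligned family members
    prof = build_profile(family)
    print(score_segment(prof, "ACD"), score_segment(prof, "WWW"))
```

A segment that follows the family's conserved positions scores higher than an unrelated one even if it is not identical to any single member, which is the source of the extra sensitivity over pairwise comparison.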
The structure models
are constructed from the residues of the structure template that are
aligned to the target
sequence in the sequence comparison. The quality of this alignment
is thus critical for
the accuracy achievable. The residues aligned by sequence
comparison
generally differ from those aligned by structure-structure comparison, however,
especially when the
sequence identity is low. To assess the ability of the sequence
comparison methods to
align the sequences correctly, it is instructive to compare the
sequence-sequence
alignment to the structure-structure alignment of the same pair of
proteins. To
determine how well the different similarity search methods can detect
remote homologies and
assess their ability in correctly aligning the sequences, Sauder et
al. (Sauder, 2000)
compared various sequence alignment methods to the CE (Shindyalov,
1998) structure
alignment of the SCOP (Murzin, 1995) protein structures. For sequence
identities less than
30%, profile-based comparison methods, such as PSI-BLAST and
profile-profile
comparison, are all clearly better than the pairwise BLAST method.
For example, at
10-15% sequence identity, BLAST aligns only 20% correctly, while PSI-BLAST and
profile-profile comparison correctly align 40% and 48%, respectively.
This also indicates
that there is still considerable room for improvement in correctly aligning the
target sequence to
the template structure.
One indication of the
accuracy of comparative modeling is the sequence identity between
the target and the
template. It is believed that if two protein sequences have 50% or
higher sequence
identity, then the RMSD of the alignable portion between the two
structures will
normally be less than 1 Å (Gerstein, 1998). In the so-called “twilight zone”
(Doolittle, 1986),
with sequence identity between 20% and 30%, however, 95% of the sequence pairs with
this level of
identity have different structures (Rost, 1999). When a structure
template can indeed
be found within the known protein structure databases in such cases,
the backbone RMSD can
be expected to be no better than 2 Å (Chung, 1996). Structurally
similar proteins can
have low sequence identities in the 8-10% range (the “midnight zone”,
Rost, 1997) and can
still be identified with sensitive profile-profile based comparison, but
the RMSD can be as
large as 3-6 Å. The error largely comes from misalignment in the
sequence comparison.
At such low sequence identity, a comparison method that can detect
the remote homology
as well as align the sequences close to the optimum given by structure-structure alignment
will be desirable.
Threading or fold recognition:
For evolutionarily
remotely related proteins, even if the sequence similarity is difficult to
detect with sequence
comparison methods, there could still be identifiable structural
similarity. Structure
alignments have been shown to be able to identify homologous
protein pairs with
sequence similarities less than 10% (Gerstein, 1998; Brenner, 1998;
Rost, 1997). When
sequence comparison based methods are no longer sensitive enough
to recognize the
correct fold for the target sequence, fold recognition or threading can
still be used to
assign the correct fold to the target sequence.
Threading or fold
recognition is the method by which a library of unique or
representative
structures is searched for structure analogs to the target sequence, and is
based on the theory
that there may be only a limited number of distinct protein folds. For
example, in an early
paper, Chothia postulated that the number of unique protein folds
would be on the order
of only about 1000 (Chothia, 1992). In
another estimation,
the number of distinct domains and folds was placed around 7000
(Orengo et al.,
1994). Even though the number of new structures solved has been
increasing at an
accelerated rate (close to 3000 structures solved in 2002), the proportion
of new folds, as
determined by the CE algorithm (http://cl.sdsc.edu/ce.html),
to the total
number of new
structures solved in a given year decreased steadily from an average of ca. 30% in
the 1980s
to only ca. 8% in 2001 (http://www.rcsb.org/pdb/holdings.html). It is
reasonable to expect that as more and more protein structures are determined
experimentally, we will be able to find close structure analogues in the
databases of known structures for almost any protein sequence in the near
future.
Threading or fold
recognition involves similar steps as comparative modeling. The
difference is in the
fold identification step. First of all, a structure library needs to be
defined. The library
can include whole chains, domains, or even conserved protein cores.
Once the library is
defined, the target sequence is fitted to each library entry and an
energy function is
used to evaluate the fit between the target sequence and the library
entries to determine
the best possible templates. Depending on the algorithms used to align the
target sequence with
the folds and the energy functions used to determine the best fits,
threading methods can
roughly be divided into four classes (Jones, 2001).
(1) The earliest
threading methods used the environment of each residue in the structure as the energy
function and dynamic programming to evaluate the fit and the alignment
(Bowie, 1991).
(2) Instead of using
an overly simplified residue environment as the energy function, statistically
derived pairwise interaction potentials (Sippl, 1990) between residue pairs or
atom pairs can be used to evaluate the best possible fits between the target
sequence and library folds (Jones, 1992). In this method, for efficient optimal
alignment between the target sequence and the folds, the potential for residue
i is obtained by summing over all the pairwise potentials involving i,
and then the “double dynamic programming” method (Taylor, 1989; Jones, 1998) can
be used.
(3) The third kind of method does not use any explicit energy function
at all. Instead, the secondary structure and accessibility of each residue are
predicted first, and the target sequence and library folds are encoded into
strings for the purpose of sequence-structure alignment.
(4) Finally, sequence
similarity and threading can be combined for fold recognition. For
large-scale genome-wide
protein structure prediction, sequence similarity can first be used
for the initial
alignments, and the alignments can then be evaluated by threading methods
(Jones, 1999).
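The first class of threading method listed above (residue environments scored by dynamic programming) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the two-state environment string ("B" buried, "E" exposed), the hydrophobic residue set, the +1/-1 scores, the linear gap penalty and the function names. Real methods use richer environment classes and statistically derived scores.

```python
from typing import Dict, List

HYDROPHOBIC = set("AVILMFWC")  # illustrative grouping only


def env_score(residue: str, environment: str) -> float:
    """+1 if a hydrophobic residue is buried or a polar one exposed, else -1."""
    buried = environment == "B"
    return 1.0 if buried == (residue in HYDROPHOBIC) else -1.0


def thread_score(sequence: str, environments: str, gap: float = -2.0) -> float:
    """Best global alignment score of a sequence onto a fold's environments."""
    n, m = len(sequence), len(environments)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + env_score(sequence[i - 1], environments[j - 1]),
                dp[i - 1][j] + gap,   # residue left unaligned
                dp[i][j - 1] + gap,   # structure position left unfilled
            )
    return dp[n][m]


def rank_folds(sequence: str, library: Dict[str, str]) -> List[str]:
    """Library fold names sorted from best to worst threading score."""
    return sorted(library, key=lambda name: thread_score(sequence, library[name]),
                  reverse=True)


if __name__ == "__main__":
    library = {"fold_a": "BEBEBE", "fold_b": "EBEBEB"}  # hypothetical folds
    print(rank_folds("LKVEIS", library))
```

Each library entry requires its own alignment, so the cost grows with library size, and with pairwise potentials the score of a position depends on its partners, which is why simple dynamic programming no longer suffices there.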
The threading methods
are limited by the high computational cost since each entry in the
whole library of
thousands of possible folds needs to be aligned in all possible ways to
select the fold(s).
Another major bottleneck is the energy function used for the evaluation
of the alignment. As
these functions are drastically simplified for efficient evaluation, it
is not reasonable to
expect to be able to find the correct folds in all cases with a single
form of energy
function. Nevertheless, with the current functions, it is possible to reduce
the thousands of
possible folds to only a few. As in the comparative modeling case,
for sequence
similarities at the protein family level, threading can produce alignments that
are accurate to 1 to
3 Å, and in cases with low sequence similarity at the superfamily
level, accuracy in
the range of 3 to 6 Å can still be expected. As more protein structures
are determined and
sequence comparison methods improve, however, fold assignment for more and
more target sequences
can be achieved by comparative modeling.
Worth mentioning is
the threading program PROSPECT (Xu, 2001), which performed
best in its category
in the CASP4 competition. What is unique to PROSPECT is that it is
designed to find the
globally optimal sequence-structure alignment for the given form of
energy function (Xu,
2000). The divide-and-conquer algorithm is used to speed up the
calculation by
explicitly avoiding the conformation search space that is shown not to
contain the optimal
alignment (Xu, 1998). In several cases with sequence identity as
low as 17%, perfect
sequence-structure alignment is still achieved for the alignable
portions between the
target and template structures. Even in cases where no fold templates
exist for the target
sequence, important features of the structure are still recognized
through threading the
target sequence to the structures.
Ab Initio methods:
When no suitable
structure templates can be found, Ab Initio methods can be used to
predict the protein
structure from the sequence information only. Common to all Ab
Initio methods are:
1) Suitably defined
protein representation and corresponding protein
conformation space in
that representation;
2) Energy functions
compatible with the protein representation;
3) Efficient and
reliable algorithms to search the conformational space to minimize the energy
function.
The conformations that minimize the energy function are taken to be
the structures that the protein is likely to adopt at native conditions. The
folding of the protein sequence is ultimately dictated by the physical forces
acting on the atoms of the protein, and thus the most accurate way of
formulating the protein folding or structure prediction problem is in terms of
an all-atom model subject to the physical forces. Unfortunately, the complexity of such
a representation makes the solution simply impossible with today’s
computational capacity. For practical reasons, most Ab Initio prediction
methods use reduced representations of the protein to limit the conformational
space to a manageable size and use empirical energy functions that capture the
most important interactions that drive the folding of the protein sequence
toward the native structures.
Currently, many Ab
Initio methods can predict large contiguous segments of the protein to an accuracy
within 6 Å RMSD, and there are several reviews that highlight the successes and
failures of the current Ab Initio methods (Hardin, 2002 and references
therein). The ROSETTA Ab Initio method performed better than the other Ab Initio
methods in the recent CASP4 meeting and there is an extensive literature
(Bonneau, 2001; Simons, 2001; Bonneau, 2001) covering this method, so we
concentrate on a brief discussion of the method used in ROSETTA. The ROSETTA method
also illustrates many features and techniques that are common to the majority
of the Ab Initio methods based on reduced representation of the protein and
empirical potentials. Discussion of other methods with empirical potentials can
be found in Hardin’s review (Hardin, 2002).
The ROSETTA method,
like many others, uses a reduced representation of the protein as
short segments. This
representation can be attributed to the observation by Go (Go, 1983)
that local segments
of the protein sequence have statistically important preferences for
specific local
structures and that the tertiary structure has to be consistent with this
preference. In
ROSETTA the protein is represented by short sequence segments and the
local structures they
can adopt are assumed to be those found in all the known protein
structures. (Simons,
1997) The energy function is defined as the Bayesian probability of
structure/sequence
matches and this forms the basis of the Monte Carlo sampling of the
reduced protein
conformational space (Simons, 1997). The non-local potential, which
drives the protein
toward compact folded structures, includes terms that favor paired
strands and buried
hydrophobic residues. The solvation effect can also be incorporated in
the energy function.
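The general pattern of Monte Carlo sampling with fragment replacement can be illustrated with a toy example. This is not the ROSETTA implementation or its scoring function: the single torsion angle per residue, the placeholder energy (which simply favors -60 degrees), the fragment library, the Metropolis acceptance rule and all names are assumptions chosen to keep the sketch short and runnable.

```python
import math
import random
from typing import List, Sequence, Tuple


def toy_energy(angles: Sequence[float]) -> float:
    """Hypothetical energy: zero when every angle equals -60 degrees."""
    return sum((a + 60.0) ** 2 for a in angles) / 1000.0


def fragment_monte_carlo(n_residues: int,
                         fragments: List[Tuple[float, ...]],
                         steps: int = 2000,
                         temperature: float = 1.0,
                         seed: int = 0) -> Tuple[List[float], float]:
    """Metropolis search in which each move overwrites a window with a fragment."""
    rng = random.Random(seed)
    frag_len = len(fragments[0])  # all fragments assumed to share this length
    current = [180.0] * n_residues  # start fully extended
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        start = rng.randrange(n_residues - frag_len + 1)
        trial = list(current)
        trial[start:start + frag_len] = rng.choice(fragments)
        trial_e = toy_energy(trial)
        delta = trial_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann odds.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


if __name__ == "__main__":
    library = [(-60.0, -60.0, -60.0), (-120.0, 130.0, -120.0), (180.0, 180.0, 180.0)]
    conformation, energy = fragment_monte_carlo(9, library)
    print(energy)
```

Restricting moves to fragments seen in known structures keeps the search inside locally plausible conformations, which is the point of the reduced representation described above.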
A problem intrinsic
to the reduced representation of the protein and the simplified
empirical potential
is that the energy function is not sensitive enough to differentiate the
correct native
structures from conformations that are structurally close to the native state.
The energy landscape
calculated from such energy functions will not be properly
funneled but
flattened and caldera-like around the native structure. In fact, as the native
state is approached,
the correlation between the calculated energy and the measure of
similarity between
predicted and native structures is no longer valid. The usual practice
is then to produce a
large number of decoy structures and then use various filtering and
clustering techniques
to pick out the more native-like structures. Filters can be used to
eliminate structures
with poorly formed secondary structures and low contact orders
compared with those
of sequences of comparable length (Bonneau 2001). The other
important technique
is to use multiple sequences similar to the target sequence to
generate decoy
structures. Structures thus generated usually form dense clusters that are
more compatible to
the native structures of protein families of similar sequences than
those obtained from a
single sequence only.
Many Ab Initio
methods now can predict long segments of the protein sequence with
backbone atom RMSD
less than 6 Å. The predicted local structures are usually right, with
the correct contacts
among residues. One of the largest sources of error was identified
to be in the contacts
between residues that are distant in the sequence, as measured by the contact
order (CO) (Bonneau,
2002).
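The text does not define contact order; a commonly used form is the relative contact order, the average sequence separation of contacting residue pairs divided by the chain length. The sketch below assumes that definition and takes the contact list as given (how contacts are detected, for example by a distance cutoff, is left out); the function name and example contacts are illustrative.

```python
from typing import Iterable, Tuple


def relative_contact_order(contacts: Iterable[Tuple[int, int]],
                           n_residues: int) -> float:
    """Mean sequence separation |i - j| of contacts, divided by chain length."""
    pairs = list(contacts)
    if not pairs or n_residues <= 0:
        raise ValueError("need at least one contact and a positive length")
    total_separation = sum(abs(i - j) for i, j in pairs)
    return total_separation / (len(pairs) * n_residues)


if __name__ == "__main__":
    local = [(0, 3), (1, 4), (2, 5)]   # helix-like, short-range contacts
    distant = [(0, 9), (1, 8)]         # contacts between chain ends
    print(relative_contact_order(local, 10))
    print(relative_contact_order(distant, 10))
```

A low value means the contacts are mostly local in sequence; structures dominated by long-range contacts have high values and are the ones described above as hardest to predict.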
The most accurate and
successful method so far has been comparative modeling based on sequence similarity
comparison, especially when there exists a structure template with high
sequence identity to
the target. One major advance in comparative modeling is the development of very
sensitive profile-based
sequence comparison methods such as PSI-BLAST and profile-profile sequence
comparison. Profile-profile based sequence comparison methods are usually
superior in that they can pick up possible homologous structure templates
even when the sequence identity is very low, and in that profile-profile comparison
can align the sequence to the structure template more accurately, producing more
accurate structure models. As more and more novel sequences are produced by the
genome projects, the profile-based methods can be expected to become even more sensitive.
Fold assignments that were traditionally accomplished by threading methods can
be done with comparative modeling instead. On the other hand, Ab Initio based methods
can still be expected to play an important role in identifying new folds as the
accuracy of these methods increases.
There is a wide range
of possible applications for protein structure prediction, requiring
different accuracy of
the predicted structures (Baker, 2001). For applications like
studying catalytic
mechanisms and ligand docking in drug design, high accuracy
structures with RMSD
within 1 Å of the native structure are required. Low accuracy
structures with RMSD
in the range of 1.5-3.5 Å for more than 80% of the sequence can
be used for tasks
like fitting X-ray structures. Reliable functional annotation and active
site prediction can
usually be achieved with accuracy of 4-8 Å for over 80 amino acids,
which is well within
the current capability of Ab Initio methods like Rosetta. When
structure templates
with sequence identity over 50% can be found, the main chain atoms
can be modeled to 1 Å
RMSD, with the main error coming from the loop regions and side chains.
With sequence
identities between 30% and 50%, main chain accuracy of 1.5 Å can be
expected. When the
sequence identity is below 30%, the error in the aligned main chain
atoms can be
estimated from the sequence difference. A simple linear relation has been
found between the
structural difference and sequence difference if the sequence
difference is taken
to be the average of that between the sets of sequences compatible with
the structures (Koehl
2002, Koehl 2002). A more serious problem for comparative
modeling in cases
with low sequence identity is the false positives. With highly sensitive
profile-profile based
methods, even if several structures may be identified to have
sufficient sequence
similarity, it could happen that none of these structures is the correct
template for the
target sequence. In fact, it has been shown that 95% of the sequence pairs
with 20-30% identity
(the twilight zone) are not structurally similar (Rost, 1999). In such
cases the possible
structure templates can be subjected to a further threading test for
validation.
Threading methods have
been shown to be able to avoid false positives
(Panchenko, 1999).
Threading with the limited number of possible structure templates
also avoids one of the
computational bottlenecks in threading methods.
Further improvements in the predicted protein structures can be expected on several
fronts, and I briefly discuss these possibilities.
(1) First and
foremost, the largest improvements will come from more experimentally determined
structures. As more and more protein structures are determined experimentally, it
is conceivable that more and more target sequences will have compatible
structures already deposited in the known structure database. This will
increase not only the chance that the comparative modeling can assign the fold
correctly but also the likelihood that the fold identified is more structurally
similar to the target, thus increasing the accuracy of the structural model.
(2) Further improvement in the sequence-structure alignment can also improve the accuracy
of the structure model. Current sequence comparison methods can align only a fraction of the
residues that can be aligned in structure alignments (Sauder, 2000). Better-aligned
residues can undoubtedly improve the accuracy of the structure model. There is probably
a limit to how well sequence comparison methods can align sequence to
structure when the sequence identity is low. One possible way of improving
sequence-structure alignment might be to use threading-based techniques to align
the sequence to structures identified in comparative modeling. With better
energy functions for evaluating the fit between sequence and target, this could
be very effective.
(3) Refinement of the structure models generated from homology modeling, threading, or
even ab initio methods can be accomplished by molecular dynamics (MD) with accurate
all-atom physical potentials. The most severe obstacle to the application of MD in protein
structure prediction has been the long time it takes for the protein to fold from completely
unfolded states. This is probably due to the energy barriers encountered in the
course of folding. If the simulation starts from the near-native structures
generated by protein structure prediction methods, the MD simulation
can perhaps reach the native structure much more easily.
Protein design emphasis on structural Bioinformatics
There are many reasons to pursue the goal of protein
design. In medicine and industry, the ability to precisely engineer protein
hormones and enzymes to perform existing functions under a wider range of
conditions, or to perform entirely new functions, has tremendous potential. Furthermore,
in the case of rational protein design, the knowledge obtained is likely to be
linked to a more complete understanding of the forces underlying protein
folding, enabling more rapid interpretation of the wealth of genomic information
being amassed. Advances in protein design may also make possible the
construction of a range of other self organizing macromolecules. Although some
steps have been taken towards the rational design of functional enzymes, such a
goal lies some distance away. Currently, attention is focused on redesigning portions
of proteins to insert particular motifs, increase stability or modify function.
Examples include the engineering of metal-binding centers, reviewed recently by
Hellinga, and the introduction of disulfide bonds. Theoretical work in the
context of lattice models has also led to important insights. This work has
been recently reviewed.
Attempts to design entire proteins de novo have
been increasingly successful over the past decade. Early design efforts
typically led to poorly characterizable states or molten globules, instead of a
single target fold. Other difficulties became apparent when a designed α-helical
dimer was shown to actually form a trimer. This and subsequent studies relied
on largely qualitative examinations of the target molecule, making generalization
to other targets difficult.
Energy expression
Atomistic protein design requires
an energy expression or force-field to rank the desirability of each amino acid
sequence for a particular backbone structure. Over the past decade, elements of
a suitable energy expression for atomistic protein design have been suggested
and explored. To avoid over-fitting and to focus on only the most important contributors,
the energy expression should contain as few terms as possible while maintaining
predictive power. Communication between theory and experiment is required to determine
which energy terms to include, and the relative importance of the included terms.
In a protein design cycle, an energy expression is used to generate sequences
that are subsequently made in the laboratory. Alterations and additions to the
energy expression are then considered which improve the correlation between the
computed and experimentally determined properties of the sequences. The improved
energy expression is then used to generate new sequences, completing the cycle.
Energy minimization
In order to experimentally test
the energy expression, the minimum-energy sequence of the target backbone must
be determined. In the simplest implementation, the energy of every possible
sequence is calculated using the energy expression, and the lowest energy
sequence is reported. The size of most problems of interest renders this
exhaustive approach impractical. Ignoring the possibility of multiple conformations
of each amino acid, allowing the 20 naturally occurring amino acids at every
position of a 100 amino acid protein yields 10^130 possible sequence solutions.
Clearly, ingenious energy minimization techniques are necessary.
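The size of this sequence space can be checked with a few lines of arithmetic. The sketch below computes 20^100 exactly and then, purely to convey the scale, estimates how long exhaustive enumeration would take at a hypothetical rate of one billion sequence evaluations per second (that rate is an assumption, not a figure from the text).

```python
import math

positions = 100
amino_acids = 20

n_sequences = amino_acids ** positions           # exact integer, 20^100
exponent = positions * math.log10(amino_acids)   # about 130.1
print(len(str(n_sequences)) - 1)                 # order of magnitude: 130

# Hypothetical evaluation rate, used only to illustrate the scale of the problem.
rate_per_second = 1e9
seconds_per_year = 3600 * 24 * 365
years = n_sequences / (rate_per_second * seconds_per_year)
print(f"about 10^{math.floor(math.log10(years))} years at 1e9 sequences per second")
```

Even at this generous rate the enumeration would take on the order of 10^113 years, which is why exhaustive search is ruled out.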
Published search algorithms,
including self-consistent mean-field approaches, Monte Carlo techniques, neural
networks and genetic algorithms, share the advantage of being able to sample
large combinatorial spaces, but the disadvantage of not being guaranteed to find
the globally optimal solution. By contrast, dead-end elimination and
branch-and-terminate (DB Gordon and SLM, unpublished data) are search
algorithms that give a final solution that is guaranteed to be the global
optimum, but which require the discretization of sidechain conformations into
rotamers. Such requirements will be discussed below. Search algorithms have
been recently reviewed.
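As an illustration of how dead-end elimination can prune a rotamer search without losing the global optimum, here is a minimal sketch on a toy problem. It assumes the energy is a sum of per-rotamer self terms and pairwise terms, and it uses the simplest form of the elimination criterion: a rotamer is discarded if its best possible energy is still worse than the worst possible energy of some alternative at the same position. The function names and all energy values are made up for the example.

```python
from itertools import product

def pair(pair_e, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

def total_energy(self_e, pair_e, choice):
    """Energy of one full assignment: self terms plus all pair terms."""
    n = len(choice)
    energy = sum(self_e[i][choice[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][choice[i]][choice[j]]
    return energy

def dead_end_elimination(self_e, pair_e):
    """Repeatedly discard rotamers that cannot be part of the global minimum."""
    n = len(self_e)
    alive = [set(range(len(rotamers))) for rotamers in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in sorted(alive[i]):
                # Best case for r: most favorable surviving partner everywhere.
                best_r = self_e[i][r] + sum(
                    min(pair(pair_e, i, r, j, s) for s in alive[j]) for j in others)
                for t in sorted(alive[i] - {r}):
                    # Worst case for t: least favorable surviving partner everywhere.
                    worst_t = self_e[i][t] + sum(
                        max(pair(pair_e, i, t, j, s) for s in alive[j]) for j in others)
                    if best_r > worst_t:  # r always loses to t, so r is a dead end
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

# Toy problem (made-up energies): position 0 has three rotamers, positions 1 and 2 have two.
self_e = [[0.0, 5.0, 1.0], [0.0, 4.0], [2.0, 0.0]]
pair_e = {
    (0, 1): [[0.0, 0.5], [-0.5, 0.0], [-1.0, 0.5]],
    (0, 2): [[0.2, 0.0], [0.0, 0.3], [0.0, -0.8]],
    (1, 2): [[0.0, -0.2], [0.3, 0.0]],
}

alive = dead_end_elimination(self_e, pair_e)
# Brute-force check, feasible only because the toy problem is tiny.
best = min(product(*(range(len(r)) for r in self_e)),
           key=lambda c: total_energy(self_e, pair_e, c))
print(alive)  # [{2}, {0}, {1}]
print(best)   # (2, 0, 1)
```

Here elimination happens to leave a single rotamer per position. In general it leaves a reduced set that still has to be searched, but the global minimum is always among the survivors.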
Discretization of sidechain conformations
To place a reasonable limit on
the complexity of the computation, the allowed sidechain conformations are
typically chosen from a library of discrete possibilities, known as rotamers.
This discretization is necessary for some efficient search algorithms to be
applicable — in particular, the dead-end elimination theorem. Discretization of
the sidechain conformations increases the likelihood of ‘false negative’
results. To be useful, atomistic protein design has only to output a subset of
the sequences leading to the target fold, with simulation energies that
correlate with their experimental stabilities. The simulation does not need to
predict how well externally supplied sequences will fit the target fold. For
example, the crystallographic
structure of the Streptococcal
protein G B1 domain (GB1) shows Leu7 in
an unusual conformation that does not appear in standard rotamer libraries.
Therefore, an atomistic algorithm using such a library may not suggest leucine
at position 7 in the top ranked sequences. The effect of the size of the
rotamer library has also been considered; in general, the larger the library
the better. If the library contains too many similar conformations of each
amino acid, however, the energy landscape is flattened and energy minimization
can be slow.
Residue classification
A reductionist approach to protein
design, in which subsets of a protein are designed independently, has proven
fruitful. Computational attempts to design protein cores date back many years.
More recently, there have been attempts to design surfaces and boundary
positions as well. The size of the design problem is reduced if only a subset of
amino acid types need be considered in each of these three classes of residue
positions. Protein cores are typically composed of hydrophobic amino acids, and
protein surfaces are largely composed of hydrophilic amino acids, but the
boundary residues must be selected from the full range of amino acids as these
positions are observed to be both hydrophobic and hydrophilic. An automated way
to classify residue positions is desirable, and a number of approaches have
been described. The important components of the energy expression relevant to
the core, surface and boundary will be discussed in the following sections.
The core
Early attention on the protein
design problem focused on the generally hydrophobic cores of proteins. It is
believed that the folding process is driven principally by hydrophobic collapse
of the polypeptide, implying that a well designed hydrophobic core is crucial
to the structure and stability of the protein. As might be expected, van der
Waals forces (i.e., packing constraints) are crucial when designing the protein
core. Models in which packing constraints are the only element of the energy
expression are able to predict the stabilities of core mutations with high accuracy,
when polar substitutions are not allowed. The importance of packing constraints
can be determined by scaling the atomic van der Waals radii by a factor α. When
α is varied to very high (>105%) or very low (<85%) values, implying too
little or too much volume being packed into the available space, respectively,
the resulting proteins exhibit unfolded or molten globule-like behavior. This
is not surprising. Too much volume clearly requires the backbone to shift to
accommodate the excess. Too little volume would either leave cavities in the
core, which have been shown to destabilize proteins, or again force the
backbone to shift to fill the cavity. When the protein backbone is
significantly different from the model backbone, the model can no longer
accurately predict the stability of the protein, and there may cease to be a
single stable folded state. The optimal value of α was found to be 90%,
implying that a slight over-packing of hydrophobic residues in the core can
actually stabilize a designed protein. The benefit of using slightly diminished
van der Waals radii can also be interpreted in terms of accommodating some
backbone and rotamer flexibility. Consistent with the belief that the
hydrophobic effect is a dominant cause of protein folding, the protein design cycle
has been used to show that solvation effects also have an important role in the
design of protein cores.
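The effect of scaling the van der Waals radii can be sketched with a simple 12-6 Lennard-Jones term in which the contact distance is multiplied by the scale factor. This is only an illustration: the functional form, the atomic radius of 1.9 Å and the well depth of 0.1 kcal/mol are assumptions chosen for the example, not parameters from the studies described above.

```python
def lj_energy(distance, radius_i, radius_j, scale=1.0, well_depth=0.1):
    """12-6 Lennard-Jones energy with scaled radii.
    The minimum (-well_depth) lies at distance = scale * (radius_i + radius_j)."""
    r0 = scale * (radius_i + radius_j)
    ratio = r0 / distance
    return well_depth * (ratio ** 12 - 2.0 * ratio ** 6)

# Two carbon-like atoms (assumed radius 1.9 Angstroms) in close contact at 3.2 Angstroms.
unscaled = lj_energy(3.2, 1.9, 1.9, scale=1.0)
scaled = lj_energy(3.2, 1.9, 1.9, scale=0.9)
print(unscaled > 0.0)  # True: the contact is penalized with full-size radii
print(scaled < 0.0)    # True: the same contact is tolerated with 90% radii
```

With the radii reduced to 90%, a contact that would be rejected as a clash becomes acceptable, which is one way to picture both the slight over-packing and the tolerance for small backbone and rotamer shifts.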
The hydrophobic effect is usually
approximated as an energy benefit proportional to the amount of
solvent-accessible hydrophobic surface area that is buried upon folding. A
penalty for burying polar area may also be included. Calculation of solvation
energies is complicated by the need to construct the energy expression as a sum
of two-body interactions. An entropic term has been tested, which may improve
the correlation between predicted energy and biological activity. Such a term
should in particular penalize methionine, as the loss of rotational freedom upon
burial of this residue in a protein core can lead to destabilized proteins.
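A surface-area-based solvation term of the kind described above can be sketched as a linear function of buried area. The coefficients below are placeholders chosen only to show the signs of the two contributions; they are not fitted values from any published energy expression.

```python
def solvation_energy(buried_nonpolar_area, buried_polar_area,
                     nonpolar_benefit=0.03, polar_penalty=0.1):
    """Toy solvation term (kcal/mol) from buried areas in square Angstroms.
    Burying nonpolar area lowers the energy; burying polar area raises it.
    Both coefficients are illustrative placeholders."""
    return (-nonpolar_benefit * buried_nonpolar_area
            + polar_penalty * buried_polar_area)

# Burying 100 square Angstroms of nonpolar area, with and without 20 of polar area.
print(solvation_energy(100.0, 0.0) < 0.0)                            # True
print(solvation_energy(100.0, 20.0) > solvation_energy(100.0, 0.0))  # True
```

The buried area of a residue depends on all of its neighbors at once, which is why expressing such a term as a sum of two-body interactions requires an approximation.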
The surface
With the successful redesign of a
range of protein cores, it is natural to consider the redesign of protein
surfaces. Despite the incontrovertible role of the hydrophobic core in folding,
the surface is also crucial to a protein’s structure and stability. The protein
design cycle has been utilized to design surface sites, using as a starting
point the energy expression determined from studies of protein cores. These studies
showed the importance of electrostatics and hybridization-dependent hydrogen
bonds. In the case of α-helical surfaces, no further energy terms are
necessary to achieve good predictive ability. This is possibly because the
sidechains that are better hydrogen-bond formers are also good α-helix formers,
as quantified by α-helical propensity.
The above energy terms are not
sufficient to design β-sheet surfaces, however. It may be necessary to directly
bias the energy expression towards those sidechains with good β-sheet propensities.
This is physically justifiable because common energy expressions do not otherwise
include sidechain self-energies, which must at some level lead to propensities.
It is also possible that a main source of β-sheet stability is to be found
elsewhere, for example, in the hydrogen bonds that cause alignment with
neighboring β strands. In the case of antiparallel β strands, the turn joining
the two strands has an important role. Modifying the component residues of the turn
can seriously affect protein stability. In the case of noncontinuous strands, it has
been suggested that small clusters of hydrophobic area on the surface may help
to set the register. The hydrophobic effect may drive neighboring strands to align
in such a way as to bury as much of the exposed hydrophobic area as possible, for
example, by covering it with long amphiphilic sidechains.
The boundary
Some residues cannot be easily
classified as core or surface constituents. Depending on the sidechain orientation
they can interact with either the core of the protein or with the solvent. One such
example is Trp43 of GB1, which is predicted by modeling to rotate out into the
solvent when nearby core residues are replaced with larger sidechains. Such
unfavorable behavior can be attenuated by a hydrophobic exposure penalty. Recent
work has shown that the design of boundary residues can lead to impressively
enhanced stability. Just four boundary-site mutations in the 56-residue GB1
improve the stability from 3.3 kcal/mol to 7.1 kcal/mol at 50°C, converting a
mesophilic protein into a hyperthermophilic protein.
Full de novo sequence design
To date there exists only a
single example of a complete sequence calculation in which the structure of the
designed protein was experimentally shown to achieve the design target.
This calculation included one core position, seven boundary positions and 18
surface positions, leading to a total of 10^27 possible sequence solutions. The success
of this design effort underscores the power of computational approaches.
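The quoted total can be reproduced by restricting the amino acid types allowed in each residue class, as described in the section on residue classification. The alphabet sizes used below (7 core, 16 boundary and 10 surface types) are an assumption made for this sketch; the text above gives only the number of positions in each class.

```python
# Number of designed positions in each class, as stated in the text.
core_positions, boundary_positions, surface_positions = 1, 7, 18

# Assumed number of amino acid types allowed per class (not stated in the text).
core_types, boundary_types, surface_types = 7, 16, 10

restricted = (core_types ** core_positions
              * boundary_types ** boundary_positions
              * surface_types ** surface_positions)
unrestricted = 20 ** (core_positions + boundary_positions + surface_positions)

print(len(str(restricted)) - 1)    # order of magnitude: 27
print(len(str(unrestricted)) - 1)  # order of magnitude: 33
```

Under these assumptions, classifying the positions shrinks the search space by about six orders of magnitude compared with allowing all 20 amino acids at each of the 26 positions.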
Backbone
Most atomistic protein design efforts
require a fixed backbone. The calculation is performed under the assumption that
the target backbone is precisely the backbone that will be achieved by the computed
sequence. Fortunately, alterations in the backbone do not necessarily lead to large
changes in the accessible sequence space. In one study, a 2 Å root mean square
deviation (rmsd) in the backbone led to only a 0.5 Å rmsd in predicted
sidechain conformations. Backbone flexibility can be modeled by using a softer
van der Waals potential — in other words, giving the modeled atoms a fuzzy
edge. This effect can be obtained by using reduced atomic radii, which has been
shown to improve the stability of
designed proteins.
Protein backbone movements may be
incorporated if the backbone is parameterizable, although to keep the calculation
tractable, the number of sidechain rotamer combinations may be limited. A
coiled-coil with right-handed superhelical twist, the backbone of which was necessarily
designed de novo, has recently been reported, where 216 amino acid
sequences were considered.
Negative design
The importance of negative design
is the subject of much discussion. Recent work by Hellinga highlights the importance
of this issue in computational protein design.
The inverse-folding design method
determines the sequence of amino acids with the lowest energy when threaded
onto the target backbone. It is conceivable that in some cases the computed
sequence may actually prefer to fold to a different target structure, and that
a sequence with a slightly higher computed energy would fold to the desired
target. Unfortunately, knowledge of which structure will be adopted by the
computed sequence requires a solution to the protein folding problem. Lattice
models consisting of only two amino acid types can, however, be used to perform
both sequence design and fold prediction. In this context, proposals to include
non-thermodynamic potential functions aimed at addressing negative design
issues have been developed. The hydrophobic exposure penalty is one example of
negative design that improves predictive power. Despite the power of lattice
model simulations, it has been suggested that the design procedure may be
qualitatively different in such binary patterned systems.