Introduction to proteins
Proteins are polypeptides, which are made up of many
amino acids linked together as a linear chain. The structure of an amino acid
contains an amino group, a carboxyl group, and an R group, which is usually carbon
based and gives the amino acid its specific properties. These properties
determine the interactions between atoms and molecules, which are: van der
Waals force between temporary dipoles, ionic interactions between charged
groups, and attractions between polar groups.
Proteins form the very basis of life. They regulate
a variety of activities in all known organisms, from replication of the genetic
code to transporting oxygen, and are generally responsible for regulating the
cellular machinery and determining the phenotype of an organism. Proteins
accomplish their tasks in the body through interactions between their
three-dimensional (tertiary and quaternary) structures and various substrates.
These functional properties depend upon the protein's three-dimensional (3D)
structure. The 3D structures arise because
particular sequences of amino acids in a polypeptide chain fold to generate,
from linear chains, compact domains with specific structures. The folded
domains either serve as modules for larger assemblies or they provide specific
catalytic or binding sites.
Protein Stability
As odd as it may seem, native (folded) proteins are
only marginally stable under physiological conditions. Forces such as
hydrophobic effects, electrostatic interactions, and hydrogen bonding act
as stabilizing factors and are the main drivers of the protein folding
process.
Hydrophobic effects cause nonpolar substances to
minimize their contact with water, and they are the major determinant of native
protein structure. The aggregation of the nonpolar side chains in the interior
of a protein is favored by the increase in entropy of the water molecules that
would otherwise form ordered cages around the hydrophobic groups.
In the interior of the protein, where the atoms
are closely packed, van der Waals forces are individually weak and act only
over short distances, so these forces are lost when the protein is unfolded.
Hydrogen bonding is important because proteins fold
in such a way that nearly all buried hydrogen-bonding groups find a partner;
a buried group left unpaired would lose the stabilizing energy of the
hydrogen bond it formed with water in the unfolded state.
The residues that occupy the interior of the
protein largely direct protein folding. A protein returns to its native
conformation within a few seconds, which supports the idea that proteins follow
directed pathways to the native state rather than searching conformations at
random. As the protein folds, its free energy decreases, which makes folding
effectively a one-way process. The last stages of protein folding
depend on the specific sequence of the amino acids.
Thermodynamics
Entropy - The disorder of matter. Entropies of more complex molecules are
larger than those of simpler molecules, especially in a series of closely
related compounds.
Second Law of Thermodynamics - The total entropy of the universe is
continually increasing.
Protein stability depends on the free energy change between the folded and
unfolded states, which is expressed by
-RT ln K = ΔG = ΔH - TΔS
where R is the gas constant, T the absolute temperature, K the equilibrium
constant, ΔG the free energy change between the unfolded and folded states,
ΔH the enthalpy change, and ΔS the entropy change. All three changes are taken
here in the folding direction, so that a negative ΔG favors the folded state.
The enthalpy change, ΔH, corresponds to the
binding energy (dispersion forces, electrostatic interactions, van der Waals
potentials and hydrogen bonding), while hydrophobic interactions are described
by the entropy term, ΔS. Proteins become more stable with increasingly negative
values of ΔG. As the binding energy increases or the entropy difference between
the two states decreases, the folded protein becomes more stable.
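As a rough numerical illustration of the relation above, the following Python sketch evaluates the free energy change and the corresponding equilibrium constant. It assumes that all quantities refer to the folding direction (so a negative free energy change means a stable fold) and that K is the ratio of folded to unfolded protein. The enthalpy and entropy values are invented for illustration only and are not measurements for any real protein.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def folding_free_energy(delta_h, delta_s, temperature):
    """Return dG = dH - T*dS in J/mol (folding direction assumed)."""
    return delta_h - temperature * delta_s


def equilibrium_constant(delta_g, temperature):
    """Return K = exp(-dG / RT); here K is taken as [folded]/[unfolded]."""
    return math.exp(-delta_g / (R * temperature))


def fraction_folded(delta_g, temperature):
    """Fraction of molecules in the folded state for a two-state model."""
    k = equilibrium_constant(delta_g, temperature)
    return k / (1.0 + k)


# Illustrative, made-up values (not experimental data).
delta_h = -200_000.0  # J/mol, favorable binding energy
delta_s = -550.0      # J/(mol*K), entropy lost on folding
T = 310.0             # K, roughly body temperature

dg = folding_free_energy(delta_h, delta_s, T)
print(f"dG = {dg / 1000:.1f} kJ/mol")
print(f"fraction folded = {fraction_folded(dg, T):.5f}")
```

With these numbers a large favorable enthalpy is mostly cancelled by the entropy term, leaving a net free energy change of only a few tens of kJ/mol, which is the sense in which a folded protein is only marginally stable.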
The thermodynamics of protein stability can best be modeled by Energy Landscape
Theory, which describes the energy of a protein as a function of the
topological arrangement of its atoms. The landscape is a surface with a very
large number of coordinates and energy values, separated by mountains and
ridges. Each point on this surface describes the protein in a specific
conformation, and there is an energy landscape for each state of the protein.
Protein identification
Recent advances in protein methods have led to the
application of mass spectrometry to the identification of proteins by Peptide
Mass Fingerprinting (PMF). In this
process, target proteins are isolated by SDS gel electrophoresis and are
digested directly in the gel slice with trypsin, other proteases or cleaving
chemicals. The resulting peptides are
extracted and the unfractionated mixture is analyzed by MALDI-TOF mass
spectrometry. The masses of the
resulting peptides are used to query a database of protein and DNA sequences
for likely candidate proteins. This is
accomplished by matching the measured masses with masses predicted from the
databases after 'virtual' digestion with the proteinase or chemical cleavage
agent. The more peptide masses that
match the predicted masses, the more certain one is of the likelihood that the
protein is identified. Clearly,
high mass accuracy is required for this method to be of use. Sometimes no matches are found or the level
of certainty is too low. Working with
species whose genome has been completely sequenced is a benefit in experiments of this nature.
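The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular search engine. It assumes trypsin cleaves after lysine (K) or arginine (R) except before proline (P), allows no missed cleavages or modifications, and compares neutral monoisotopic peptide masses. The two-entry "database", the 0.05 Da tolerance, and the simulated 0.01 Da measurement error are all hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """'Virtual' digestion: cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def predicted_masses(sequence):
    return [peptide_mass(p) for p in tryptic_digest(sequence)]


def count_matches(observed, sequence, tolerance=0.05):
    """Number of observed masses within `tolerance` Da of a predicted mass."""
    predicted = predicted_masses(sequence)
    return sum(any(abs(m - p) <= tolerance for p in predicted) for m in observed)


# Hypothetical toy database; the sequences are invented.
database = {
    "toy_protein_1": "MKWVTFISLLR",
    "toy_protein_2": "GASPVTCLNDQK",
}

# Simulated measurement: peptides of toy_protein_1 with a 0.01 Da error.
observed = [m + 0.01 for m in predicted_masses(database["toy_protein_1"])]

scores = {name: count_matches(observed, seq) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The candidate with the most matching peptide masses is ranked first, which mirrors the statement that more matches give more confidence. Tightening `tolerance` shows why mass accuracy matters: a looser window lets unrelated peptides match by chance.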
Amounts of material required for the mass
fingerprinting approach are dependent upon the sample and its preparation. In our laboratory, with our validated
procedures, we are able to perform mass mapping/protein identification
experiments with as little as 2 pmol (loaded on the gel) of sample. However, more sample increases the chances of
a successful outcome. The success of any
experiment will depend upon reasonable sample estimates and very careful sample
handling. We recommend that you contact
us before planning any mass mapping experiments to help you estimate the amount
of protein and describe preferred handling techniques.
Several drawbacks to PMF need to be remembered when
planning your experiment. One is the
inability to specifically isolate the amino terminus of the protein. Therefore, amino acid sequence information
about this region of the molecule is often difficult to obtain using mass
spectrometry. Second, due to mass
similarities of amino acids it is often difficult to correctly identify amino
acids (especially during MS/MS or de novo experiments). For example, leucine and isoleucine are
isobaric and cannot be differentiated by PMF or MS/MS. Other amino acids (for example, glutamine and
lysine) have relatively small mass differences and can be misidentified by
these methods.
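To put numbers on these mass similarities, the short Python sketch below compares monoisotopic residue masses and asks whether a given difference exceeds an instrument tolerance expressed in parts per million (ppm). The residue masses are standard values; the 1500 Da peptide mass and the ppm tolerances are illustrative assumptions, not specifications of any instrument.

```python
# Monoisotopic residue masses in Da (standard values).
MONOISOTOPIC = {
    "L": 113.08406,  # leucine
    "I": 113.08406,  # isoleucine (same elemental composition as leucine)
    "Q": 128.05858,  # glutamine
    "K": 128.09496,  # lysine
}


def mass_difference(a, b):
    """Absolute mass difference in Da between two residues."""
    return abs(MONOISOTOPIC[a] - MONOISOTOPIC[b])


def resolvable(a, b, peptide_mass, ppm_tolerance):
    """True if swapping residue a for b shifts the peptide mass by more
    than the measurement tolerance at that peptide mass."""
    tolerance_da = peptide_mass * ppm_tolerance / 1e6
    return mass_difference(a, b) > tolerance_da


print(mass_difference("L", "I"))            # identical masses
print(round(mass_difference("Q", "K"), 5))  # small but non-zero
print(resolvable("Q", "K", 1500.0, 100))    # coarse tolerance
print(resolvable("Q", "K", 1500.0, 10))     # tighter tolerance
```

Leucine and isoleucine differ by exactly zero, so no mass tolerance can separate them. Glutamine and lysine differ by about 0.036 Da, which is lost within a 100 ppm window on a 1500 Da peptide but is distinguishable at 10 ppm.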
Correct identification of a recombinant protein (and
sequence) can often be accomplished using mass spec and PMF alone, but
sometimes direct sequencing by Edman techniques is required. Therefore, it is our preferred approach to
combine the complementary techniques of mass spectrometric mapping and chemical
protein sequencing (Edman degradation) to identify proteins. This is especially true when you are working
in 'non-genome' species.
A detailed note can be obtained from the site below.
Structure and function determination
Refer to the following site.
Structure comparison methods
Refer to the following site.
Prediction of secondary structure from sequence
Expression presently supports nine different secondary structure
prediction algorithms. All computations are performed via the Network Protein
Sequence Analysis (NPS@) server (PBIL, France).
DPM
The DPM (Double Prediction Method) algorithm uses two approaches to
produce the final result - first it predicts the protein structural class and
then the secondary structure for the sequence (Deleage and Roux, 1987). The DPM
method can be divided into four steps:
- Prediction of the structural class of a protein from AA composition
  (Nakashima et al., 1986).
- Preliminary secondary structure estimation from a simple algorithm.
- Comparison between the two independent predictions.
- Optimisation of parameters and determination of secondary structure.
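The input to the first step, the amino acid composition, is easy to compute. The Python sketch below shows only that feature vector. It does not reproduce the class assignment itself, whose parameters come from the cited work and are not given here. The function name and the handling of non-standard letters are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Characters outside the 20 standard one-letter codes are ignored
    (an assumption of this sketch).
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


comp = composition("AAGK")
print(comp["A"], comp["G"], comp["K"])
```

A structural class predictor would take this 20-component vector as input. Any classifier plugged in after it is outside what this sketch covers.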
DSC
DSC (Discrimination of protein Secondary structure Class) is based on
decomposing secondary structure prediction into its basic concepts and then
using simple, linear statistical methods to combine those concepts for prediction
(King and Sternberg, 1996). This makes the prediction method comprehensible and
allows the relative importance of the different sources of information used to
be measured.
At NPS@, a BLASTP search of your sequence is performed against the
SWISS-PROT database. These results are filtered and then aligned by CLUSTALW.
The resulting alignment is the input for DSC.
GORIV
GOR IV is the fourth version of the GOR secondary structure prediction
methods, which are based on information theory (Garnier et al., 1996). There is no
defined decision constant. GOR IV uses all possible pair frequencies within a
window of 17 amino acid residues. After cross-validation on a database of 267
proteins, version IV of GOR has a mean accuracy of 64.4% for a three-state
prediction (Q3).
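Q3 is the percentage of residues whose predicted state agrees with the observed state when every residue is assigned to one of three classes. The Python sketch below computes it for a pair of equal-length state strings. The letters H (helix), E (strand) and C (coil) are a common convention assumed here, and the example strings are invented.

```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where the predicted state equals the
    observed state. Both arguments are strings over a three-letter
    alphabet such as H, E, C."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented example: one of ten residues is mispredicted.
predicted = "CHHHHEEECC"
observed = "CHHHHHEECC"
print(q3_accuracy(predicted, observed))  # 90.0
```

A mean accuracy over a test set, as quoted for GOR IV, would be obtained by averaging this value over all proteins in the cross-validation database.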
HNN
The HNN (Hierarchical Neural Network) prediction method can be seen as
an improvement on the famous classifier developed by Qian and Sejnowski, and
derived from the system NETtalk (Guermeur). Like its predecessor, it is made up
of two networks: a sequence-to-structure network and a structure-to-structure
network. The prediction is thus based only on local information. The
improvements mainly deal with two points:
- Technical tricks (recurrent connections, shared weights etc.) have been
  used to increase the context on which the prediction is made and
  concomitantly decrease by two orders of magnitude the number of parameters
  (weights).
- Physico-chemical data have been explicitly incorporated in the predictors
  used by the structure-to-structure network.
These modifications have significantly reduced the generalization error.
MLRC
MLRC (Multivariate Linear Regression Combination) is a secondary
structure prediction method which combines GOR4, SIMPA96 and SOPMA (Guermeur et
al., 1999). It post-processes the outputs of protein secondary structure
prediction methods and generates class posterior probability estimates.
Experimental results establish that it can increase the recognition rate of
methods that provide inhomogeneous scores, even if their individual prediction
success rates differ widely.
Note: The MLRC algorithm may take several minutes to compute larger
sequences (>500 amino acids).
PHD
PHD is a set of neural network systems (a sequence-to-structure level and a
structure-to-structure level) that predict secondary structure (PHDsec), relative
solvent accessibility (PHDacc) and transmembrane helices (PHDhtm) (Rost and
Sander, 1993). The NPS@ server only uses PHDsec. PHDsec focuses on predicting
hydrogen bonds. The procedure essentially involves executing a BLASTP search of
your sequence, filtering these results and aligning them with CLUSTALW, then
using the multiple alignment as the input of the neural network. The PHD
prediction made with NPS@ is better than a PHD prediction from the single
sequence alone. However, it is not exactly the same as the PredictProtein
prediction and could be slightly less accurate.
Note: The PHD algorithm may take several minutes to compute larger
sequences (>500 amino acids).
Predator
PREDATOR is a secondary structure prediction method based on recognition
of potentially hydrogen-bonded residues in a single amino acid sequence
(Frishman and Argos, 1996).
SIMPA96
SIMPA96 is a nearest neighbor secondary structure prediction method
(Levin, 1997). It's based on the homologue method described by Levin et al.
(1986).
SOPMA
SOPMA (Self-Optimized Prediction Method with Alignment) is based on the
homologue method of Levin et al. (1986). The improvement lies in the
fact that SOPMA takes into account information from an alignment of sequences
belonging to the same family (Geourjon and Deleage, 1995).


