UNIT IV: PROTEOMICS I

Introduction to proteins


Proteins are polypeptides, which are made up of many amino acids linked together as a linear chain. The structure of an amino acid contains an amino group, a carboxyl group, and an R group, which is usually carbon based and gives the amino acid its specific properties. These properties determine the interactions between atoms and molecules, which are: van der Waals forces between temporary dipoles, ionic interactions between charged groups, and attractions between polar groups.



Proteins form the very basis of life. They regulate a variety of activities in all known organisms, from replication of the genetic code to transporting oxygen, and are generally responsible for regulating the cellular machinery and determining the phenotype of an organism. Proteins accomplish their tasks in the body through three-dimensional interactions, arising from their tertiary and quaternary structure, with various substrates. The functional properties depend upon the protein's three-dimensional structure. The three-dimensional (3D) structure arises because particular sequences of amino acids in a polypeptide chain fold to generate, from linear chains, compact domains with specific structures. The folded domains either serve as modules for larger assemblies or provide specific catalytic or binding sites.


Protein Stability

As odd as it may seem, native (folded) proteins are only marginally stable under physiological conditions. Forces such as hydrophobic effects, electrostatic interactions, and hydrogen bonding act as stabilizing factors and are the main factors driving the protein folding process.
Hydrophobic effects cause nonpolar substances to minimize their contact with water, and they are the major determinant of native protein structure. The aggregation of the nonpolar side chains in the interior of a protein is favored by the increase in entropy of the water molecules that would otherwise form ordered cages around the hydrophobic groups.

In the interior of the protein, where the atoms are closely packed, van der Waals forces contribute to stability. These forces are relatively weak and act only over short distances, so they are lost when the protein is unfolded.
Hydrogen bonding is important because, if a protein folded in a way that prevented a hydrogen bond from forming, the stabilizing energy of that hydrogen bond would be lost.
The residues that occupy the interior of the protein largely direct protein folding. A denatured protein can return to its native conformation within a few seconds, which supports the idea that proteins follow directed pathways to arrive at the native state. As the protein folds, its free energy decreases, which makes folding an essentially one-way process. The last stages of protein folding depend on the specific sequence of the amino acids.

Thermodynamics

Entropy - The disorder of matter. Entropies of more complex molecules are larger than those of simpler molecules, especially in a series of closely related compounds.

Second Law of Thermodynamics - The total entropy of the universe is continually increasing.
Protein stability depends on the free energy change between the folded and unfolded states, which is expressed by
-RT ln K = ΔG = ΔH - TΔS
where R represents the gas constant, T the absolute temperature, K the equilibrium constant, ΔG the free energy change between the folded and unfolded states, ΔH the enthalpy change and ΔS the entropy change between the two states. The enthalpy change, ΔH, corresponds to the binding energy (dispersion forces, electrostatic interactions, van der Waals potentials and hydrogen bonding), while hydrophobic interactions are described by the entropy term, ΔS. Proteins become more stable with increasingly negative values of ΔG. As the binding energy increases or the entropy difference between the two states decreases, the folded protein becomes more stable.
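
The relation above can be tried out numerically. The following is a minimal sketch, not a measured case: it assumes the folding direction (unfolded to folded, so K = [folded]/[unfolded]), and the enthalpy and entropy values are hypothetical numbers chosen only to illustrate a marginally stable protein. The function names are illustrative as well.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def delta_g(delta_h, delta_s, temperature):
    """Free energy change in J/mol from dG = dH - T*dS."""
    return delta_h - temperature * delta_s


def equilibrium_constant(dg, temperature):
    """Equilibrium constant K from dG = -RT ln K."""
    return math.exp(-dg / (R * temperature))


def fraction_folded(k):
    """Fraction of molecules folded for a two-state equilibrium, K = [folded]/[unfolded]."""
    return k / (1.0 + k)


# Hypothetical values for folding (unfolded -> folded), for illustration only.
dH = -200_000.0  # J/mol, favourable binding energy
dS = -600.0      # J/(mol*K), the chain loses conformational entropy on folding
T = 298.15       # K

dG = delta_g(dH, dS, T)          # about -21.1 kJ/mol
K = equilibrium_constant(dG, T)  # much greater than 1, so the folded state dominates
print(f"dG = {dG / 1000:.1f} kJ/mol, K = {K:.3g}, folded = {fraction_folded(K):.4f}")
```

Note how two large, opposing terms (ΔH and TΔS) leave only a small net ΔG, which is one way to picture why native proteins are described as marginally stable.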

The thermodynamics of protein stability can best be modeled by the Energy Landscape Theory, in which the energy of a protein is a function of the topological arrangement of its atoms. The landscape is a surface with a very large number of coordinates and energy values, separated by mountains and ridges. Each point on this surface describes the protein in a specific conformation, and there is an energy landscape for each state of the protein.


Protein identification

Recent advances in protein methods have led to the application of mass spectrometry to the identification of proteins by Peptide Mass Fingerprinting (PMF).  In this process, target proteins are isolated by SDS gel electrophoresis and are digested directly in the gel slice with trypsin, other proteases or cleaving chemicals.  The resulting peptides are extracted and the unfractionated mixture is analyzed by MALDI-TOF mass spectrometry.  The masses of the resulting peptides are used to query a database of protein and DNA sequences for likely candidate proteins.  This is accomplished by matching the measured masses with masses predicted from the databases after 'virtual' digestion with the proteinase or chemical cleavage agent.  The more peptide masses that match the predicted masses, the more confident one can be that the protein has been identified.  Clearly, high mass accuracy is required for this method to be of use.  Sometimes no matches are found or the level of certainty is too low.  Working with a species whose genome sequence is complete is a benefit in experiments of this nature.
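
The 'virtual' digestion and mass matching described above can be sketched in a few lines. This is an illustrative simplification rather than the procedure of any particular search engine: it assumes the common trypsin rule (cleave after K or R, but not before P), allows no missed cleavages or modifications, compares neutral monoisotopic masses (a MALDI spectrum would show [M+H]+ ions), and scores candidates simply by counting matches. The two database sequences and the observed masses are hypothetical.

```python
import re

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tol_ppm=50.0):
    """Number of observed masses that match a predicted peptide within tol_ppm."""
    predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(m - p) / p * 1e6 <= tol_ppm for p in predicted)
        for m in observed
    )


def rank_candidates(observed, database, tol_ppm=50.0):
    """Candidate proteins sorted by number of matching peptide masses."""
    scores = {name: count_matches(observed, seq, tol_ppm)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-entry database and hypothetical measured masses.
database = {"protA": "AGKPLRSTKMFWR", "protB": "GGGKAAARVVVK"}
observed = [640.402, 638.300]
ranking = rank_candidates(observed, database)
print(ranking)
```

Tightening `tol_ppm` reduces chance matches, which is the sense in which high mass accuracy makes the identification more certain.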
Amounts of material required for the mass fingerprinting approach are dependent upon the sample and its preparation.  In our laboratory, with our validated procedures, we are able to perform mass mapping/protein identification experiments with as little as 2 pmol (loaded on the gel) of sample.  However, more sample increases the chances of a successful outcome.  The success of any experiment will depend upon reasonable sample estimates and very careful sample handling.  We recommend that you contact us before planning any mass mapping experiments to help you estimate the amount of protein and describe preferred handling techniques.

Several drawbacks to PMF need to be remembered when planning your experiment.  One is the inability to specifically isolate the amino terminus of the protein.  Therefore, amino acid sequence information about this region of the molecule is often difficult to obtain using mass spectrometry.  Second, due to mass similarities of amino acids, it is often difficult to correctly identify amino acids (especially during MS/MS or de novo experiments).  For example, leucine and isoleucine are isobaric and cannot be differentiated by PMF or MS/MS.  Other amino acids (for example, glutamine and lysine) have relatively small mass differences and can be misidentified by these methods.
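
To put numbers on the mass similarities mentioned above, the short sketch below compares monoisotopic residue masses. The 1500 Da peptide mass is a hypothetical example value, and the function name is illustrative.

```python
# Monoisotopic residue masses in Da for the residues discussed above.
MONO = {"L": 113.08406, "I": 113.08406, "Q": 128.05858, "K": 128.09496}


def ppm_needed(residue_a, residue_b, peptide_mass):
    """Mass accuracy (ppm) needed to see a single residue_a/residue_b swap
    in a peptide of roughly peptide_mass Da."""
    return abs(MONO[residue_a] - MONO[residue_b]) / peptide_mass * 1e6


leu_ile_diff = abs(MONO["L"] - MONO["I"])  # identical masses: no accuracy can separate them
gln_lys_diff = abs(MONO["Q"] - MONO["K"])  # roughly 0.036 Da
ppm = ppm_needed("Q", "K", 1500.0)         # hypothetical 1500 Da peptide

print(f"Leu/Ile difference: {leu_ile_diff:.5f} Da")
print(f"Gln/Lys difference: {gln_lys_diff:.5f} Da, about {ppm:.0f} ppm at 1500 Da")
```

A leucine/isoleucine swap leaves the mass unchanged, whereas a glutamine/lysine swap is visible only if the instrument's mass accuracy is better than the computed ppm value.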

Correct identification of a recombinant protein (and sequence) can often be accomplished using mass spec and PMF alone, but sometimes direct sequencing by Edman techniques is required.  Therefore, it is our preferred approach to combine the complementary techniques of mass spectrometric mapping and chemical protein sequencing (Edman degradation) to identify proteins.  This is especially true when you are working in 'non-genome' species.

A detailed note can be obtained from the site below.



Structure and function determination

Refer to the following site.




Structure comparison methods

Refer to the following site.






Prediction of secondary structure from sequence


Expression presently supports nine different secondary structure prediction algorithms. All computations are performed via the Network Protein Sequence Analysis server (PBIL, France).
DPM
The DPM (Double Prediction Method) algorithm uses two approaches to produce the final result - first it predicts the protein structural class and then the secondary structure for the sequence (Deleage and Roux, 1987). The DPM method can be divided into four steps:
  • Prediction of the structural class of a protein from AA composition (Nakashima et al., 1986).
  • Preliminary secondary structure estimation from a simple algorithm.
  • Comparison between the two independent predictions.
  • Optimisation of parameters and determination of secondary structure.
DSC
DSC (Discrimination of protein Secondary structure Class) is based on dividing secondary structure prediction into its basic concepts and then using simple, linear statistical methods to combine the concepts for prediction (King and Sternberg, 1996). This makes the prediction method comprehensible and allows the relative importance of the different sources of information used to be measured.
At NPS@, a BLASTP search of your sequence is performed against the SWISS-PROT database. These results are filtered and then aligned by CLUSTALW. The resulting alignment is the input for DSC.
GOR IV
GOR IV is the fourth version of the GOR secondary structure prediction methods, which are based on information theory (Garnier et al., 1996). There is no defined decision constant. GOR IV uses all possible pair frequencies within a window of 17 amino acid residues. After cross-validation on a database of 267 proteins, version IV of GOR has a mean accuracy of 64.4% for a three-state prediction (Q3).
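
For reference, Q3 is the percentage of residues whose predicted state (helix, strand or coil) agrees with the observed state. The sketch below shows the calculation; the one-letter state codes (H, E, C) are a common convention assumed here, and the two strings are made-up examples rather than real predictions.

```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where the predicted three-state label
    matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example: H = helix, E = strand, C = coil.
observed = "HHHHEEEECC"
predicted = "HHHCEEECCC"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.1f}%")  # 8 of 10 residues agree
```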
HNN
The HNN (Hierarchical Neural Network) prediction method can be seen as an improvement on the famous classifier developed by Qian and Sejnowski, and derived from the system NETtalk (Guermeur). Like its predecessor, it is made up of two networks: a sequence-to-structure network and a structure-to-structure network. The prediction is thus based only on local information. The improvements mainly deal with two points:
  • Technical tricks (recurrent connections, shared weights etc.) have been used to increase the context on which the prediction is made and concomitantly decrease by two orders of magnitude the number of parameters (weights).
  • Physico-chemical data have been explicitly incorporated in the predictors used by the structure-to-structure network.
These modifications have significantly reduced the generalization error.
MLRC
MLRC (Multivariate Linear Regression Combination) is a secondary structure prediction method which combines GOR4, SIMPA96 and SOPMA (Guermeur et al., 1999). It post-processes the outputs of protein secondary structure prediction methods and generates class posterior probability estimates. Experimental results establish that it can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are largely different.
Note: The MLRC algorithm may take several minutes to compute larger sequences (>500 amino acids).
PHD
PHD is a set of neural network systems (a sequence-to-structure level and a structure-to-structure level) that predict secondary structure (PHDsec), relative solvent accessibility (PHDacc) and transmembrane helices (PHDhtm) (Rost and Sander, 1993). The NPS@ server only uses PHDsec. PHDsec focuses on predicting hydrogen bonds. The procedure essentially involves executing a BLASTP search of your sequence, filtering these results and aligning them with CLUSTALW, then using the multiple alignment as the input of the neural network. The PHD prediction done with NPS@ is better than a PHD prediction on the single sequence. However, it is not exactly the same as the PredictProtein prediction and could be slightly less accurate.
Note: The PHD algorithm may take several minutes to compute larger sequences (>500 amino acids).
PREDATOR
PREDATOR is a secondary structure prediction method based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence (Frishman and Argos, 1996).
SIMPA96
SIMPA96 is a nearest neighbor secondary structure prediction method (Levin, 1997). It's based on the homologue method described by Levin et al. (1986).
SOPMA
SOPMA (Self-Optimized Prediction Method with Alignment) is based on the homologue method of Levin et al. (1986). The improvement lies in the fact that SOPMA takes into account information from an alignment of sequences belonging to the same family (Geourjon and Deleage, 1995).