Introduction and principles
Today, genome sequencing projects generate, at very high throughput, a huge amount of information in the form of nucleotide sequences. These sequences are stored in dedicated databases. This information has to be analysed in order to complete its annotation. Annotation means the description of the coding sequences, of the precursor elements, of the positions of the DNA recognition sites, etc. In addition to this structural information, functional data have to be annotated, such as the expression levels of the genes, their organ specificity, their function, etc. The quality of this information is crucial for the study of the expression products of the genes, i.e. the proteins.
Proteomics covers the study of the proteins expressed by a genome in a biological sample, such as an organism, an organ, an organelle or a biological fluid [1, 2]. This involves the identification of the proteins contained in this sample. The complete sequences of active proteins can only be partially deduced from their corresponding gene sequence. In fact, during and after the transcription and translation processes leading to an active gene product, alterations often occur, such as alternative splicing, N-terminal truncation and post-translational modification (PTM) 1). Therefore, proteomics also involves the description of the events generating the modifications these proteins carry as functional entities.
1) Post-translational modifications affect protein sequences between the time they are produced by the ribosomes and the time they become biologically active, that is, after translation. The amino acid sequence obtained by translating the coding sequence of the corresponding gene can be modified in different ways. Besides alternative splicing and N-terminal truncation, some amino acids of a protein can be modified by the addition of chemical entities. Some of these modifications are called homogeneous, as the added functional group has a discrete and unique mass. Phosphorylation, for instance, is the addition of a phosphate group to the side chain of an amino acid, typically serine or tyrosine. Others are methylation, sulfation, farnesylation, and so on. Other modifications are called heterogeneous, as the chemical group added can be made of chains of subgroups and can therefore have very variable masses. These are glycosylation and lipidation. There are currently more than 200 different PTMs described in the literature and annotated in the SWISS-PROT protein database (http://www.expasy.org/sprot/). PTMs are functionally very important: in a number of cases they can trigger the activity of a protein, help a protein bind to another one in a complex, or be of crucial importance in the folding process of a protein. This is particularly the case for disulfide bonds, which covalently link the sulfur atoms of two cysteine residues.
Bioinformatics - From Genomes to Drugs. Volume II: Applications. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2
In addition, the function and the activity of a protein depend on its concentration and, more generally, on its environment. To be active, a protein might have to interact with other proteins, with non-peptidic cofactors or with substrate molecules, at a certain pH; it might have to be transported from one cell compartment to another; its PTMs can be dynamically controlled as a function of the cell state, etc.
Proteomics has to deal with much higher complexity than genomics. Proteins occur in very variable concentrations: the dynamic range between transcription factors and albumin can be at least 10^12. Single genes can be expressed in more than 20 protein forms in a single tissue: as an example, at least 22 different protein forms matching alpha-1-antitrypsin have been described in human plasma. The physico-chemical properties of proteins are extremely variable. Their molecular masses range from a few thousand to over a million Daltons. Solubility and isoelectric point 2) (pI) vary widely among proteins, and their load of PTMs can be very heterogeneous. Therefore, the analysis and the understanding of the suspected half a million to one million proteins in human, expressed by a number of genes that is estimated to be around 40000, represent a real challenge, both from the technological and the bioinformatics point of view. Methods involving high-resolution protein separation, parallelization of sample preparation, automation of experimental processes and of database comparison, as well as powerful and specific visualization tools need to be developed and integrated.
2) The protein isoelectric point (pI) is the pH at which the sum of the negative charges of the acidic amino acids and the positive charges of the basic amino acids equals zero. It ranges from less than 3 to more than 12.
The functions of the identified and then characterized proteins also have to be determined. Proteins often interact with each other in protein complexes, or with other important chemical entities such as DNA in protein-DNA interaction systems. Their activities also involve interaction with smaller molecules such as cofactors or hormones. Therefore a number of methodologies have to be adapted to understand the behavior of proteins in their biological environment and the implications of a molecular dysfunction in a disease. The development of new diagnostic tests or the discovery of new drug therapies in the future will depend on the capacity to analyze complex systems. Proteomics offers the possibility to observe the global influence of a drug in a complex mixture of proteins. This will be possible only if one can describe the identity, the occurrence and the interaction of each individual component of this mixture.
Figure 4.1. Schematic representation of a proteomics analysis pathway.
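The isoelectric point defined in footnote 2 can be estimated from an amino acid sequence alone. The following is a minimal sketch, not a reference implementation: it assumes independent ionisable groups described by the Henderson-Hasselbalch equation, and the pKa values are illustrative placeholders (published pKa sets differ and give slightly different results). The function names are hypothetical.

```python
# Illustrative pKa values (an assumption; several published sets exist).
PKA_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (independent-site model)."""
    counts_pos = {"Nterm": 1, **{aa: sequence.count(aa) for aa in "KRH"}}
    counts_neg = {"Cterm": 1, **{aa: sequence.count(aa) for aa in "DECY"}}
    charge = 0.0
    # Basic groups carry +1 when protonated.
    for group, n in counts_pos.items():
        charge += n / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    # Acidic groups carry -1 when deprotonated.
    for group, n in counts_neg.items():
        charge -= n / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """pH at which the net charge is zero, found by bisection.

    The net charge decreases monotonically with pH, so a single
    sign change exists between pH 0 and pH 14.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A lysine-rich peptide gives a high pI and an aspartate-rich peptide a low one, which is the property exploited by the first dimension of 2-DE. The model ignores PTMs and neighbouring-residue effects, so it is only a rough guide for real proteins.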
The analysis of a proteome, described as the ensemble of the proteins expressed by a genome in a given tissue, for a given organism at a given time, requires using and combining a number of procedures, both experimental (wet-lab experiments) and bioinformatics (dry-lab experiments). Due to the chemical and physical complexity of proteomes, various methodological approaches have to be considered. Nevertheless, a consensus principle of proteome analysis can be described as in Figure 4.1. This linear pathway includes most of the wet- and dry-lab steps required for the complete analysis of a proteome.
- First, the sample has to be chosen in an adequate manner. It can be a raw biological fluid, a cell extract, a pre-fractionated sample, etc. The choice of the sample is crucial, as it is strongly dependent on the type of separation to be applied. The sample has to be compatible with the complexity, both in number of components and in dynamic range, that the separation can handle. Methodological aspects are discussed in Section 4.2 of this Chapter.
- The proteins contained in this sample have to be separated. In proteomics, one-dimensional electrophoresis (1-DE) and, mainly, two-dimensional electrophoresis (2-DE) techniques are preferred. The technique will be described in Section 4.2 [3, 4].
- The next step is the analysis of the separation and the choice of the proteins to be identified. The separation result is visualised using image analysis software. This software allows gel images to be visualized and compared, and supports a number of comparative analyses that track statistically significant changes in protein expression between populations of gels/samples. It helps the user to choose protein spots of interest. The computer analysis of gel images is described in more detail in Section 4.3.
- To proceed further, the selected proteins have to be submitted to a post-separation analysis. This step requires experimental determination of highly specific protein attributes, such as amino acid composition, sequence information, or peptide mass fingerprints following endoproteolytic cleavage. In endoproteolytic cleavage the proteins are typically incubated with an enzyme that recognizes particular amino acids and cuts the polypeptide chains specifically at those positions. The reaction produces shorter peptides that are fragments of the so-called digested proteins. Today's standard procedure involves a protein digestion step with trypsin and the analysis of the resulting fragments with a crucial tool in proteomics, mass spectrometry, and more precisely Matrix Assisted Laser Desorption Ionization (MALDI) mass spectrometry or Electrospray Ionization (ESI) mass spectrometry 3).
- These protein attributes are then submitted to a database search. This search identifies a protein by looking at the best match between experimental data and data obtained by in-silico processing and "digestion" of a protein sequence database. The identification and characterization procedures using bioinformatics tools will be the topic of Section 4.4.
- Comprehensive sequence databases are a prerequisite for successful protein identification, and the identified protein data is in turn used to populate specific secondary databases. A short description of proteomics databases will be the topic of Section 4.5.
3) MALDI and ESI refer to two different ways of ionising molecules in a mass spectrometer. In a MALDI-TOF mass spectrometer (MS), the analyte is mixed with a solution of, usually, an aromatic organic acid (a matrix solution). The mixture is dried on a metal plate. A laser is fired at the crystals formed, and its energy, which is thought to be transferred to the analyte through the matrix, ionises the target material. The ions are accelerated by an electric field in a vacuum tube. The mass-over-charge value of the ions is recorded by measuring the time of flight (TOF) the ions need to reach the detector. In an ESI MS, the analyte reaches the instrument in solution through a capillary. At the end of the capillary, the droplets are dried and ionised under an electric field. The ions are accelerated into an analyzer, which can be a TOF, a quadrupole or an ion trap. A quadrupole is a system that deflects the ions under a perpendicular electric field onto an ionizing detector. The deflection is a function of the mass over charge of the ions and of the strength of the field. An ion trap keeps the generated ions in a small volume under a well-controlled electromagnetic field. They are then released specifically onto a detector.
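The endoproteolytic cleavage and database search steps of the pathway above can be illustrated with a small in-silico example. The sketch below assumes the commonly used trypsin rule (cleavage after lysine or arginine, except when followed by proline), no missed cleavages, no modifications, and neutral monoisotopic peptide masses; real search tools relax all of these. The function names, the tolerance and the scoring by simple match count are hypothetical choices made for illustration, not the method of any particular identification program.

```python
import re

# Monoisotopic residue masses in Daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (free N- and C-terminus)


def tryptic_digest(sequence):
    """Split after K or R, but not when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(experimental_masses, sequence, tol=0.01):
    """Number of experimental masses explained by a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(m - t) <= tol for t in theoretical)
        for m in experimental_masses
    )


def identify(experimental_masses, database, tol=0.01):
    """Rank database entries (name -> sequence) by matched masses."""
    scores = [
        (name, count_matches(experimental_masses, seq, tol))
        for name, seq in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Ranking by raw match count is the simplest possible score. It favours long proteins, which produce many peptides, so practical tools weight the matches statistically and take additional attributes such as pI and molecular weight into account.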
This linear representation of a proteome analysis pathway lays out and orders the different steps required to identify proteins from a crude biological sample. This pathway can be applied to generate a systematic description of a proteome. The result of such an analysis often takes the concrete form of an annotated database. Databases like SWISS-2DPAGE and other federated 2-DE databases are typically populated with the results of the systematic identification of all protein spots observed on a 2-DE image. Among all combinations of methods currently available, one is widely adopted as a gold standard: it combines the very high separation power of 2-DE with the very high sensitivity and accuracy of mass spectrometry. However, while it is agreed that most proteins can be separated with this technique, this approach does not allow all proteins present in a biological sample to be identified and characterized. Some proteins will not appear on the gel because of their extreme physico-chemical properties, such as exceedingly high or low molecular weight, exceedingly high pI, extreme hydrophobicity or insufficient abundance. Other techniques should therefore be applied to complete the picture, such as one-dimensional electrophoresis, pre-treatment of the starting material, pre-fractionation, early digestion of the protein mixture followed by RP-HPLC, and others. Some of these methods are described in .
Proteome analysis is, however, not limited to the systematic identification of all proteins present in a biological sample. The described pathway allows the creation of databases that can be searched or compared with other real samples. In this case, the annotated 2-DE image is compared with a non-annotated image in order to correlate the positions and intensities of protein spots. From all the spots in the studied image, only those of particular interest are then chosen and further analysed with identification methods. A very widely used method to search for biological markers of specific diseases is the comparison of a statistically significant number of 2-DE images from samples of healthy and diseased patients, or from samples treated or not treated with target drugs. The images are compared and clustered, and a number of protein spots appear to be differently expressed. These become the spots of interest to be further identified. In this approach most of the effort is concentrated on the generation of the 2-DE gels and on the comparison of the generated images. Section 4.2 will discuss these aspects with an application in diabetes. Section 4.3 will describe how such an analysis can be performed using dedicated software.
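The comparison of spot intensities between two populations of gels can be sketched as follows. This is an illustration only: it assumes that spots have already been detected, matched across gels and normalised, that each group contains at least two gels with non-identical positive intensities per spot, and it uses Welch's t statistic with hypothetical cut-offs instead of a proper p-value with multiple-testing correction. It does not describe the algorithm of any specific image analysis package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    return (mean(a) - mean(b)) / sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def candidate_spots(healthy, diseased, t_threshold=3.0, min_fold=1.5):
    """Return spots whose intensity differs between the two groups.

    healthy, diseased: dict mapping a spot identifier to the list of
    normalised intensities measured for that spot on each gel.
    The thresholds are hypothetical defaults chosen for illustration.
    """
    hits = []
    for spot in healthy.keys() & diseased.keys():
        h, d = healthy[spot], diseased[spot]
        t = welch_t(d, h)
        fold = mean(d) / mean(h)
        changed = fold >= min_fold or fold <= 1.0 / min_fold
        if abs(t) >= t_threshold and changed:
            hits.append((spot, fold, t))
    # Strongest evidence first.
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)
```

Requiring both a minimum fold change and a minimum t statistic is a design choice: it discards spots that change consistently but negligibly, as well as spots that change a lot on average but vary too much from gel to gel.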
In addition, beyond the identification efforts, it can be of the highest interest to describe and understand the modifications carried by the active gene products. This implies the search for splicing variants, amino acid mutations or PTMs. Mass spectrometry is very often used in a so-called MS/MS mode 4) to address these questions [5, 8]. The interpretation of the spectra generated by this process reveals structural information on the amino acid sequence and describes the PTMs attached to the studied peptides. Besides mass spectrometry, other methods can highlight the presence of specific PTMs in proteins, such as immunological staining (also called immunoblotting). In this latter approach, antibodies specific to particular PTMs are used to visualize the presence of PTMs in proteins separated by electrophoresis methods (see also below, in Section 4.2).
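A much simpler idea than full MS/MS interpretation already shows how mass measurements point to PTMs: the difference between an observed peptide mass and the mass predicted from the unmodified sequence is compared with the known mass shifts of homogeneous modifications. The sketch below assumes a single modification per peptide and neutral monoisotopic masses. The table is limited to a few of the modifications mentioned in this chapter, and the function name and tolerance are hypothetical.

```python
# Monoisotopic mass shifts in Daltons for a few homogeneous PTMs.
PTM_SHIFTS = {
    "phosphorylation": 79.96633,   # +HPO3
    "sulfation": 79.95682,         # +SO3
    "methylation": 14.01565,       # +CH2
    "farnesylation": 204.18780,    # +C15H24
}


def explain_mass_shift(observed_mass, theoretical_mass, tol=0.005):
    """List the PTMs whose mass shift matches the observed difference."""
    delta = observed_mass - theoretical_mass
    return sorted(
        name
        for name, shift in PTM_SHIFTS.items()
        if abs(delta - shift) <= tol
    )
```

Phosphorylation and sulfation differ by less than 0.01 Da, so with a tolerance of 0.01 Da a shift of 79.96 Da is compatible with both, while a tighter tolerance can separate them. This illustrates why the mass accuracy of the instrument matters for PTM characterization.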
As the number of proteins to be identified in a single sample can reach a few hundred to a few thousand, it becomes necessary to automate the whole identification process. The goal of automation is to yield a higher throughput as well as a better reproducibility of the experimental procedure. This requires developing sample tracking procedures specific to proteomics, creating new bioinformatics tools to pipeline the data processing, and implementing a data management system that can store the huge amount of information generated and be queried for later analyses. Section 4.6 will cover these automation aspects, together with the presentation of an alternative to the standard robotised approach of protein identification, known as a molecular scanner [9, 10, 11].