Protein sequence databases

Protein sequence databases are the most comprehensive source of protein information. They can be divided into universal databases, which store protein information from all types of biological sources, and specialised databases, which concentrate on restricted groups of protein families or organisms. Universal protein sequence databases can in turn be divided into simple repositories of sequence data, mostly translated from DNA sequences, and annotated databases. The latter require the assistance of biologists (''annotators'') who screen the original literature, review articles and electronic archives.

Here we will mainly describe SWISS-PROT, an annotated universal sequence database, and TrEMBL, an automatically generated sequence database with repository character, which supplements SWISS-PROT.

4.5.2.1 SWISS-PROT, TrEMBL and their integration in proteomics

SWISS-PROT [25, http://www.expasy.org/sprot/] is a curated protein sequence database which provides a high level of annotation, such as the description of a protein's function, its domain structure, PTMs, variants, and conflicts between literature references. It also provides a minimal level of redundancy, a high level of integration with other biomolecular databases, and extensive external documentation. Created in 1986, it now includes more than 100 000 entries from more than 7000 different species. SWISS-PROT is available from more than 100 WWW servers world-wide; its main host is the ExPASy server.

Each entry consists of different line types (Figure 4.10). The first section is the description part. It includes, among others, the accession number (a unique entry identifier), the update dates, the protein description (its ''name''), the gene name, and the taxonomic origin. This header is followed by a reference section. Each reference reports the type of experimental work contributing to the entry (sequencing, 3-D structure determination, mutagenesis studies, etc.), the author list, and the literature citation. Entries are not populated from the literature alone: a continuously increasing number of sequences are submitted directly to the database and published this way. The next section is the comment section, which lists a variety of textual comments classified into topics such as function, subunit, similarity, PTM, MS source, etc. It is followed by the database cross-reference section, which provides active links to other biological databases. In this way, SWISS-PROT is currently linked to more than 30 different databases. After a keyword field, the next section is the so-called feature table. It describes regions or sites of interest in the sequence and documents PTMs, binding sites, active sites, secondary structures, variants, conflicts, and the like. The entry ends with the amino acid sequence itself. It is important to note that the sequence shown is that of the unprocessed precursor of the protein, before post-translational modification and processing. Tools that identify proteins by using isoelectric point (pI), molecular weight (Mw) or peptide fingerprints should ideally make use of the information held within the

ID ACTG_HUMAN STANDARD; PRT; 375 AA.

AC P02571; P14104; P99022;

DT 21-JUL-1986 (Rel. 01, Created)

DT 21-JUL-1986 (Rel. 01, Last sequence update)

DT 15-JUL-1999 (Rel. 38, Last annotation update)

DE Actin, cytoplasmic 2 (Gamma-actin).

GN ACTG1 OR ACTG.

OS Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),

OS Bos taurus (Bovine), Trichosurus vulpecula (Brush-tailed possum), and

OS Anser anser anser (Western graylag goose).

OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

RP SEQUENCE FROM N.A.

RC SPECIES=Human;

RX MEDLINE=88246448 [NCBI, ExPASy, Israel, Japan]; PubMed=2837653;

RA Erba H.P., Eddy R., Shows T., Kedes L., Gunning P.;

RT "Structure, chromosome location, and expression of the human gamma-

RT actin gene: differential evolution, location, and expression of the

RT cytoskeletal beta- and gamma-actin genes.";

RL Mol. Cell. Biol. 8:1775-1789(1988).

RN [2]

RP SEQUENCE FROM N.A.

CC -!- FUNCTION: ACTINS ARE HIGHLY CONSERVED PROTEINS THAT ARE INVOLVED

CC IN VARIOUS TYPES OF CELL MOTILITY AND ARE UBIQUITOUSLY EXPRESSED

CC IN ALL EUKARYOTIC CELLS.

CC -!- SUBUNIT: POLYMERIZATION OF GLOBULAR ACTIN (G-ACTIN) LEADS TO A

CC STRUCTURAL FILAMENT (F-ACTIN) IN THE FORM OF A TWO-STRANDED

CC HELIX. EACH ACTIN CAN BIND TO 4 OTHERS.

CC -!- SUBCELLULAR LOCATION: CYTOPLASMIC.

CC -!- MISCELLANEOUS: IN VERTEBRATES 3 MAIN GROUPS OF ACTIN ISOFORMS,

CC ALPHA, BETA AND GAMMA HAVE BEEN IDENTIFIED. THE ALPHA ACTINS ARE

CC FOUND IN MUSCLE TISSUES AND ARE A MAJOR CONSTITUENT OF THE

CC CONTRACTILE APPARATUS. THE BETA AND GAMMA ACTINS CO-EXIST IN MOST

CC CELL TYPES AS COMPONENTS OF THE CYTOSKELETON AND AS MEDIATORS OF

CC INTERNAL CELL MOTILITY.

CC -!- SIMILARITY: BELONGS TO THE ACTIN FAMILY.

Fig. 4.10 Example of a SWISS-PROT entry in its textual format. "..." indicates cuts in the entry. Underlined text is associated with hyperlinks on the ExPASy server.

feature table. One can then obtain the best approximation of the protein in its mature state. Hence, before computing pI and Mw, the signal sequence should be removed. On the ExPASy server, SWISS-PROT (and TrEMBL; see below) entries can be accessed either in their original format, with HTML links added (e.g. http://www.expasy.org/cgi-bin/get-sprot-entry?P00750), or in a more user-friendly visualisation mode, called "NiceProt View" (e.g. http://www.expasy.org/cgi-bin/niceprot.pl?P00750).
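The use of the feature table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy entry, the function names and the simplified whitespace-separated layout are hypothetical, and the parser ignores most of the real flat-file format. It groups lines by their two-letter code, reads the CHAIN feature, and computes the average Mw of the mature chain rather than of the precursor.

```python
from collections import defaultdict

# Average residue masses in Da for the 20 standard amino acids.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini


def parse_entry(text):
    """Group entry lines by two-letter line code; collect the sequence separately."""
    sections = defaultdict(list)
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):  # end-of-entry marker
            break
        if in_sequence:
            sections["sequence"].append(line.replace(" ", ""))
        else:
            sections[line[:2]].append(line[2:].strip())
            in_sequence = line.startswith("SQ")
    return sections


def mature_chain(sections):
    """Return the residues covered by the first CHAIN feature (1-based, inclusive)."""
    seq = "".join(sections["sequence"])
    for feature in sections["FT"]:
        fields = feature.split()
        if fields and fields[0] == "CHAIN":
            start, end = int(fields[1]), int(fields[2])
            return seq[start - 1:end]
    return seq  # no CHAIN annotated: fall back to the precursor


def average_mass(seq):
    """Average molecular weight of an unmodified peptide chain."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


# Hypothetical toy entry, not a real database record.
TOY_ENTRY = "\n".join([
    "ID   TOY_ENTRY     STANDARD;      PRT;     8 AA.",
    "FT   SIGNAL        1      3       HYPOTHETICAL.",
    "FT   CHAIN         4      8       TOY PROTEIN.",
    "SQ   SEQUENCE   8 AA;",
    "     MKAGGSGG",
    "//",
])

entry = parse_entry(TOY_ENTRY)
chain = mature_chain(entry)
print(chain, round(average_mass(chain), 2))
```

Applied to the entry of Figure 4.10, the same logic would select residues 2 to 375 (the CHAIN feature) and skip the residue annotated as PROPEP. A fuller treatment would also add the mass shifts of the annotated MOD_RES features.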

On the one hand, genome sequencing projects are generating a dramatically increasing number of sequences to be incorporated in SWISS-PROT. On the other, the number of annotators who screen the literature and sequence databases to populate SWISS-PROT with high-quality annotations is limited. Despite the increase in available raw sequence data, it was not judged appropriate to populate SWISS-PROT automatically with data of lower quality. This is where TrEMBL (Translation of EMBL

DR EMBL; M19283; AAA51579.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]

DR EMBL; M26111; AAA49315.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]

DR PIR; A28098; ATHUG.

DR PIR; A55001; A55001.

DR HSSP; P02570; 2BTF. [HSSP ENTRY / SWISS-3DIMAGE / PDB]

DR SWISS-2DPAGE; P99022; MOUSE.

DR GeneCards; ACTG1.

DR MGD; MGI:87906; Actg.

DR INTERPRO; IPR000279; -.

DR PFAM; PF00022; actin; 1.

DR PRINTS; PR00190; ACTIN.

DR PROSITE; PS00406; ACTINS_1; 1.

DR PROSITE; PS00432; ACTINS_2; 1.

DR PROSITE; PS01132; ACTINS_ACT_LIKE; 1.

DR PRODOM [Domain structure / List of seq. sharing at least 1 domain]

DR BLOCKS; .

DR PROTOMAP; .

DR PRESAGE; P02571.

DR DIP; P02571.

KW Multigene family; Structural protein; Methylation; Acetylation;

KW Cytoskeleton.

FT PROPEP 1 1 REMOVED IN MATURE FORM.

FT CHAIN 2 375 ACTIN, CYTOPLASMIC 2.

FT MOD_RES 1 1 ACETYLATION.

FT MOD_RES 2 2 ACETYLATION.

FT MOD_RES 73 73 METHYLATION.

FT CONFLICT 316 316 E -> K (IN REF. 3).

FT CONFLICT 344 344 S -> F (IN REF. 3).

SQ SEQUENCE 375 AA; 41793 MW; 54D08F986964EFD5 CRC64;

MEEEIAALVI DNGSGMCKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK DSYVGDEAQS KRGILTLKYP IEHGIVTNWD DMEKIWHHTF YNELRVAPEE HPVLLTEAPL NPKANREKMT QIMFETFNTP AMYVAIQAVL SLYASGRTTG IVMDSGDGVT HTVPIYEGYA LPHAILRLDL AGRDLTDYLM KILTERGYSF TTTAEREIVR DIKEKLCYVA LDFEQEMATA ASSSSLEKSY ELPDGQVITI GNERFRCPEA LFQPSFLGME SCGIHETTFN SIMKCDVDIR KDLYANTVLS GGTTMYPGIA DRMQKEITAL APSTMKIKII APPERKYSVW IGGSILASLS TFQQMWISKQ EYDESGPSIV HRKCF

Fig. 4.10 (continued)

Nucleotide Sequence Database [26]) steps in. TrEMBL was created in 1996 and consists of computer-annotated entries in SWISS-PROT-like format. It is populated by protein sequences translated from the coding sequences (CDS) in EMBL and is a supplement to SWISS-PROT. In a way, it can be considered as a preliminary section of SWISS-PROT; indeed, once the manual annotation is performed, the entries move on to SWISS-PROT.

Apart from text searches, both SWISS-PROT and TrEMBL can be queried by sequence analysis tools as well as protein identification and characterization tools. Tools like PeptIdent or FindMod (see Section 4.4) take advantage of the annotation of SWISS-PROT/TrEMBL to improve their capacities of identifying and characterizing active chains and proteins annotated to be post-translationally modified. This is better than basing their computations and predictions merely on the precursor protein sequences.

For many applications, such as protein identification or sequence similarity searches, it is necessary to use a database as complete and nonredundant as possible. SWISS-PROT, and to some extent TrEMBL, aim to be non-redundant by maintaining only one consensus sequence record for each protein - even if several copies or subfragments of the same protein have been sequenced. Polymorphisms (sequence variants), alternative splicing events, and sequencing conflicts are documented in the feature table of the retained copy. As a consequence, so-called ''non-redundant'' databases, which use strict sequence matches as exclusion criteria, contain a higher number of entries than the combination of SWISS-PROT + TrEMBL. The drawback, though, is that they contain multiple copies of the same protein and are thus not really non-redundant. Examples are OWL, and nrdb (Non-Redundant DataBase provided by the NCBI, built from SWISS-PROT and GenPept, an automatic translation of the GenBank nucleotide sequence database).
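The difference between strict sequence matching and the curated approach can be made concrete with a small sketch. The records and function names below are hypothetical. Exact-match exclusion removes only identical strings, so fragments and variants of the same protein survive as separate entries.

```python
def exact_match_dedup(records):
    """Keep one record per distinct sequence string (strict matching)."""
    first_name_for = {}
    for name, seq in records:
        first_name_for.setdefault(seq, name)
    return [(name, seq) for seq, name in first_name_for.items()]


def merge_fragments(records):
    """Additionally drop records that are subfragments of a longer kept record."""
    kept = []
    by_length = sorted(exact_match_dedup(records),
                       key=lambda record: len(record[1]), reverse=True)
    for name, seq in by_length:
        if not any(seq in longer for _, longer in kept):
            kept.append((name, seq))
    return kept


# Hypothetical toy records: a full sequence, an identical copy,
# a subfragment, and a single-residue variant.
records = [
    ("full", "MEEEIAALVIDNGSGMCK"),
    ("copy", "MEEEIAALVIDNGSGMCK"),
    ("fragment", "IAALVIDNG"),
    ("variant", "MEEEIAALVIDNGSGMCR"),
]

print(len(exact_match_dedup(records)))  # strict matching
print(len(merge_fragments(records)))    # after folding in subfragments
```

Even after subfragments are removed, the variant remains as a separate record. Folding it into a single entry, with the difference documented in the feature table, is the step that requires curation rather than string comparison.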

The links provided by the SWISS-PROT entries to other proteomics databases like SWISS-2DPAGE, PROSITE/InterPro and about 30 other databases allow for rapid access to experimental proteomic data, like position and number of protein spots on a 2-D gel, other members of the same family, or the 3-D structure of the protein, etc.

4.5.2.2 Specialized protein sequence databases

There are many such databases available. Their contents vary widely in scope, number of entries and type of information. As it is impossible to be exhaustive here, we recommend that readers consult the updated list on the World-Wide Web, maintained by Amos Bairoch on the ExPASy server (http://www.expasy.org/alinks.html).

Nucleotide sequence databases

Nucleotide sequence databases are the main source of sequence information for protein sequence databases. There is an international collaboration for the production, maintenance and distribution of nucleotide sequence information. The organisations involved are the European Bioinformatics Institute (EBI) in Great Britain, distributing the EMBL Nucleotide Sequence Database, the National Center for Biotechnology Information (NCBI) in the USA, distributing GenBank™ [56], and the National Institute of Genetics in Japan, distributing DDBJ (DNA Database of Japan) [57]. Within this collaboration, EMBL, GenBank and DDBJ all contain the same information - the nucleotide sequence database is in fact one single database distributed under three different names, and in three different formats.

EMBL/GenBank/DDBJ is an archive or repository database. In that respect, no attempt is made to reduce the size of the database by merging entries submitted by different groups and which correspond to the same stretches of genome or mRNA. As stated above and for historical reasons, their format is not unified, but they share the same organizational principles as described for SWISS-PROT, i.e. a header containing the name of the sequence, the species of origin, followed by the references, a feature table, and the sequence data.

The nucleotide sequence databases can be screened by a number of sequence similarity tools and some protein identification applications, such as SEQUEST [58]. As it is difficult to predict coding sequences from raw genomic data, and in some cases the authors have not submitted the information to the database, the tools propose protein hits from translated raw DNA sequences.
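Translating raw DNA without knowing the coding region usually means trying all six reading frames. The following is a minimal sketch using the standard genetic code. The function names are illustrative and do not reflect how SEQUEST or any other tool implements this step.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse-complement strand of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; '*' marks stops, 'X' unknown codons."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame("ATGGAAGAAGAA").items():
    print(frame, protein)
```

Each of the six translations can then be searched like a protein sequence, at the cost of a sixfold larger search space and many spurious open reading frames.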

In addition, a powerful approach to search DNA sequences exclusively in their coding region is to search databases populated with Expressed Sequence Tags (ESTs) [59]. ESTs are short nucleotide sequences of 250-400 base pairs generated by random, single-pass sequencing of cDNA libraries. In September 2000, public EST databases contained more than 5.6 million entries, about 40% of which were human sequences. They are highly redundant and contain many sequencing errors. However, in combination with expert analysis or searches in other sequence databases, they provide a highly valuable source of proteomic information and can be searched with mass spectrometric fragmentation data [40, 60].

Databases for protein families, domains and functional sites: InterPro

Given the rapid release of raw, uncharacterized sequence data from genome sequencing projects, there is an increasing need for automatic methods to identify protein family traits and to assign functions to unknown sequences. A number of signature databases are available, built around the idea that certain regions which are vital for a protein's function are usually well conserved. An analysis method is chosen that describes the conserved motif, fingerprint or signature in a way that enables software tools to scan unknown protein sequences and to identify, rapidly and reliably, to which known protein family (if any) a new sequence belongs. Such databases are: PROSITE [52] for regular expression patterns and profiles, Pfam [61] for hidden Markov models (HMMs), PRINTS [62] for fingerprints (groups of aligned, unweighted motifs) and Blocks [63] for aligned weighted motifs or blocks. ProDom [64] is an automatic compilation of homologous domains, derived by clustering of SWISS-PROT and TrEMBL. As each of the above-mentioned databases has its strong points and its weaknesses, and each has its own format, search tools and search outputs, it became increasingly difficult for scientists to perform a comprehensive and reliable functional analysis of their unknown sequences. In an attempt to address these difficulties, the authors of the signature databases cited above formed the InterPro consortium and developed InterPro [65], an integrated documentation resource for protein families, domains and functional sites. InterPro is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/, thus providing an integrated view of all member databases and an intuitive search interface.
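As an illustration of the simplest kind of signature, a PROSITE-style pattern can be converted into a regular expression and scanned against a sequence. The sketch below handles only the core pattern syntax (x, [..], {..}, repeat counts and terminal anchors). The function names and the test sequence are hypothetical; the N-glycosylation site pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression.

    Simplified: anchors are assumed to appear only at the pattern ends.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("x", ".")                     # any residue
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 1-based start positions of all matches, including overlapping ones."""
    regex = prosite_to_regex(pattern)
    return [m.start() + 1 for m in re.finditer(f"(?=({regex}))", sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "ANGSGNPTA"))
```

Profiles, HMMs and fingerprints capture more distant relationships than such patterns can, which is one reason the member databases complement each other within InterPro.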

2-DE databases

2-D PAGE databases, along with PTM databases (see sub-section 4.5.6) are among the most recent proteomic databases available on the web. They are also pioneers in data integration on the WWW. They collect information on proteins identified on 2-DE and consist of two major components: image data and textual information. The image data are a collection of reference gel maps. More than one map per biological sample may exist. Such an image is the representation of a stained gel scanned optically. Apparent spots represent the position of focused protein forms and are linked to the textual information component of the database. The latter includes, essentially, data on apparent pI and Mw of the spots, the name and description of the protein, the identification method, bibliographical references, and cross-references to SWISS-PROT and other databases. Access to the data and navigation within the databases are performed through simple operations, such as clicking on a spot in an image, selecting a spot/protein identifier from a list, or keyword searches.
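The link between the image component and the textual component can be pictured as a lookup from image coordinates to a spot record. The sketch below is a hypothetical data model, not the schema of any actual 2-DE database. The field names, identifiers and numeric values are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Spot:
    spot_id: str      # identifier of the spot on the reference map
    x: float          # horizontal position on the gel image, in pixels
    y: float          # vertical position on the gel image, in pixels
    pi: float         # apparent isoelectric point
    mw: float         # apparent molecular weight, in Da
    accession: str    # cross-reference to a sequence database entry


def spot_at(spots, x, y, max_distance=10.0):
    """Return the spot nearest to a clicked position, or None if none is close."""
    nearest = min(spots, key=lambda s: math.hypot(s.x - x, s.y - y), default=None)
    if nearest is None or math.hypot(nearest.x - x, nearest.y - y) > max_distance:
        return None
    return nearest


# Invented example records.
spots = [
    Spot("SPOT-A", 100.0, 200.0, 5.3, 42000.0, "ACC0001"),
    Spot("SPOT-B", 340.0, 120.0, 6.8, 28000.0, "ACC0002"),
]

clicked = spot_at(spots, 101.0, 199.0)
print(clicked.spot_id if clicked else "no spot here")
```

The accession field is what allows navigation in both directions, from a clicked spot to the sequence entry and from a sequence entry back to the map.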

4.5.5.1 SWISS-2DPAGE

The SWISS-2DPAGE [20, 6] database was created and is maintained at the Swiss Institute of Bioinformatics in collaboration with the University Hospital of Geneva. In March 2000, it contained 26 reference maps from human, mouse, Saccharomyces cerevisiae, Escherichia coli and Dictyostelium discoideum. More than 700 entries document over 3000 identified ''spots''. The proteins represented by those spots were identified by matching with other gels, by amino acid composition, by Edman sequencing or by immunoblotting; more recently, over 700 of them were identified by mass spectrometry. The text format is similar to the one in SWISS-PROT. It includes specific fields, such as the type of master gel from which the protein spot has been identified, the list of gel images associated with the entry, as well as other 2-DE specific data, such as the mapping procedure, the spot identifier, the experimental pI and Mw, the peptide mass fingerprint and the amino acid composition - if experimentally determined. The database can be accessed from SWISS-PROT by a hyperlink anchored to a cross-reference. If no identified spot exists in SWISS-2DPAGE for a given entry in SWISS-PROT, an image is generated that highlights the theoretical position of the corresponding protein.

4.5.5.2 Other 2-D PAGE databases

The number of 2-D PAGE databases is continuously increasing. An up-to-date list can be found in WORLD-2DPAGE (http://www.expasy.org/ch2d/2d-index.html), an index of 2-D PAGE databases and services. More than 25 species are represented in about 100 2-D maps all over the world. The databases are established in various formats. However, an increasing number of groups have adopted the principle of federated 2-DE databases [24], according to which the organization of and access to a database must comply with five rules.

Post-translational modification databases

In an era where more and more complete genomes are being sequenced, the challenge of understanding proteins and proteomes is moving towards understanding the protein modifications that cannot be predicted from the nucleic acid sequence. Most proteins are indeed the target of PTMs and are not functional unless they are modified. While SWISS-PROT, as a universal database, places considerable emphasis on the documentation of post-translational modifications within the sequence records, several specialized databases have been set up recently to cater for this emerging field.

RESID [66] is a general database of protein structure modifications (http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html), maintained by the National Biomedical Research Foundation in the USA and the Protein Information Resource group (PIR). The database contains descriptive, chemical, structural, and bibliographical information on 283 (Release 22.1, July 2000) types of modified amino-acid residues. Apart from text-based searches, RESID can also be queried by molecular weight: an average or monoisotopic mass can be entered (together with a mass variance) to search for all modified amino acid residues in the database with masses similar to the input mass.
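A mass query of this kind amounts to a tolerance search over a table of masses. The sketch below uses a small hand-made table of three well-known modification mass shifts, not RESID data, and the function name is hypothetical. RESID itself stores information on modified residues, so the real query operates on a different and much larger table.

```python
# Mass shifts in Da as (monoisotopic, average); an illustrative table only.
MODIFICATIONS = {
    "acetylation": (42.010565, 42.0367),
    "methylation": (14.015650, 14.0266),
    "phosphorylation": (79.966331, 79.9799),
}


def search_by_mass(mass, tolerance, monoisotopic=True):
    """Return the names of all modifications within +/- tolerance of the input mass."""
    column = 0 if monoisotopic else 1
    return sorted(
        name
        for name, masses in MODIFICATIONS.items()
        if abs(masses[column] - mass) <= tolerance
    )


print(search_by_mass(42.01, 0.01))
print(search_by_mass(80.0, 0.05, monoisotopic=False))
```

The choice between average and monoisotopic masses should follow the instrument: monoisotopic values apply when the isotopic peaks are resolved, average values otherwise.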

Apart from this general PTM database, some databases specialized in one particular type of PTM have been created. GlycoSuiteDB (http://www.glycosuite.com) is an annotated database of glycan structures. The database is provided by Proteome Systems Ltd and contains information about most published O- and N-linked glycans. If glycan structures are known to be attached to proteins, cross-references to SWISS-PROT/TrEMBL are given. GlycoSuiteDB can be queried by mass, by attached protein, by oligosaccharide composition or by different modes of textual queries (taxonomy, biological source, etc.).

Two more examples of PTM databases are O-GLYCBASE [67], a database of O-glycosylated proteins (http://www.cbs.dtu.dk/databases/OGLYCBASE/), and PhosphoBase [68], a database of phosphorylation sites in proteins and peptides (http://www.cbs.dtu.dk/databases/PhosphoBase/). Both databases are maintained by the Center for Biological Sequence Analysis (CBS) in Denmark, which also provides prediction servers for both types of modifications (NetOGlyc and NetPhos).

Although the current number of PTM databases is quite small, we expect them to multiply, with an increasing amount of data on PTM structures becoming available, as well as the amino-acid sequence patterns where they can be found.

Conclusion

It is clear that the databases described above do not cover all the aspects of interest for proteomics researchers. There are databases that use the sequence databases to perform calculation and analysis, such as sequence clustering, phylogeny, or profile searching, and thus create added-value databases. Other databases report results from functional studies and mutational experiments, or from 3-D structure determination, or describe metabolic pathways. Although it is impossible to list them all here, it is of interest to know that they exist. Some of them are permanently updated, some of them have only a short existence, some of them are not publicly available. The existence of databases in proteomics is following a dynamic and continuously developing model, and simply reflects the dramatic evolution the field of proteomics is witnessing.

Automation in proteome analysis

Introduction

As already explained, analysing a proteome first involves the isolation of the biological sample of interest. The proteins it contains then have to be separated to allow their individual identification, characterization and possibly quantification (Figure 4.1). The number of proteins to analyse is very often large, ranging from a few hundred in simple prokaryotes to possibly 500 000 to one million in human samples [2].

In order to enable the analysis of entire proteomes, one has to develop scalable technologies. This often means that standard methods have to be automated and, if possible, parallelized.

We already know from the first Sections of this Chapter that the separation of all proteins from a complex biological sample using 2-DE is a well established and widely used technique [2]. It produces thousands of distinct protein samples ready to be further analyzed, and it does so with very few manipulations and in a parallel manner (see Section 4.2).

From 2-DE gel spots, protein identification can be performed using various methods, as described in Section 4.4. All of these approaches imply a serial process, starting with a single-step or multiple-step chemical processing of the entire proteins, followed by the measurement and interpretation of the resulting protein attributes, and then by specific queries in protein sequence databases. The outputs of such searches are interpreted and used, first, to annotate the 2-DE image with the identification results and, second, to focus on the biological relevance of the expression level of the interesting protein spots.

The next step in proteome analysis is the characterization of all proteins, i.e. the description of the processing events that have occurred, of the presence and position of post-translational modifications, or of amino acid substitutions.

This Section deals with automation. Automation first means ''hands-free'' work. It involves reduction of the technician's work, better reproducibility, reduction of potential human mistakes, an integrated approach, and a sample-tracking system. Very often automation is also associated with high throughput and therefore continuously faces bottleneck issues. It should be noted that both aspects are complementary and have to be considered seriously. Very high throughput methods do not afford the possibility to interpret and validate all detailed aspects of an analysis. In terms of protein identification and database update, this means that there is a conflict between making use of all detailed annotations in the databases and validating all results manually on the one hand, and producing massive amounts of very high quality computer-interpreted data on the other. It is a real challenge for bioinformatics developers to optimize the gel analysis software as well as the identification and characterisation tools and to release them as a powerful and complementary package. Indeed, these tools have to handle a high volume of data, often in a very short time. They need to be of the best quality and accuracy. The results of the analyses have to be visualized and interpreted in a global way, while providing access to all details.

Robotized protein identification using peptide mass fingerprinting

Manual protein identification from 2-DE gels is a time consuming technique that starts with spot excision and ends with the update of 2-DE databases. Nevertheless, this repetitive and exhaustive procedure could be mechanized using robots, thus increasing protein identification throughput.

Fig. 4.11 Flow chart describing a robotised approach for automated protein identification using 2-DE and mass spectrometry.

In the process of automating the procedure using robotics, one has to pay attention to the different elements to be integrated. These include components that are also used in a manual approach, complemented by dedicated elements: 1) machines to prepare samples, 2) mass spectrometry equipment to perform the analysis of the digested sample, 3) bioinformatics tools dedicated to identification, 4) databases to be queried, 5) validation and interpretation software, 6) launching algorithms and integration software, 7) data acquisition and archiving hardware system. These elements are integrated in a system that can be represented as a flow chart, relating to the experimental protocol.

Figure 4.11 shows a representation of the system integrated in our laboratories. After the 2-DE separation, the gels (or the electroblotted membranes) are stained and an image is scanned. The Melanie image analysis software is used to detect and choose protein spots of interest. A spot excision system (in our case the ARRM214BR, ARRM, Sydney, Aus) cuts the gel pieces and deposits them in microtiter plates. The plates are processed by a first destaining step, followed by an incubation with a trypsin solution and a peptide extraction procedure, using a liquid handling system (in our case the SymBiot Workstation, Perseptive Biosystems, PE Corp., Framingham, MA). This system is able to load different MALDI MS sample plates with peptide solutions and matrix solution. The samples are then ready to be measured, in an automated mode, with a mass spectrometer (in our case, Voyager Elite MALDI-TOF MS, Perseptive Biosystems, PE Corp., Framingham, MA). After the submission of all MS data to an improved peak detection and calibration algorithm, the obtained peak tables are sent on-line, together with input parameters, to a protein identification tool (in our case PeptIdent and SmartIdent). These tools provide HTML result files (see Section 4.4), interpreted by validation software that allows the user to assign identifications to protein spots (CheckPeptIdent, C. Hoogland, personal communication). The update of a 2-DE database, like SWISS-2DPAGE, is performed with a high degree of automation by the database manager.
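The identification step at the end of this pipeline can be illustrated with a minimal peptide mass fingerprinting sketch: an in-silico tryptic digest of each candidate sequence, followed by a comparison of theoretical and measured peptide masses. The function names, the toy database and the simple match-counting score are assumptions made for illustration. They do not reproduce the scoring used by PeptIdent or SmartIdent.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(seq):
    """Cleave after K or R, except before P; no missed cleavages are considered."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(MONO_RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(peaks, seq, tolerance=0.2):
    """Number of theoretical peptides with a measured peak within the tolerance."""
    theoretical = [mh_plus(p) for p in tryptic_digest(seq)]
    return sum(any(abs(m - peak) <= tolerance for peak in peaks) for m in theoretical)


def rank_candidates(peaks, database, tolerance=0.2):
    """Rank database entries by their number of matching peptides."""
    scores = {name: count_matches(peaks, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database and a peak list simulated from the first entry.
database = {"toy1": "MKAGRPLK", "toy2": "GGGGKAAAAR"}
peaks = [mh_plus(p) for p in tryptic_digest(database["toy1"])]
print(rank_candidates(peaks, database))
```

A production tool would also allow for missed cleavages, chemical and post-translational modifications, and pI and Mw constraints, which is where the database annotation becomes valuable.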

The major advantages of this robotised approach are the limitation of human interventions, thus an increase in reproducibility and a decrease of the number of potential human errors. In our laboratory, this procedure currently yields a throughput of 400 fully analyzed spots per person and per week. The bottleneck is the validation step, because each individual protein identification has to be validated manually before being introduced into the SWISS-2DPAGE database.

Other similar approaches are under development in various companies. We give a non-exhaustive list here. Bruker Daltonics (http://www.bruker.com/) integrates a MAPII robot and a BiflexIII or a ProflexIII MS, and has developed the AutoXecute software to handle the system, which uses MS Biotools supporting the MASCOT (www.matrixscience.com) identification program. Micromass (http://www.micromass.co.uk), together with Bio-Rad (http://www.bio-rad.com/), implements the Proteome Works System to handle an ARRM214BR spot excision system, a TOF Spec-2E MALDI-TOF MS or a Q-TOF MS, and a MassLynx-ProteinLynx software package. Genomic Solutions (http://www.genomicsolutions.com) is preparing a system including a Flexys Proteomics Robot, a ProGest Digestion Station, a Pro-MS Robot and a MALDI-TOF MS. The Protein Solution 1 package (Applied Biosystems, http://www.appliedbiosystems.com) integrates the Symbiot workstation with various MS instruments (the Voyager STR MALDI-TOF MS, the Voyager DE-PRO or even the Mariner API-TOF LC-MS) and the Protein Prospector (http://prospector.ucsf.edu) identification tools.

The molecular scanner

Even though the robotized approach is automated and thus requires minimal human intervention, the digestion process is partially sequential, and therefore does not guarantee that all parameters in the manipulation of numerous protein spots are fully identical. The approach is also time-consuming and involves expensive sample handling, due to material costs. In addition, since the size of a sample is limited by the size of the excised spot, problems occur when overlapping spots are present on a gel.

We can then think of a method that parallelizes the protein digestion step, reduces sample handling, and decreases the analyzed sample size to that of the MALDI-TOF MS laser beam impact (a spot of a few tens of microns in diameter).

In that respect, and to offer a flexible and powerful proteomic visualization tool, we developed a highly automated method that can create a fully annotated 2-D map starting from a 2-DE. This technology is called "molecular scanner". It combines parallel methods for protein digestion and electro-transfers (using the one-step digestion-transfer (OSDT) or the double parallel digestion (DPD) techniques as described by Bienvenut et al. [10]) with peptide mass fingerprinting approaches to identify proteins directly from PVDF membranes, the surface of which is scanned with MALDI-TOF MS. Using a set of dedicated tools, this procedure allows the creation, analysis and visualization of a set of virtual 2D or 3D images. These images reveal particular aspects of the proteome of interest, whose complexity can be described as a multi-dimensional image (see below for the explanation of the multi-dimensionality of the proteome) [11].

4.6.3.1 The principle of the method

The method can be divided into four main sections (Figure 4.12).

A) Separation and digestion of the proteins. First, the proteins are separated on 1-DE or 2-DE gels. The proteins are then digested and simultaneously electrotransferred onto a polyvinylidene difluoride (PVDF) membrane. To do that, a hydrophilic membrane, to which trypsin is covalently bound, is inserted between the gel and the collecting PVDF membrane in the electroblot sandwich. This method is called One-Step Digestion-Transfer. The collecting PVDF membrane thus contains sets of digestion products of all proteins, each of them concentrated at discrete positions on the membrane surface.

B) Acquisition of the peptide mass fingerprinting data. The collecting PVDF membrane is then sprayed with a matrix solution and stuck on a modified MALDI sample plate using high vacuum grease as glue. The MS then scans the membrane, i.e. a mass spectrum is acquired at each position of a predefined array. The distance between separate MS acquisitions ranges between 0.2 mm and 0.5 mm. Mass spectra are acquired on a Voyager™ Elite MALDI-TOF mass spectrometer (Perseptive Biosystems, Framingham, MA, USA) equipped with a 337 nm nitrogen laser and a Delayed Extraction device. The set of co-ordinates of the laser

[Fig. 4.12, panels A and B. A) Protein separation, digestion and transfer: the 2-DE gel, the membrane with immobilised trypsin and the collecting PVDF membrane in the electrotransfer sandwich. B) Automated acquisition of MALDI-MS spectra: the collecting PVDF membrane applied on the MALDI-TOF MS, the TOF detector, and the resulting set of MS spectra.]
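The scanning step in B) can be sketched as the generation of a regular array of acquisition positions, followed by the construction of a virtual image for one peptide mass. The code below is an illustrative sketch only. The function names, the membrane dimensions and the toy spectra are hypothetical and do not describe the instrument control software.

```python
import math


def scan_grid(width_mm, height_mm, step_mm=0.5):
    """Return the grid size and the acquisition positions of a regular array."""
    nx = math.floor(width_mm / step_mm + 1e-9) + 1
    ny = math.floor(height_mm / step_mm + 1e-9) + 1
    positions = [
        (ix, iy, ix * step_mm, iy * step_mm)
        for iy in range(ny)
        for ix in range(nx)
    ]
    return nx, ny, positions


def virtual_image(spectra, nx, ny, target_mz, tolerance=0.2):
    """Build a 2-D intensity map for one m/z value from position-indexed spectra.

    spectra maps (ix, iy) grid indices to a list of (m/z, intensity) peaks.
    """
    image = [[0.0] * nx for _ in range(ny)]
    for (ix, iy), peak_list in spectra.items():
        image[iy][ix] = sum(
            intensity
            for mz, intensity in peak_list
            if abs(mz - target_mz) <= tolerance
        )
    return image


# Hypothetical 1.0 mm x 0.5 mm area scanned with a 0.5 mm step.
nx, ny, positions = scan_grid(1.0, 0.5, step_mm=0.5)

# Invented spectra: one peptide signal near m/z 641.41 at two neighbouring positions.
spectra = {
    (1, 1): [(641.40, 2.0), (900.00, 1.0)],
    (2, 1): [(641.45, 5.0)],
}
image = virtual_image(spectra, nx, ny, target_mz=641.41)
print(nx, ny, len(positions))
print(image)
```

Repeating the image construction for every identified peptide mass, or for the masses belonging to one identified protein, yields the family of virtual images from which the annotated map is assembled.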
