Protein structure prediction and analysis in the broader context of drug target finding and rational drug design. The figure illustrates one particular method of combining background information on biochemical networks, relevant metabolic or regulatory pathways, the mapping of available structural knowledge onto these pathways, and available expression measurements using a statistical scoring function. The method produces a candidate list of the most interesting target proteins within the context and data in question. In an ideal situation, the most promising target proteins have an experimentally solved structure or a high-quality predicted model attached to them. Such candidates could then be used not only in functional assays and biochemical experiments but also in computer-aided rational drug design and experiment planning.

Selection of tools for structure analysis, superposition, classification, and visualization





Secondary Structure Computation and Prediction



Visualization and Structure Viewers


Rasmol chime

Cn3d kinemage

3-state secondary structure annotation for protein structures

Secondary structure, accessibility, and hydrogen bonding information Topology cartoons of secondary structure arrangements of proteins from DSSP input

Structure, superposition, alignment and modelling visualization Easy to use molecular structure viewer

Molecular structure viewer and plugin

Molecular structure viewer and plugin

Interactive molecule presentation and annotation system W. Kabsch, C. Sander (distributed via CMBI, NL, G. Vriend) stride.html EMBL, D. Frishman, P. Argos EBI, Cambridge, UK, D. Gilbert et al. spdbv/ SIB, CH, N. Guex et al. R. Sayle

MDL Information Systems Inc.

cn3d.shtml NCBI, Bryant et al. Duke University, J&D. Richardson

Molscript Ligplot

Structural Alignment and Analysis





Production of publication quality graphics for protein structures Publication quality graphics for protein ligand complexes

Superposition of two pdb structures or regions Multiple structural alignment and protein clustering (DDD, FSSP) Double Dynamic Programming based structure alignment Secondary structure element based superposition Alignment of vectors of secondary structures Structural alignment of SCOP

sequences Combinatorial extension (CE) of optimal paths of local structural similarities/compound likeness (CL) searches Checking protein quality and quality plots


Structure quality check modules as part of the modelling package WhatIF [310]

P. Kraulis [311]

University College, London, J. Thornton et al. software University of Reading, A. Martin [294]

EBI, Cambridge, L. Holm [101, 232, 312]

MRC and UCL, London, W. Taylor, C. Orengo [103]

NCI/NIH, N. Alexandrov

NCBI/NIH, Bryant et al. align/

Yale University, M.Gerstein Lab http: //cl. ce.html

UCSD, Bourne et al.

procheck.html University College, London, J. Thornton et al. [313]

CMBI, Nijmegen, NL, G. Vriend





Alpha shapes Voronoi tessellation

Scoring Potentials, PDF

(probability density functions)


General geometry package usable to analyse packing and accessibility General geometry package usable to analyse packing and accessibility

Archive and evaluation of scoring potentials

http: // alpha/ NCSA, H. Edelsbrunner http: // www. geom.umn. edu/ locate / qhull Qhull program, University of Minnesota, Geometry Center http:// CARB, Rockville, MD, J. Moult et al.

Structure prediction servers and programs

Secondary Structure Prediction

PHD Multiple alignment trained neural network prediction predator multiple alignment based




Threading programs


TOPITS, Predict_Protein threader, threader2

PSI-BLAST generated sequence profiles and NN

Consensus of several predictions

Decision tree based consensus of several predictions

Produce sequence profile with PSI-BLAST

and use it in PSI-BLAST run against PDB alignment of secondary structure motifs

Threading with pair interaction potentials

Bryant Gibbs Threading Gibbs sampling for sequence-structure alignment with contact potentials

dictprotein.html EMBL/Columbia University, B. Rost argos/ predator/ predator_ info.html

EMBL/MIPS/GSF, Munich, D. Frishman http: // psipred/ Brunel University, UK, D. Jones servers/jpred.html EBI, Cambridge, UK, G. Barton et al. GMD, Sankt Augustin, J. Selbig et al. A. Godzik, L. Rychlewski EMBL, B. Rost http: // threader/ threader.html Brunel University, UK, D. Jones

threading.html NCBI, NIH, S. Bryant et al.

GenThreader ( mGenThreader)





P-Map ssPsi


Sequence (PSI-BLAST profile) alignment with neural network based evaluation of alignments with structural scores Sequence profiles are threaded against structure profiles using secondary structure preferences and solvation potentials

Consensus of several profile alignment methods with different scoring matrices and profiles Produce sequences with PSI-BLAST and search against PDB Build HMM for sequence; use HMM to predict secondary structure and to search the PDB. Conversely, HMMs for PDB structure are used to score the sequence Library of profiles and substitution matrices from HOMSTRAD structural alignment DB. Sequence or profile is searched against library. PSI-BLAST approach

HMMs for structural families from CATH

http: // psipred/ Brunel University, D. Jones [205, 283]

ICRF, London, M. Sternberg et al. Ben-Gurion University, Israel, D. Fischer [200, 283, 315]

A. Godzik, L. Rychlewski [20, 183, 237]

T99-query.html UC Santa Cruz, CA, K. Karplus University of Cambridge, UK, K. Mizuguchi, T. Blundell et al.

http: // Pages / products 1 .html dnaMining informatics software LLC, Bruno sspsi/ Stockholm University, A. Elofsson http:// foresst2/

Analytical Biostatistics Section, NIH, V. Di Francesco et al.

loopp frsvr







Learning, Observing and Outputting Protein

Patterns Threading with 3D profiles

Dynamic programming method for contact potential threading Profile threading with CCPs (see section 4)

using secondary structure predictions Profile threading with CCPs (see Section 6.4) using PSI-BLAST sequence/structure profiles and optimized scoring term weighting CCP (123D*) and pair contact potential threading with (recursive) dynamic programming (RDP, branch-and-bound algorithm), profiles, and p-values Combination of threading/sequence features with ab initio simulation: assembly of structure fragments with sequence similarity using MC simulations Variety of alignment, analysis and modeling tools, automated modeling server

Cornell University, R. Elber et al. people/ frsvr/ frsvr.html

UCLA, D. Fischer, D. Eisenberg

University of Salzburg, M. Sippl

NIH/NCI, Frederick, MD, N. Alexandrov, R. Zimmer

GMD, Sankt Augustin, R. Zimmer et al.

http:// GMD, Sankt Augustin, R. Zimmer et al.


(Information only) Washington University, D. Baker et al.

Swiss Bioinformatics Institute, Geneva, CH, N. Guex; M. Peitsch

6.4.2 Structural genomics

The goal of structural genomics projects is to solve experimental structures of all major classes of protein folds systematically independent of some functional interest in the proteins [238, 239]. The aim is to chart the protein structure space efficiently. Functional annotations and/or assignment are made afterwards.

Several estimates of the number of different folds have been presented. Some estimates have been based on the number of new and related folds and an analysis of structural classifications and their growth during the past years [240]. Other estimates can be derived from systematic predictions on complete genomes [241-243]. A recent analysis [244] of this type, classifying the new entries of the central structural depository (PDB [245, 332]), has revealed that the fraction of proteins establishing a new family, superfamily, or fold remained constant (at about 15%) over the years 1990-1997, as compared to new proteins belonging to a known family (15%), known proteins from different species (20%), and new (modified) structures of already resolved proteins (about 50%). In addition, new structural entries for sequences with no detectable sequence similarity to known folds (no significant hit via PSI-BLAST searches) yielded about 30% new folds, about 25% new superfamilies with an already known fold, and 45% new families homologous to a known superfamily.

Bioinformatics can support structural genomics by selecting targets for experimental high throughput structure determination, in order to make the best use of the solved structures; i.e. the chance of determining novel folds among the structures to be solved should be optimized. Several clusterings, based on a range of approaches, are used for the purpose of selecting and prioritizing appropriate targets for further X-ray or NMR investigation (see the protein structure initiative [333]) and for keeping track of structural knowledge on proteins, including models and predictions (Presage server [334]). One goal of the overall structural genomics endeavor is to have a solved structure within a certain structural distance to any possible target sequence, which would allow computing a reliable model for all target sequences. This requirement depends on the future enhancements of the modeling procedures on the one hand and on the metric of structural similarity underlying the structure space and the accompanying clusterings, on the other hand. Once a map of the structure space is available, this knowledge should provide additional insights on what the function of the protein is in the cell and with what other partners it might interact. Such information should add to information gained from high-throughput screening and biological assays.

So far, glimpses of what will be possible could be obtained by analyzing complete genomes or large sets of proteins from expression experiments with the structural knowledge available today, i.e. more or less complete representative sets and a quite coarse coverage of structure space [243, 246, 247].

Recently, several successful individual structure-based functional predictions within the structural genomics projects have been reported: an ATPase or ATP-mediated switch in one example [248], a new NTPase in M. jannaschii in a second example [249]; and other test cases are the HIT family [250], E. coli ycaC [251], HdeA [252], and yjgF gene products [253]. For a recent review on the current capabilities and prospects of function predictions from structure for bona fide hypothetical proteins on a genome-wide scale, see [254].

Selecting targets for structural genomics

The number of different folds is much smaller than the number of proteins. Structure is more conserved than sequence, as nature modularly re-uses successful architectural designs, which allow for some sequence variation. A (continuously increasing) fraction of the possible folds is known. Structural genomics projects aim at producing a representative set of folds. There are increasing chances of finding an appropriate fold for a sequence with unknown structure in the current database of representative folds. Determining a correct fold facilitates the application of homology-modeling techniques to produce a full-atom model structure.

In order to enable the structural genomics projects to focus on the most promising targets for structure determination, structure prediction methods have been applied to estimate the chances of particular proteins adopting a novel fold, i.e. a fold not yet contained in the structure databases. This should be done for many sequences based on the sequence information alone. Two different approaches have been published. One is based on a clustering of proteins and the distribution of the clusters in sequence and structure space [255]. The other is based on an estimate of erroneous predictions as a function of the corresponding computed sequence-structure compatibility score resulting from a prediction experiment [256]. In the latter work, Eisenberg and coworkers use a threading program (sequence derived properties SDP [199]) to assign folds to all ORFs of the archaebacterium P. aerophilum. For that purpose they determine a confidence level for an alignment describing the likelihood that it is correct. This confidence is derived from the probability distribution of incorrect matches as judged by DALI/FSSP alignments and z-scores (see Chapter 2 of this volume for the definition of z-score) using a test set of 3285 domains and 10 784 655 pairs achieving a high enough SDP z-score. With the calibrated confidence levels, more than 40% of the non-trans-membrane proteins could be assigned a fold with 90% confidence and more than 10% with 99% confidence. In addition, about 11% are predicted to have an as yet unobserved fold with 90% confidence, and at 99% confidence 14 proteins are still predicted to adopt a novel fold.
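The idea of calibrating a confidence level from reference matches can be sketched as follows. This is a minimal illustration under stated assumptions, not the published SDP procedure: the confidence of a match with score z is estimated here as the fraction of correct matches among all reference matches scoring at least z. The function names and the toy scores are hypothetical.

```python
from bisect import bisect_left

def confidence(z, correct_scores, incorrect_scores):
    """Estimate the fraction of correct matches among reference matches with score >= z."""
    correct = sorted(correct_scores)
    incorrect = sorted(incorrect_scores)
    n_correct = len(correct) - bisect_left(correct, z)
    n_incorrect = len(incorrect) - bisect_left(incorrect, z)
    total = n_correct + n_incorrect
    if total == 0:
        return None  # no reference match scores this high
    return n_correct / total

def min_score_for_confidence(level, correct_scores, incorrect_scores):
    """Smallest observed score whose estimated confidence reaches `level`."""
    for z in sorted(set(correct_scores) | set(incorrect_scores)):
        c = confidence(z, correct_scores, incorrect_scores)
        if c is not None and c >= level:
            return z
    return None

# Toy reference scores (invented for illustration only).
correct_scores = [4.0, 5.0, 6.0, 7.0, 8.0]
incorrect_scores = [1.0, 2.0, 3.0, 4.5, 5.5]

threshold_90 = min_score_for_confidence(0.9, correct_scores, incorrect_scores)
```

With such a calibration, each fold assignment can be reported together with the confidence level its score reaches, which is how statements such as "assigned with 90% confidence" can be read.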

Portugaly and Linial used the exhaustive PROTOMAP clustering of sequences [257] and the structural classification SCOP [81] to determine a combined sequence-structure classification of proteins. Based on PROTOMAP 2.0 (72 623 SwissProt sequences clustered into 13 454 clusters, 1999), SCOP version 1.37 (2294 domains, 834 families, 593 superfamilies, 427 folds, Oct. 1997), and a mapping of structures to sequences, the sequence clusters are classified as vacant or occupied, depending on whether the clusters contain known folds. The PROTOMAP clustering structures the sequence space as a graph (the PROTOMAP graph) in which the edges between the sequence clusters indicate their proximity. Using several measures derived from the cluster space (distances between the clusters, density of clusters, maximal vacant volume), statistical models are constructed to distinguish vacant clusters that have a known but not yet determined fold from vacant clusters with probably novel folds. The procedure is illustrated in Figure 6.7. The clusters having the highest probability of representing an as yet unobserved fold constitute promising targets for experimental structure determination.
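The occupied/vacant labelling and one of the graph-derived measures can be sketched as follows. This is a simplified illustration: it uses only the graph distance to the nearest occupied cluster as a stand-in for the several measures mentioned above, and it does not reproduce the statistical models of the original work. All names and the toy graph are hypothetical.

```python
from collections import deque

def classify_clusters(cluster_members, sequences_with_structure):
    """Label a cluster 'occupied' if any member sequence has a known structure."""
    return {
        cid: "occupied" if any(s in sequences_with_structure for s in members) else "vacant"
        for cid, members in cluster_members.items()
    }

def distance_to_occupied(edges, labels):
    """Multi-source BFS: graph distance from each cluster to the nearest occupied cluster."""
    neighbours = {cid: set() for cid in labels}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dist = {cid: 0 for cid, label in labels.items() if label == "occupied"}
    queue = deque(dist)
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist  # clusters not reachable from any occupied cluster are absent

# Toy data (invented): four sequence clusters, one solved structure.
clusters = {"c1": ["s1", "s2"], "c2": ["s3"], "c3": ["s4"], "c4": ["s5"]}
solved = {"s1"}
edges = [("c1", "c2"), ("c2", "c3")]

labels = classify_clusters(clusters, solved)
dist = distance_to_occupied(edges, labels)

# Rank vacant clusters: farther from known structures first (unreachable = farthest).
ranked = sorted(
    (c for c in labels if labels[c] == "vacant"),
    key=lambda c: -dist.get(c, float("inf")),
)
```

Under this simplification, vacant clusters far from any occupied cluster are ranked first as candidates for novel folds.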

Genome annotation

Since only 20-40% of the protein sequences in genomes such as those of M. genitalium, M. jannaschii, or M. tuberculosis have a sequence similarity that reveals their paralogy to proteins of known function [242, 258], we need to be able to draw conclusions about proteins that exhibit much lower similarities to suitable model proteins. As the similarity between query sequence and model sequence falls below a threshold of, say, 25%, we can no longer draw safe conclusions about a common evolutionary origin of the query sequence and the model sequence. However, it turns out that, in many cases, we can still reliably predict the protein fold, and in several cases we can even generate detailed structural models of protein binding sites. Thus, especially in this similarity range, protein structure prediction can help to ascertain aspects of protein function [243, 246, 247].

Sequence-to-structure-to-function paradigm

Having 300 000 sequences but only a few thousand structures in the databases requires a method to establish relationships of sequences to homologous proteins with known function and/or structure.

Illustration of the PROTOMAP approach and an example of a PROTOMAP graph of protein structure space. Nodes represent sequence clusters; edges represent proximities between clusters in sequence. Lightly and darkly shaded nodes are clusters that contain sequences known to fold into structure A and B respectively. Using these clusters the remaining protein space can be heuristically partitioned into areas that are likely to fold into known structures and those that are likely to fold into novel structures.


The relationship between structure and function is a true many-to-many relation. Recent studies have shown that particular functions can be mounted onto several different protein folds [85] and, conversely, that several protein fold classes can perform a wide range of functions [259]. This limits our potential for deducing function from structure. It is still possible, however, to use the aforementioned knowledge of the range of folds supporting a particular function and the range of functions implemented by particular folds in order to make functional predictions from structure.

Homologous proteins have evolved from a common ancestor and nearly always have a similar 3D structure. This does not necessarily imply similar function [129, 259, 260]. Probabilistic sequence searches have their limits despite the significant improvements due to PSSMs [9, 203, 205], secondary structure information [156, 157, 165, 184], and HMMs [25, 179, 261], which have been systematically evaluated by several authors [20, 129, 147, 195, 205].

If a protein has a known fold, which in addition is annotated with a particular or a few specific functions, the model derived for the sequence in question based on the respective fold can be evaluated with respect to compatibility with some of the proposed functions.

In general, this sequence-to-structure-to-function paradigm involves predicting a structural model for a given sequence (by sequence and/or structural means, e.g. threading) and, afterwards, identifying in the constructed model a 3D pattern, e.g. an active site, that is indicative of a particular function.

Especially for enzymes, common mechanisms and associated active sites are often observed. This allows searching for appropriate 3D patterns in a database of structures and checking whether models contain such a pattern (for a well-known example, see the SER-HIS-ASP catalytic triad of serine proteases [262]).
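A search for such a 3D pattern can be sketched as a purely geometric filter. The example below is a minimal sketch under strong assumptions: each residue is reduced to one representative atom position, and a triad is reported when the SER-HIS and HIS-ASP distances fall below a cutoff. The cutoff value, the function name, and the toy coordinates are hypothetical and not taken from [262].

```python
from itertools import product
from math import dist

def find_triads(residues, max_dist=4.0):
    """Return (ser, his, asp) residue numbers whose SER-HIS and HIS-ASP distances
    are both within max_dist (in Angstrom).

    residues: list of (name, number, (x, y, z)), one representative atom per residue.
    """
    by_type = {"SER": [], "HIS": [], "ASP": []}
    for name, number, xyz in residues:
        if name in by_type:
            by_type[name].append((number, xyz))
    hits = []
    for (s_no, s), (h_no, h), (d_no, d) in product(
        by_type["SER"], by_type["HIS"], by_type["ASP"]
    ):
        if dist(s, h) <= max_dist and dist(h, d) <= max_dist:
            hits.append((s_no, h_no, d_no))
    return hits

# Toy coordinates (invented); only the residue names and numbering are meaningful.
model = [
    ("SER", 195, (0.0, 0.0, 0.0)),
    ("HIS", 57, (3.0, 0.0, 0.0)),
    ("ASP", 102, (6.0, 0.0, 0.0)),
    ("ASP", 200, (30.0, 0.0, 0.0)),  # too far away to be part of the pattern
    ("GLY", 10, (1.0, 1.0, 1.0)),    # ignored: not a pattern residue type
]
triads = find_triads(model)
```

The same filter can be run over every entry of a structure database or over a predicted model; a realistic implementation would use specific side-chain atoms and tolerances derived from known active sites.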

Sometimes ligand-binding pockets are identifiable on protein surfaces. With such a binding pocket hypothesis, docking tools can be applied in order to screen for compounds that potentially bind to the protein site.

Protein-protein interactions can be predicted with various methods: analyzing known protein interfaces and inferring common motifs via surface and sequence conservation criteria [263-265], protein fusions [95, 266] or, less successfully so far, correlated mutations [267].

A systematic assessment of the relationship between protein function and structure has been performed by Hegyi and Gerstein [259] by relating yeast enzymes classified by the Enzyme Commission (EC numbers) to structural SCOP domains. This study found that different structural folds have different "propensities" for various functions. The most versatile functions (hydrolases and O-glycosyl glucosidases) have been identified to be mounted onto seven different folds, whereas the most versatile folds (e.g. TIM-barrel and Rossmann folds) realize up to 16 different functions.
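The two counts used in such an assessment, folds per function and functions per fold, can be derived from a list of (fold, function) assignments. The sketch below uses invented placeholder labels rather than data from the study; the function name is hypothetical.

```python
from collections import defaultdict

def fold_function_maps(assignments):
    """Build both directions of the many-to-many fold/function relation."""
    folds_per_function = defaultdict(set)
    functions_per_fold = defaultdict(set)
    for fold, function in assignments:
        folds_per_function[function].add(fold)
        functions_per_fold[fold].add(function)
    return folds_per_function, functions_per_fold

# Placeholder assignments (invented for illustration only).
assignments = [
    ("fold_A", "function_1"),
    ("fold_A", "function_2"),
    ("fold_B", "function_1"),
    ("fold_B", "function_3"),
    ("fold_C", "function_1"),
]
folds_per_function, functions_per_fold = fold_function_maps(assignments)

# The most versatile function is the one mounted onto the most folds.
most_versatile_function = max(folds_per_function, key=lambda f: len(folds_per_function[f]))
```

Applied to real EC-to-SCOP assignments, the sizes of these sets correspond to the versatility figures quoted above.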

In addition to conventional sequence motifs (Prosite, BLOCKS, PRINTS, etc.), the compilation of structural motifs indicative of specific functions from known structures has been proposed [268]. This should improve even the results obtained with the multiple (one-dimensional sequence) patterns exploited in the BLOCKS and PRINTS databases. Recently, the use of models to define approximate structural motifs (sometimes called fuzzy functional forms, FFFs [269]) has been put forward to construct a library of such motifs, enhancing the range of applicability of motif searches at the price of reduced sensitivity and specificity. Such approaches are supported by the fact that the active sites of proteins necessary for specific functions are often much more conserved than the overall protein structure (e.g. bacterial and eukaryotic serine proteases), such that an inexact model could still contain an accurately conserved part responsible for function. As the structural genomics projects produce a more and more comprehensive picture of the structure space with representatives for all major protein folds, and as improved homology search methods link the related sequences and structures to such representatives, comprehensive libraries of highly discriminative structural motifs become conceivable.

Whether or not it is possible to infer protein function, however detailed, from an experimental structure, from an approximate model, or even from only a partial fold model is a key question in the context of large-scale functional annotation. Currently, the situation is that such derivations are sometimes possible; in more cases, additional evidence can be established for functions predicted with sequence methods or from functional motifs; in most cases, a detailed functional classification cannot be guessed from inaccurate structures.

Though it is not easy to derive functions from resolved protein structures, the availability of structural information improves the chances as compared to relying on sequence methods alone.

Validation of predictions

Theoretical predictions are risky. Therefore, almost all such predictions require experimental validation. Nevertheless, the models can often indicate appropriate ways for validation or further experiments. These experiments can be expected to be time-consuming and expensive. Furthermore, the protein actually needs to be available for the suggested experiments. All of this limits the applicability of experimental validation. Therefore, it is mandatory to reduce errors as much as possible and to indicate the expected error range via computer-based predictions. This is not a trivial problem for structure prediction, though. An estimate of the performance and accuracy of the respective methods can be obtained from large-scale comparative benchmarking, from successful blind predictions, and from community-wide assessment experiments (CASP [109, 229]/CAFASP [283]). These are addressed in turn in the following:

Benchmark set tests

Structure prediction methods have to be validated using comprehensive and representative benchmark sets. A couple of benchmark sets have been proposed [188, 199, 270] consisting of a fold library and a list of structurally similar protein pairs without significant sequence similarity. Such a benchmark [335] involves using the method in question to thread one protein of each pair against the library and to count the number of successful recognitions of the structural match or the number of correct identifications of related protein pairs. A couple of methods have been evaluated this way.

More convincing are benchmarks that are representative of the currently known folds. Given a structural classification of all the structures in the PDB (e.g. SCOP) and a list of proteins without high sequence similarity, say less than 40% pairwise identity (e.g. PDB40 [336], which can be obtained from, e.g., the ASTRAL server [271] or from a couple of other sources for such representative lists [337]), one can easily determine a complete list of structural similarities from the structural classification: a pair of proteins from PDB40 is considered structurally related if both belong to the same structural class (depending on the level, the same superfamily or the same fold). Other lists of structurally related pairs (via structural alignment of the 3D structures) are available (e.g. from the DALI/FSSP database [338]). A comprehensive benchmark then involves threading each sequence of the PDB40 against all other members [20, 272, 273]. Two benchmarking protocols can be defined:

1. (Fold recognition) For each threading run, identify the best fitting fold (usually the one with the best score, sometimes after applying filtering or reranking of alignments or models in a post-processing step), predict it as the closest structural homologue in the fold library, and use it as the structure template for model building. Of interest in this protocol is the number or percentage of correctly predicted structures.

2. (Related pair recognition) For the all-to-all threadings and the respective scores, define a score threshold above which threadings are considered predictions of structural similarity. As only one score threshold is applied to the individual threading runs, the scores need to be normalized with respect to the length and other features of the query protein in order to render the scores comparable to each other. Computing p-values from the scores is a method to achieve comparability among experiments and to assign a significance measure to each prediction [273].

This leads to a list of predicted pairs with associated p-values, which can be compared to the list of standard-of-truth pairs. In the simplest case this leads to the number of correctly predicted pairs (TP), of incorrectly predicted pairs (FP), and the number of missed correct pairs (FN). The ranked list of scored pairs contains much more information, however, and gives a complete overview of the performance of the method for all possible threshold values of the p-values. This is represented as a so-called sensitivity/specificity plot. Given a p-value threshold, the following values of sensitivity Sens and specificity Spec can easily be read off the ranking list:
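The two protocols can be sketched in a few lines. The example assumes that a higher threading score means a better fit and that structural relatedness is given as a fold label per protein; the function names and the toy scores are hypothetical.

```python
def fold_recognition_accuracy(scores, fold_of):
    """Protocol 1: fraction of queries whose best-scoring library hit shares their fold.

    scores: {query: {library_protein: score}}, higher score assumed to be better.
    """
    correct = 0
    for query, hits in scores.items():
        best = max(hits, key=hits.get)
        if fold_of[best] == fold_of[query]:
            correct += 1
    return correct / len(scores)

def pair_counts(predicted_pairs, true_pairs):
    """Protocol 2 at one fixed threshold: (TP, FP, FN) over unordered protein pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    tp = len(predicted & truth)
    return tp, len(predicted - truth), len(truth - predicted)

# Toy benchmark (invented): four proteins in two folds.
fold_of = {"a": "F1", "b": "F1", "c": "F2", "d": "F2"}
scores = {
    "a": {"b": 9.0, "c": 3.0, "d": 1.0},
    "b": {"a": 8.0, "c": 2.0, "d": 4.0},
    "c": {"a": 5.0, "b": 1.0, "d": 4.0},  # best hit "a" has the wrong fold
    "d": {"a": 1.0, "b": 2.0, "c": 7.0},
}
accuracy = fold_recognition_accuracy(scores, fold_of)
tp, fp, fn = pair_counts([("a", "b"), ("a", "c")], [("b", "a"), ("c", "d")])
```

Pairs are stored as unordered sets so that a prediction (a, b) matches the reference pair (b, a).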

Sens = TP/(TP + FN) and Spec = TP/(TP + FP), where TP = true positives, FN = false negatives, and FP = false positives. The predicted pairs (positives) change according to the p-value threshold, whereas the negatives remain constant. The sensitivity is the fraction of correct pairs above the threshold as compared to all correct pairs, and the specificity (1 - error rate) is the fraction of correct pairs above the threshold as compared to all pairs above the threshold. The set of all possible (Sens, Spec) pairs for all possible thresholds can be obtained from the ranking list by iteratively decreasing the threshold from the maximum achieved p-value, to the next highest p-value, down to the lowest p-value observed. For each consecutive threshold value the (Sens, Spec) values are updated according to whether the new pair above the threshold is a true (TP) or false (FP) structural pair. Example plots for evaluating the fold recognition performance of 123D with five different significance criteria are shown in Figure 6.8. From such a plot, the expected performance of a certain method using certain parameter settings can be read off for any error rate the user considers acceptable.
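The iterative update of (Sens, Spec) along the ranking list can be sketched as follows. The sketch assumes that pairs are visited from most to least significant, implemented here as ascending p-value; if the ranking criterion is one where larger values are more significant, the sort order has to be reversed. Spec is computed exactly as defined above, TP/(TP + FP), a quantity that is often called precision elsewhere. The function name and the toy list are hypothetical.

```python
def sens_spec_curve(scored_pairs, n_true_total):
    """Walk the ranked list once and record (threshold, Sens, Spec) after each pair.

    scored_pairs: list of (p_value, is_true_pair)
    n_true_total: number of standard-of-truth pairs, i.e. TP + FN
    """
    curve = []
    tp = fp = 0
    for p_value, is_true in sorted(scored_pairs, key=lambda pair: pair[0]):
        if is_true:
            tp += 1
        else:
            fp += 1
        sens = tp / n_true_total  # TP / (TP + FN)
        spec = tp / (tp + fp)     # TP / (TP + FP)
        curve.append((p_value, sens, spec))
    return curve

# Toy ranking list (invented): four true pairs exist, one of them is never predicted.
pairs = [(0.001, True), (0.01, True), (0.02, False), (0.05, True), (0.2, False)]
curve = sens_spec_curve(pairs, n_true_total=4)
```

Plotting the second against the third component of each tuple gives a sensitivity/specificity plot of the kind described in the text; pairs with tied p-values each contribute their own point in this simple version.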

Blind prediction experiments (CASP)

A blind prediction is a computer-based construction of 3D models of a protein sequence for which no structure is known at the time. CASP (critical assessment of structure prediction methods [109, 274-276]) is a worldwide contest of protein structure prediction that takes place every two years. During the CASP experiments, a set of automated numerical evaluation tools has been implemented [110, 111, 277, 278] to cope with the large number of predictions in a way that is as objective as possible. In fact, the experiment is also devoted to the research and development of such unbiased methods. However, there is still considerable controversy about the criteria for judging protein structure predictions and the corresponding models [339].

For the CASP2 experiment, Marchler-Bauer and Bryant defined a large number of quantitative criteria [110] to be used by the assessor to judge and rank the models and prediction teams. These criteria include several measures for fold recognition specificity, for threading alignment specificity, threading contact specificity, and model accuracy.

For an evaluation of the CASP3 predictions, Sippl used the ProSup rigid sequence-independent 3D superposition software [111, 279]. The basic assumption is that the best prediction corresponds to the best structural match between the predicted model and the experimental structure, but that the best structural match cannot always be determined automatically without ambiguity. Therefore, all possible structural matches between model and structure are considered and ambiguities are resolved in favor of the predictor to determine the closest correspondence between the prediction and structural matches. Ranking a set of structures according to their similarity to a reference fold or model defines a totally ordered similarity measure. Unfortunately, structural superpositions are characterized by two conflicting quantities, the number of equivalenced residues and the best possible rms achieved by optimally superposing the equivalent residues, each of which can be optimized individually at the expense of the other. A solution proposed by Sippl for CASP3 follows a strategy similar to that used in the protein superposition tool SARF [103, 280], i.e. considering only superpositions with an rms smaller than a constant value (say 3.0 Å) and maximizing the number of residues within that rms range. Usually, this leads to an easily interpreted measure of structural similarity and corresponding rankings of structures [111]. In order to evaluate a prediction, the maximum number of equivalent residues is determined for all possible structural matches between model and structure. The conclusion of the numerical evaluation is that it is difficult to clearly discriminate between appropriate and inappropriate structural matches, as the ranked list of PDB structures matching a target structure often shows a continuum of matched residues. In addition, different criteria, such as equivalent residues according to the best structural match or according to the match producing the maximum number of equivalences, or the minimum shift error of aligned residues, lead to a general trend of good and bad models, but in most cases also to debatable decisions on which models are best or better than others. The comparison of the results with structural matches obtained from other methods, e.g. the Dali structural alignment, where a z-score is computed from the two quantities and used for ranking, leads to quite different results [111].

Sensitivity-specificity plots. The figure shows example sensitivity/specificity plots for several fold recognition criteria (raw threading score, z-score, p-value, gap-score, gap-z-score) for a benchmark experiment of 123D* threading on a PDB40 domain library using sequence profiles generated for the query sequence via PSI-BLAST. Such plots allow an immediate overview of the recognition performance (sensitivity, x-axis) over the whole range of error rates (specificity, y-axis). In this case, one can achieve a maximum sensitivity of about 78% (at any error rate), more than 60% sensitivity for a specificity of more than 95% for all but one criterion, for which about 8% error (Spec = 0.92) has to be tolerated in order to recognize more than 60% of the folds.

Therefore, an alternative approach by Hubbard [277] avoids reducing a prediction to a single number and instead produces a plot of rms vs coverage for a number of predictions. The plot is constructed such that for any prediction there is a set of points (drawn as a line) representing, for each coverage value, the minimum rms out of a sampling of superpositions. Such plots allow for a comparative general overview of all predictions: one can easily identify outstanding predictions for a given structural similarity (i.e. rms) or coverage (i.e. a given number or percentage of aligned residues) and, in addition, assess predictions over the whole range of structural similarities or coverages. The plots for the CASP3 targets and the submitted predictions are available at the CASP Web site [340].
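The relation between rms and coverage can be illustrated with a strongly simplified sketch. It assumes a single, fixed superposition for which the distance between each pair of equivalenced residues is already known; the method described above samples many superpositions and takes the minimum rms per coverage value, which this sketch does not attempt. For one fixed superposition, the minimum rms at coverage n is the rms over the n closest residue pairs. The function names and the toy distances are hypothetical.

```python
from math import sqrt

def rms_coverage_curve(distances):
    """For coverage n = 1..N, the rms over the n closest residue pairs
    of one fixed superposition (distances in Angstrom)."""
    curve = []
    sum_sq = 0.0
    for n, d in enumerate(sorted(distances), start=1):
        sum_sq += d * d
        curve.append((n, sqrt(sum_sq / n)))
    return curve

def max_coverage_within(curve, rms_cutoff=3.0):
    """Largest number of residues whose rms stays within the cutoff
    (the fixed-rms criterion described earlier for the CASP3 evaluation)."""
    best = 0
    for n, rms in curve:
        if rms <= rms_cutoff:
            best = n
    return best

# Toy per-residue distances (invented): three well-placed residues, one outlier.
distances = [1.0, 10.0, 2.0, 1.0]
curve = rms_coverage_curve(distances)
n_equivalent = max_coverage_within(curve, rms_cutoff=3.0)
```

Drawing one such curve per prediction on common axes gives the comparative overview described above, and reading the curve at a fixed rms value reduces it to a single number of equivalent residues.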

Another approach used for CASP3 is sequence-dependent superposition, in contrast to the two approaches mentioned above, which are sequence independent and rely on the best structural superposition of the 3D model without reference to the natural 1:1 alignment between model and template. In difficult cases, sequence-dependent superposition iteratively excludes the residues that are modeled less accurately from the computation [278]. This procedure does not necessarily lead to unambiguous solutions. The method allows defining and displaying another set of evaluation criteria for fold recognition and individual alignment accuracy, such as longest continuous segments and largest superimposable sets of residues. These criteria can also be restricted to certain subsets of the structure, such as secondary structure elements, buried or exposed residues, etc. Results of these measures for the CASP predictions are also available from the CASP Web site.
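One of the criteria named above, the longest continuous segment, can be illustrated with a small sketch. It assumes that a superposition has already been computed and that the per-residue deviation between model and experimental structure is available as a list. The cutoff value, the function name, and the rule that residues outside a chosen subset interrupt a segment are assumptions made for illustration, not the definitions used in the CASP assessment.

```python
from typing import Optional, Sequence, Set, Tuple


def longest_continuous_segment(deviations: Sequence[float],
                               cutoff: float,
                               allowed: Optional[Set[int]] = None) -> Tuple[int, int]:
    """Find the longest run of consecutive residues within a deviation cutoff.

    deviations -- per-residue distance (in angstroms) after superposition
    cutoff     -- maximum deviation for a residue to count as well modeled
    allowed    -- optional set of residue indices (e.g. helix residues or
                  buried residues); residues outside it break a segment

    Returns (start index, length) of the first longest segment found.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, dev in enumerate(deviations):
        ok = dev <= cutoff and (allowed is None or i in allowed)
        if ok:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


if __name__ == "__main__":
    devs = [0.5, 1.0, 4.2, 0.8, 1.1, 1.9, 3.5, 0.7]
    print(longest_continuous_segment(devs, cutoff=2.0))
    print(longest_continuous_segment(devs, cutoff=2.0, allowed={0, 1, 3, 4}))
```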

Because the various methods for numerical evaluation of protein structure prediction give debatable outcomes, the final assessment of models based on the quantitative measures is something of a creative task, similar to the inherent ambiguities encountered in structural classifications [196]. This is especially true for the range of sequence and structure similarities encountered for CASP fold recognition targets, where similarities and partial similarities are hard to identify even for the best human experts in the field. Indeed, a slightly more general definition of the SCOP fold similarity has been used for the CASP3 assessment [196]. The overall result of the structure prediction experiment relevant for this chapter (fold recognition), which can be considered an approximation to the overall state of the art in the field, is as follows (according to Table I in [196]): 22 of the 26 target domains show substantial similarities to known structures, three proteins are classified as new folds, and one domain has not been classified. Eleven domains are likely to be distantly homologous to known structures. Nine of the 11 homologous relationships have been detected, four of them with accompanying overall correct alignments and models. The other 11 domains with known folds are predicted as follows: five folds identified, one correct topology predicted, two partial structures correctly predicted, and only three folds not identified by any method. Two of the three new folds have been predicted correctly as new. Overall, 19 of the target domains have been solved by one or more of the prediction teams, only six could not be solved, and one domain remains unclassified.
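The totals quoted above can be cross-checked with a few lines of arithmetic. The grouping of the individual counts into "solved" and "unsolved" below is an interpretation of the text rather than a breakdown taken from [196].

```python
# Counts as quoted in the text; the grouping is an interpretation.
solved = {
    "distant homologies detected": 9,
    "known folds identified": 5,
    "correct topology predicted": 1,
    "partial structure predicted": 2,
    "new folds predicted as new": 2,
}
unsolved = {
    "distant homologies missed": 11 - 9,
    "known folds not identified": 3,
    "new fold not predicted as new": 3 - 2,
}
unclassified = 1

n_solved = sum(solved.values())
n_unsolved = sum(unsolved.values())
n_total = n_solved + n_unsolved + unclassified

if __name__ == "__main__":
    print(f"solved {n_solved}, unsolved {n_unsolved}, "
          f"unclassified {unclassified}, total {n_total}")
```

Under this grouping the individual counts add up to the 19 solved and six unsolved domains, and together with the unclassified domain to the 26 target domains.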

In accordance with the focus of this chapter on current protein structure prediction methods, and despite some considerable improvements of ab initio methods as compared to CASP2 (e.g. the fragment assembly approach ROSETTA [37], sometimes called the "mini-threading approach", and an approach based exclusively on global optimization of a potential energy function [35]), CASP3 has shown no emphasis on "physical methods", i.e. methods concentrating on the balance of forces determining structure or folding pathways leading to native structures [109]. It is also apparent that homology-based methods currently have much more predictive power, and this is likely to remain the case for some time to come.

Conclusion: strengths and limitations


Given a query sequence, threading makes it possible to search for sequences with a similar fold but without apparent sequence similarity. Threading therefore has three major objectives: first, to provide orthogonal evidence of possible homology for distantly related protein sequences; second, to detect possible homology in cases where sequence methods fail; and third, to improve structural models for the query sequence via structurally more accurate alignments.

Continuous improvements towards all three goals have been achieved during the four rounds of the CASP experiment held to date. The CASP experiments are the main indicator of progress in protein structure prediction. Although evaluating progress is not as easy as first expected, detectable progress in fold recognition can be observed from CASP1 to CASP2, whereas there was no significant progress between CASP2 and CASP3, although the same performance was achieved on more difficult targets. In CASP4, again, manually applied knowledge-based approaches to distant fold recognition outperformed automated methods and servers. Improvements have been made especially in the prediction of new folds, where both manual prediction (A. Murzin) and the ab initio/fragment assembly approach [5] performed well on a couple of targets with novel folds. The latter approach also performed well on average for fold recognition and comparative modeling targets [281]. But there appears to be a certain limit to current fold recognition methods, which is still well below the limit of detectable structural similarity (via structural comparisons). In addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab initio target fragments, and in CASP4 Baker demonstrated good average performance of the fragment assembly approach.

The CASP3 results provide an estimate of the current state of the art. Out of 43 protein targets, 15 could be classified as comparative homology modeling targets, i.e. related folds and accompanying alignments could be derived beyond doubt. For more than half of the 21 more difficult cases, a reasonable model could be predicted by at least one of the participating prediction teams. In addition, the CAFASP subsection of the assessment has demonstrated that 10 out of 19 folds could be solved via completely automatic application of the best threading methods. The complete set of results obtained during the CASP experiments, including the recent CASP4 assessment (summer/winter 2000), can be viewed and analyzed through the CASP Web page at the prediction center [341].

Methods for refining rough structural models towards the true native structure of the query protein are not straightforward, but are an active area of research [282].


The strength of structural predictions is that structural and, in some cases, functional classifications and annotations can be made where such an annotation is not possible at all with pure sequence-based methods. In general, if there is detectable sequence similarity, the corresponding annotation should overrule contradicting structural predictions, even those of high confidence. In cases with insignificant sequence similarity, a significant threading alignment can provide additional evidence for an annotation. In addition to extending the range of remote homology searches, a knowledge-based structure prediction always results in a structural model, which can be inspected and evaluated by other means. In particular, such models can be completed to full-atom models accessible to energy minimization and theoretical model validation procedures. The model can also trigger validation and cross-checking against additional experimental facts and in turn can indicate further experiments, either to validate the model or to check for putative functions or interactions with other partners.

In addition, experimental methods are becoming available for checking predicted models in a cheap, fast and efficient manner. One such method uses chemical cross-links of a certain length and specificity and allows validation of the distance constraints between residues and groups implied by a model. The method uses a protocol involving chemical cross-linking, digestion by specific hydrolases, measurement of peptide masses via (MALDI-TOF) mass spectrometry, inference of distance constraints for the protein under investigation, and checking of the experimental distances against the proposed model [32]. This appears to be a very promising approach to boost the significance and thus the practical application of theoretical structure prediction approaches, especially for proteins that are not easily amenable to experimental structure determination, i.e. proteins that are difficult to crystallize, membrane proteins, or highly flexible proteins.
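The last step of this protocol, checking the experimental distances against the proposed model, can be sketched as follows. The sketch assumes that the model is reduced to C-alpha coordinates and that each cross-link translates into an upper bound on the distance between two residues. The data layout, the function name and the example bound of 12 angstroms are illustrative assumptions and are not taken from [32].

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def check_crosslinks(ca_coords: Dict[int, Coord],
                     crosslinks: List[Tuple[int, int, float]]):
    """Compare cross-link distance bounds with distances in a model.

    ca_coords  -- residue number -> C-alpha coordinates (angstroms)
    crosslinks -- (residue i, residue j, maximum distance) per observed link

    Returns a per-link report and the fraction of satisfied constraints.
    """
    report = []
    for res_i, res_j, max_dist in crosslinks:
        dist = math.dist(ca_coords[res_i], ca_coords[res_j])
        report.append((res_i, res_j, dist, dist <= max_dist))
    n_ok = sum(1 for entry in report if entry[3])
    fraction = n_ok / len(report) if report else 0.0
    return report, fraction


if __name__ == "__main__":
    model = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (0.0, 0.0, 30.0)}
    links = [(1, 2, 12.0), (1, 3, 12.0)]
    report, fraction = check_crosslinks(model, links)
    for res_i, res_j, dist, ok in report:
        print(f"{res_i}-{res_j}: {dist:.1f} A  {'ok' if ok else 'violated'}")
    print(f"fraction satisfied: {fraction:.2f}")
```

A model that violates many constraints would be a candidate for rejection, while competing models could be ranked by the fraction of constraints they satisfy.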



Still, structure prediction remains an unsolved problem, and the methods available are rough and heuristic. The outcome of predictions is not easily interpreted by the non-specialist and can easily be over-interpreted by biologists if taken too seriously. The computation of suitable p-values to allow for a straightforward judgment of the respective prediction remains problematic for many methods and parameter settings. In addition, models derived from similarities at the fold level are often simply not as structurally similar as one could hope for, i.e. not very usable for detailed structural analysis. Variations and differences outside the conserved structural cores have to be expected frequently, even for highly significant structural predictions.

This is one reason why in general reliable function predictions cannot be inferred from structural models and mere significance of structural similarity at the level of folds. The current state-of-the-art of understanding structure-function relationships requires additional evidence to make such predictions. This, however, could be available via additional experimental measurements (e.g. expression data) or knowledge on biochemical and protein interaction networks.

Protein interaction/fusion:

1. Recently, the Rosetta stone method has been introduced. This method uses over 20 completely sequenced genomes and analyzes evolutionary correlations of two domains being fused into one protein in one species and occurring in separate proteins in another species. From these classifications the method establishes pairwise links between functionally related proteins [94] and elicits putative protein-protein interactions.

2. The phylogenetic profile method analyzes the co-occurrence of genes in the genomes of different organisms, for the same purpose [96].
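The idea behind the phylogenetic profile method can be made concrete with a minimal sketch. It assumes that, for each gene family, the set of genomes containing a member is already known (in practice this would come from homology searches), and it links families whose presence/absence patterns agree. The names and the mismatch tolerance are illustrative assumptions, not the procedure of [96].

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def phylogenetic_profiles(presence: Dict[str, Set[str]],
                          genomes: Sequence[str]) -> Dict[str, Tuple[bool, ...]]:
    """Turn 'gene -> genomes containing it' into presence/absence vectors."""
    return {gene: tuple(g in found for g in genomes)
            for gene, found in presence.items()}


def linked_pairs(profiles: Dict[str, Tuple[bool, ...]],
                 max_mismatches: int = 0) -> List[Tuple[str, str]]:
    """Link genes whose profiles differ in at most max_mismatches genomes."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        mismatches = sum(x != y for x, y in zip(profiles[a], profiles[b]))
        # Genes present in every genome or in none carry no signal.
        informative = any(profiles[a]) and not all(profiles[a])
        if informative and mismatches <= max_mismatches:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    genomes = ["g1", "g2", "g3", "g4"]
    presence = {
        "A": {"g1", "g3"},
        "B": {"g1", "g3"},
        "C": {"g2"},
        "D": {"g1", "g2", "g3", "g4"},
    }
    profiles = phylogenetic_profiles(presence, genomes)
    print(linked_pairs(profiles))
```

Genes A and B in the toy data occur in exactly the same genomes and are therefore linked as candidates for a functional relationship; such a link is a hypothesis to be tested, not evidence of a physical interaction.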

One major reason for the currently unsatisfactory performance of knowledge-based prediction methods is the obvious deficiency of the empirical scoring functions in use. As has been demonstrated, there are many ways to improve the capabilities of the potentials by calibrating parameters or weightings of the scoring system [149]. This hints at clear deficiencies of the available potentials. As more data become available, the database-derived potentials can be expected to become more accurate.

The other major disadvantage today is that prediction methods do not efficiently exploit the available heterogeneous information on function, motifs, family information, network topology, and sequence and structure space topology during the prediction process. This is clearly demonstrated by the fact that the prediction performance and alignment accuracy of human experts armed with the above-mentioned prediction methods have been much better than those of automated versions of the methods alone [109, 283].

As briefly mentioned above, the sensitivity of current methods can be expected to be quite low for the recognition of related structural pairs, whereas a quite satisfactory performance of about 50% can be estimated for finding a suitable structural template. A rather high error rate of around 50% has to be tolerated to achieve these sensitivities. This seems acceptable if the correctly identified models are new targets and of sufficient interest for further investigation to compensate for the false hints. If appropriate experimental assays are available, probably even a hit rate of 1 in 10 could be acceptable in target finding applications. As mentioned above, together with additional theoretical evaluation and evidence, and appropriate experimental techniques and validation, the success rate can be expected to be much higher.
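What these rates mean for experimental follow-up can be shown with a short calculation. The candidate list size of 40 is a made-up figure for illustration; only the hit rates of about 50% and 1 in 10 come from the text.

```python
from typing import Tuple


def triage(n_candidates: int, hit_rate: float) -> Tuple[float, float, float]:
    """Expected outcome of testing every predicted candidate experimentally.

    Returns (expected correct predictions, expected false hints,
    expected number of assays per correct prediction).
    """
    expected_true = n_candidates * hit_rate
    expected_false = n_candidates - expected_true
    assays_per_true = 1.0 / hit_rate
    return expected_true, expected_false, assays_per_true


if __name__ == "__main__":
    for rate in (0.5, 0.1):
        true_hits, false_hints, assays = triage(40, rate)
        print(f"hit rate {rate:.1f}: {true_hits:.0f} correct, "
              f"{false_hints:.0f} false, {assays:.0f} assays per correct target")
```

Whether ten assays per confirmed target is acceptable depends on the cost of the assay relative to the value of a new target, which is the trade-off the paragraph above describes.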


Many of the resources, programs, databases, and servers discussed in this Chapter and relevant for protein structure prediction are freely available over the internet. The field, and thus the available resources, change quickly according to the progress made on collected experimental data and new methods. The Tables in the Appendix provide a current overview.

Table 6.1 contains a selection of protein structure resources and databases (coordinates, models, classifications, representative and benchmark sets) available on the WWW with the major protein structure archives and structure coordinate resources.

Table 6.2 lists a selection of tools for structure analysis, superposition, classification, and visualization.

Table 6.3 contains a list of structure prediction servers and programs. See also the CAFASP home page for further information and for results achieved for CASP4 targets and structures newly submitted to the PDB (check LiveBench [342] and EVA [343] for current updates and evaluations).


This work has been supported by the BMBF project TargId under contract no 0311615 and DFG-funded projects ZI 616/1-1 and ZI 616/2-1.

References

1 Branden, C. and J. Tooze, Introduction to Protein Structure. 1991: Garland Publishing Inc.

2 Creighton, T. E., Proteins: Structures and Molecular Properties. 1983: W. H. Freeman.

3 Anfinsen, C. B., Principles that govern the folding of protein chains. Science, 1973. 181(4096): p. 223-30.

4 Levinthal, C., Are there pathways for protein folding? J Chim Phys, 1968. 65: p. 44-45.

5 Simons, K. T., C. Strauss, and D. Baker, Prospects for ab initio protein structural genomics. J Mol Biol, 2001. 306(5): p. 1191-9.

6 Gotoh, O., An improved algorithm for matching biological sequences. J Mol Biol, 1982. 162(3): p. 705-8.

7 Needleman, S. B. and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.

8 Smith, T. F. and M. S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.

9 Altschul, S. F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402.

10 Altschul, S. F. and E. V. Koonin, Iterated profile searches with PSI-BLAST - a tool for discovery in protein databases. Trends Biochem Sci, 1998. 23(11): p. 444-7.

11 Altschul, S. F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.

12 Schaffer, A. A., et al., IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 1999. 15(12): p. 1000-11.

13 Zhang, Z., et al., Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res, 1998. 26(17): p. 3986-90.

14 Pearson, W. R. and D. J. Lipman, Improved tools for biological sequence comparison. Proc Natl Acad Sci USA, 1988. 85(8): p. 2444-8.

15 Pearson, W. R., Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol, 2000. 132: p. 185-219.

16 Pearson, W. R., Empirical statistical estimates for sequence similarity searches. J Mol Biol, 1998. 276(1): p. 71-84.

17 Pearson, W. R., Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics, 1991. 11(3): p. 635-50.

18 Pearson, W. R., Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol, 1990. 183: p. 63-98.

19 Barrett, C., R. Hughey, and K. Karplus, Scoring hidden Markov models. Comput Appl Biosci, 1997. 13(2): p. 191-9.

20 Park, J., et al., Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol, 1998. 284(4): p. 1201-10.

21 Karchin, R. and R. Hughey, Weighting hidden Markov models for maximum discrimination. Bioinformatics, 1998. 14(9): p. 772-82.

22 Bateman, A., et al., Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res, 1999. 27(1): p. 260-2.

23 McClure, M. A., C. Smith, and P. Elton, Parameterization studies for the SAM and HMMER methods of hidden Markov model generation. Ismb, 1996. 4: p. 155-64.

24 Sonnhammer, E. L., et al., Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res, 1998. 26(1): p. 320-2.

25 Eddy, S. R., Profile hidden Markov models. Bioinformatics,

26 Eddy, S. R., Hidden Markov models. Curr Opin Struct Biol,

27 Eddy, S. R., Multiple alignment using hidden Markov models. Ismb, 1995. 3: p. 114-20.

28 Eddy, S. R., G. Mitchison, and R. Durbin, Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol, 1995. 2(1): p. 9-23.

29 Brocchieri, L. and S. Karlin, A symmetric-iterated multiple alignment of protein sequences. J Mol Biol, 1998. 276(1):

30 Sippl, M. J., Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol, 1990. 213(4): p. 859-83.

31 Henikoff, S., J. G. Henikoff, and S. Pietrokovski, Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics,

32 Hoffmann, D. S., V. Wefing, Stephan, Albrecht, M., Hanisch, D., Zimmer, R., A new method for the fast solution of Protein-3D-Structures, combining experiments and bioinformatics. in Caesarium 2000. 2001. Bonn: Springer Verlag.

33 Hanisch, D., R. Zimmer and T. Lengauer, ProML - the Protein Markup Language for specification of protein sequences, structures and families, Proceedings GCB2001, to appear.

34 Murzin, A. G. and A. Bateman, Distant homology recognition using structural classification of proteins. Proteins,

35 Lee, J., et al., Calculation of protein conformation by global optimization of a potential energy function. Proteins, 1999. Suppl(3): p. 204-8.

36 Simons, K. T., et al., Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol, 1997. 268(1): p. 209-25.

37 Simons, K. T., et al., Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins, 1999. 37(S3): p. 171-176.

38 Lathrop, R. H., The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng, 1994. 7(9): p. 1059-68.

39 Chothia, C., Principles that determine the structure of proteins. Annu Rev Biochem, 1984. 53: p. 537-72.

40 Richardson, J. S., The anatomy and taxonomy of protein structure. Adv Protein Chem, 1981. 34: p. 167-339.

41 Pauling, L. C. and R. B. Corey, Configurations of polypeptide chains with favored orientations around single bonds: two new pleated sheets. PNAS, 1951. 37: p. 729-740.

42 Pauling, L. C., R. B. Corey and H. R. Branson, The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. PNAS, 1951. 37: p. 205-211.

43 Lo Conte, L., et al., SCOP: a structural classification of proteins database. Nucleic Acids Res, 2000. 28(1): p. 257-9.

44 Bray, J. E., et al., The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Eng, 2000. 13(3): p. 153-65.

45 Orengo, C. A., et al., CATH - a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.

46 Chou, P. Y. and G. D. Fasman, Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry, 1974. 13(2): p. 211-22.

47 Lim, V. I., Structural principles of the globular organization of protein chains. A stereochemical theory of globular protein secondary structure. J Mol Biol, 1974. 88(4): p. 857-72.

48 Garnier, J., D. J. Osguthorpe, and B. Robson, Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol, 1978. 120(1): p. 97-120.

49 Shoemaker, B. A., J. J. Portman, and P. G. Wolynes, Speeding molecular recognition by using the folding funnel: the fly-casting mechanism. Proc Natl Acad Sci USA, 2000. 97(16): p. 8868-73.

50 Hardin, C., et al., Associative memory hamiltonians for structure prediction without homology: alpha-helical proteins. Proc Natl Acad Sci USA, 2000. 97(26): p. 14235-40.

51 Socci, N. D., J. N. Onuchic, and P. G. Wolynes, Protein folding mechanisms and the multidimensional folding funnel. Proteins, 1998. 32(2): p. 136-58.

52 Onuchic, J. N., Z. Luthey-Schulten, and P. G. Wolynes, Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem, 1997. 48: p. 545-600.

53 Wolynes, P. G., Folding funnels and energy landscapes of larger proteins within the capillarity approximation. Proc Natl Acad Sci USA, 1997. 94(12): p. 6170-5.

54 Wolynes, P., Z. Luthey-Schulten, and J. Onuchic, Fast-folding experiments and the topography of protein folding energy landscapes. Chem Biol, 1996. 3(6): p. 425-32.

55 Onuchic, J. N., et al., Protein folding funnels: the nature of the transition state ensemble. Fold Des, 1996. 1(6): p. 441-50.

56 Onuchic, J. N., et al., Toward an outline of the topography of a realistic protein-folding funnel. Proc Natl Acad Sci USA, 1995. 92(8): p. 3626-30.

57 Dinner, A. R., et al., Understanding protein folding via free-energy surfaces from theory and experiment. Trends Biochem Sci, 2000. 25(7): p. 331-9.

58 Lazaridis, T. and M. Karplus, Effective energy functions for protein structure prediction. Curr Opin Struct Biol, 2000. 10(2): p. 139-45.

59 Dinner, A. R., et al., Factors that affect the folding ability of proteins. Proteins, 1999. 35(1): p. 34-40.

60 Chan, H. S. and K. A. Dill, The protein folding problem. Physics Today, 1993. 46(2): p. 24-32.

61 Dill, K. A. and H. S. Chan, From Levinthal to pathways to funnels. Nat Struct Biol, 1997. 4(1): p. 10-9.

62 Chan, H. S. and K. A. Dill, Protein folding in the landscape perspective: chevron plots and non-Arrhenius kinetics. Proteins, 1998. 30(1): p. 2-33.

63 Dill, K. A., et al., Principles of protein folding - a perspective from simple exact models. Protein Sci, 1995. 4(4): p. 561-602.

64 Yue, K., et al., A test of lattice protein folding algorithms. Proc Natl Acad Sci USA, 1995. 92(1): p. 325-9.

65 Recktenwald, A., D. Schomburg, and R. D. Schmid, Protein engineering and design. Method and the industrial relevance. J Biotechnol, 1993. 28(1): p. 1-23.

66 Aehle, W., et al., Rational protein engineering and industrial application: structure prediction by homology and rational design of protein-variants with improved 'washing performance' - the alkaline protease from Bacillus alcalophilus. J Biotechnol, 1993. 28(1): p. 31-40.

67 Bornscheuer, U. T. and M. Pohl, Improved biocatalysts by directed evolution and rational protein design. Curr Opin Chem Biol, 2001. 5(2): p. 137-43.

68 Root, M. J., M. S. Kay, and P. S. Kim, Protein Design of an HIV-1 Entry Inhibitor. Science, 2001. 11: p. 11.

69 Guerois, R. and L. Serrano, Protein design based on folding models. Curr Opin Struct Biol, 2001. 11(1): p. 101-6.

70 Karlin, S. and S. F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA, 1990. 87(6): p. 2264-8.

71 Karlin, S. and S. F. Altschul, Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA, 1993. 90(12): p. 5873-7.

72 Rost, B., C. Sander, and R. Schneider, PHD - an automatic mail server for protein secondary structure prediction. Comput Appl Biosci, 1994. 10(1): p. 53-60.

73 Bateman, A., et al., The Pfam protein families database. Nucleic Acids Res, 2000. 28(1): p. 263-6.

74 Sander, C. and R. Schneider, Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 1991. 9(1): p. 56-68.

75 Corpet, F., et al., ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res, 2000. 28(1): p. 267-9.

76 Apweiler, R., et al., InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 2000. 16(12): p. 1145-50.

77 Holm, L. and C. Sander, Touring protein fold space with Dali/FSSP. Nucleic Acids Res, 1998. 26(1): p. 316-9.

78 Dietmann, S., et al., A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res, 2001. 29(1): p. 55-7.

79 Siddiqui, A. S., U. Dengler, and G. J. Barton, 3Dee: a database of protein structural domains. Bioinformatics, 2001. 17(2): p. 200-1.

80 Dengler, U., A. S. Siddiqui, and G. J. Barton, Protein structural domains: analysis of the 3Dee domains database. Proteins, 2001. 42(3): p. 332-44.

81 Murzin, A. G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.

82 Hubbard, T. J., et al., SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data. Acta Crystallogr D Biol Crystallogr, 1998. 54(1 (Pt 6)): p. 1147-54.

83 Attwood, T. K., et al., PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res, 2000. 28(1): p. 225-7.

84 Huang, J. Y. and D. L. Brutlag, The EMOTIF database. Nucleic Acids Res, 2001. 29(1): p. 202-4.

85 Kasuya, A. and
