Homology Modeling in Biology and Medicine

Roland L. Dunbrack, Jr. 5.1


The concept of homology modeling

To understand basic biological processes such as cell division, cellular communication, metabolism, and organismal development and function, knowledge of the three-dimensional structure of the active components is crucial. Proteins form the key players in all of these processes, and study of their diverse and elegant designs is a mainstay of modern biology. The Protein Databank (PDB) of experimentally determined protein structures [1, 2] now contains some 18 000 entries, which can be grouped into between 300 and 700 related families based either on structure or on sequence similarity [3-5]. The fact that proteins that share very little or no sequence similarity can have quite similar structures has led to the hypothesis that there are in fact on the order of 1000-7000 different families [6, 7] which have been adapted by a process of duplication, mutation, and natural selection to perform all the biological functions that proteins perform.

Since it was first recognized that proteins can share similar structures [8], computational methods have been developed to build models of proteins of unknown structure based on related proteins of known structure [9]. Most such modeling efforts, referred to as homology modeling or comparative modeling, follow a basic protocol laid out by Greer [10, 11]: 1) identify a template or parent structure related to the target sequence of unknown structure, and align the target sequence to the template sequence and structure; 2) for core secondary structures and all well-conserved parts of the alignment, borrow the backbone coordinates of the template according to the sequence alignment of the target and template; 3) for segments of the target sequence for which coordinates cannot be borrowed from the template because of insertions and deletions in the alignment (usually in loop regions of the protein), build these segments using some construction

Bioinformatics - From Genomes to Drugs. Volume I: Basic Technologies. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2

method based on our knowledge of the determinants of protein structure; 4) build side chains determined by the target sequence on to the backbone model built from the template structure and loop construction. In practice, identifying a structural homologue requires aligning the sequences, and so steps 1 and 2 are performed together. Step 2 is often a manual adjustment of the alignment, but automated sequence alignment methods different from those used to make the identification may be used. Also, Steps 3 and 4, backbone and side-chain modeling, may be coupled, since certain backbone conformations may be unable to accommodate the required side chains in any low-energy conformation. An alternative strategy has been developed by Blundell and colleagues, based on averaging a number of template structures, if these exist, rather than using a single structure [1214]. More complex procedures based on reconstructing structures (rather than perturbing a starting structure) by satisfying spatial restraints using distance geometry [15] or molecular dynamics and energy minimization [16-19] have also been developed.

Many methods have been proposed to perform each of the steps in the homology modeling process. There are also a number of research groups that have developed complete packages that take as input a sequence alignment or even just a sequence and develop a complete model. In this Chapter, we describe some of the basic ideas that drive loop and side-chain modeling individually and the programs publicly available that implement them. We also discuss some of the programs that perform all of the steps in modeling. The identification and alignment steps will be covered in the next Chapter of this volume.

How do homologous proteins arise?

By definition, homologous proteins arise by evolution from a common ancestor. But there are several different mechanisms in play, and these are illustrated in Figure 5.1. The first is random mutation of individual nu-cleotides that change protein sequence, including missense mutations (changing the identity of a single amino acid) as well as insertions and deletions of a number of nucleotides that result in insertion and deletion of amino acids. As a single species diverges into two species, a gene in the parent species will continue to exist in the divergent species and over time will gather mutations that change protein sequence. In this case, the genes in the different organisms will usually maintain the same function. These genes are referred to as orthologues of one another. A second mechanism is duplication of a gene or of a gene segment within a single organism or germ line cell. As time goes by, the two copies of the gene may begin to gather mutations. If the parent gene performed more than one function, for example similar catalytic activity on two different substrates, one of the

Orthologues vs. paralogues. Schematic of evolutionary process that gives rise to homologous proteins. Left: A single gene X in one species is retained as the species diverges into two separate species. The genes in these two species are orthologous. Right: A single gene X in one species is duplicated. As each gene gathers mutations, it may begin to perform new functions, or the two genes may specialize in carrying out two or more functions of the ancestral gene, thus improving the fitness of the organism. These genes in one species are paralogous. If the species diverges, each daughter species may maintain the duplicated genes, and therefore each species contains an orthologue and a paralogue to each gene in the other species.

Orthologues vs. paralogues. Schematic of evolutionary process that gives rise to homologous proteins. Left: A single gene X in one species is retained as the species diverges into two separate species. The genes in these two species are orthologous. Right: A single gene X in one species is duplicated. As each gene gathers mutations, it may begin to perform new functions, or the two genes may specialize in carrying out two or more functions of the ancestral gene, thus improving the fitness of the organism. These genes in one species are paralogous. If the species diverges, each daughter species may maintain the duplicated genes, and therefore each species contains an orthologue and a paralogue to each gene in the other species.

duplicated genes may gain specificity for one of the reactions, while the other gene gains specific activity for the other. If this divergence of specificity in the two proteins is advantageous, the duplication will become fixed in the population. These two genes are paralogues of one another. If the species with the pair of paralogues diverges into two species, each species will contain the two paralogues. Each gene in each species will now have an orthologue and a paralogue in the other species.

The purposes of homology modeling

Homology modeling of proteins has been of great value in interpreting the relationships of sequence, structure, and function. In particular, ortholo-gous proteins usually show a pattern of conserved residues that can be interpreted in terms of three-dimensional models of the proteins. Conserved residues often form a contiguous active site or interaction surface of the protein, even if they are distant from each other in the sequence. With a structural model, a multiple alignment of orthologous proteins can be interpreted in terms of the constraints of natural selection and the requirements for protein folding, stability, dynamics, and function.

For paralogous proteins, three-dimensional models can be used to interpret the similarities and differences in the sequences in terms of the related structure but different functions of the proteins concerned [20]. In many cases, there are significant insertions and deletions and amino acid changes in the active or binding site between paralogues. But by grouping a set of related proteins into individual families, orthologous within each group, the evolutionary process that changed the function of the ancestral sequences can be observed. Indeed, homology models can serve to help us identify which protein belongs to which functional group by the conservation of important residues in the active or binding site [21]. A number of recent papers have been published that use comparative modeling to predict or establish protein function [22-26].

Another important use of homology modeling is to interpret point mutations in protein sequences that arise either by natural processes or by experimental manipulation. Now that the human genome project has produced a rough draft of the complete human genome sequence, there are starting to be collected voluminous data concerning polymorphisms and other mutations related to differences in susceptibility, prognosis, and treatment of human disease. There are now many such examples, including the Factor V/Leiden R506Q mutation [27] that causes increased occurrence of thrombosis, mutations in cystathionine beta synthase that cause increased levels of homocysteine in the blood, a risk factor for heart disease [28], and BRCA1 for which many sequence differences are known, some of which may lead to breast cancer [29]. At the same time, there are many polymorphisms in important genes that have no discernible effect on those who carry them. At least for some of these, there may be some effect that has yet to be measured in a large enough population of patients, and the risk of cancer, heart disease, or other illness to these patients is unknown. This is yet another important application of homology modeling, since a good model may indicate readily which mutations pose a likely risk and which do not.

Homology models may also be used in computer-aided drug design, especially when a good template structure is available for the target sequence. For enzymes which maintain the same catalytic activity, the active site may be sufficiently conserved that a model of the protein provides a reasonable target for computer programs which can suggest the most likely compounds that will bind to the active site (see also Chapter 7 of this volume). This has been used successfully in the early development of HIV protease inhibitors [30, 31] and in the development of anti-malarial compounds that target the cysteine protease of P. falciparum [32].

The effect of the genome projects

The many genome projects now completed or underway have greatly affected the practice of homology modeling of protein structures. First, the many new sequences have provided a large number of targets for modeling. Second, the large amount of sequence data makes it easier to establish remote sequence relationships between proteins of unknown structure and those of known structure on which a model can be built. The most commonly used methods for establishing sequence relationships such as PSI-BLAST [33] are dependent on aligning many related sequences to compile a pattern or profile of sequence variation and conservation for a sequence family. This profile can be used to search among the sequences in the PDB for a relative of the target sequence. The more numerous and more varied sequences are in the family the more remote are the homologous relationships that can be determined, and the more likely it is that a homologue of known structure for a target sequence can be found. Third, it is likely that the accuracy of sequence alignments between the sequence of unknown structure that we are interested in (referred to as the target sequence) and the protein sequence of known structure used for model building (referred to as the template structure) are also greatly improved with profiles established from many family members of the target sequence [34]. Fourth, the completion of a number of microbial genomes has prodded a similar effort among structural biologists to determine the structures of representatives of all common protein sequence families, or all proteins in a prototypical genome, such as E. coli or yeast [35]. Protein structures determined by X-ray crystallography or NMR spectroscopy are being solved at a much faster pace than was possible even 10 years ago. The great increase in the number of solved protein structures has a great impact on the field of homology modeling, since it becomes ever more likely that there will be a template structure in the PDB for any target sequence of interest.

Given the current sequence and structure databases, it is of interest to determine what fraction of sequences might be modeled and the range of sequence identities between target sequences and sequences of known structure. In Figure 5.2, we show histograms of sequence identities of the sequences in several genomes and their nearest relatives of known structure in the Protein Data Bank. These relationships were determined with PSI-BLAST as described in the legend. PSI-BLAST is fairly sensitive in determining distant homology relationships [34, 36, 37]. The results indicate that on average 30-40% of genomic protein sequences are related to proteins of known structure, which presents a large number of potential targets for homology modeling. But it should also be pointed out that the average sequence identity between target sequences and template structures in the PDB is less than 25%. It is likely that as sequence comparison methods improve, our ability to identify increasingly remote homologies will in-


60 80



20 15 10 5


20 40

80 100

Percent identity

Distribution of sequence identities between proteins in four genomes and their closest homologues in the Protein Databank for those sequences in genomes with homologues in the PDB. PSI-BLASTwas used to search the nonredundant protein sequence database with a representative set of PDB sequences as queries. The program was run for four iterations, with a maximum expectation value of 0.0001 (see Chapter 2 for an explanation of this value) used to determine sequences which are included in the position-specific similarity matrix. After four iterations, each matrix was used to search each of the four genomes. Coiled-coil and low-sequence complexity sequences were removed from each genome and the non-redundant sequence database. All hits in the genomes with Evalues less than 0.001 were saved, and the histograms were built from the PSI-BLAST derived sequence identities.

crease, and we will be able to determine relationships at even lower sequence identity with confidence [38].

The low sequence identity between target and template sequences in Figure 5.2 presents a major challenge for homology modeling practitioners, since a major determinant in the accuracy of homology modeling is the sequence identity between the target sequence and the sequence of the template structure. At levels below 30% sequence identity, related protein structures diverge significantly and there may be many insertions and deletions in the sequence [39]. At 20% sequence identity, the average RMSD of core backbone atoms is 2.4 A [39]. But as demonstrated in Figure 5.2, it is likely that we will most often face a situation where the target and template sequences are remotely related. Most widely used homology modeling methods have been predicated on much higher sequence identities between template and target, usually well above 30% [40-42]. What methods should be used at sequence identities in the 10-30% range is of crucial importance in this so-called post-genomic era.

Input data

To produce a protein model that will be useful and informative requires more than placing a new sequence onto an existing structure. A large amount of sequence data and other kinds of experimental data can often be gathered on the target sequence and on its homologue of known structure to be used for model building. This information can be used to build a better model and as the data to be interpreted in light of the model. The goal is to forge an integrated model of the protein sequence, structure, and function, not merely to build a structure. In Table 5.1, we list the kinds of information which might be available for a target protein and how these data might be processed. With the large amount of sequence information available, it is almost always possible to produce a multiple alignment of

Input information for homology model building Target sequence

. Target orthologous relatives (from PSI-BLAST) . Target paralogous relatives (from PSI-BLAST)

. Multiple sequence alignment of orthologues and paralogues (either BLAST multiple alignment or (preferably) other multiple alignment program) . Sequence profile of ortho/paralogues Template sequence

. Homologue(s) of known structure (template(s)) determined by database search methods (BLAST, PSI-BLAST, intermediate-sequence-search methods, HMM's, fold recognition methods) . Template orthologous sequences . Template paralogous sequences

. Multiple sequence alignment of parent orthologues and paralogues Alignment of target sequence to template sequence and structure . Pairwise alignment . Profile alignment

. Multiple sequence alignment of target and template sequence relatives . Profile-profile alignment . Fold recognition alignment

. Visual examination of proposed alignments and manual adjustment . Assessment of confidence in alignment by residue (some regions will be more conserved than others) Structure alignment of multiple templates, if available . Align by structure (fssp, VAST, CE, etc.)

. Compare sequence alignments from structure to sequence alignments from multiple sequence alignments (see above) Experimental information

. Mutation data (site-directed, random, naturally occurring) . Functional data - e.g. DNA binding, ligands, metals, catalysis, etc.

sequences related to the target protein. The first step in modeling therefore is to use a database search program such as PSI-BLAST [33] against a nonredundant protein sequence database such as NCBI's nr database [43] or the curated SwissProt database [44]. With some care, a list of relatives to the target sequence can be gathered and aligned. PSI-BLAST provides reasonable multiple alignments, but it may be desirable to take the sequences identified by the database search and realign them with a multiple sequence alignment program such as CLUSTAL W [45], prrp [46], and multalin [47]. Another source of protein sequences related to the target is the database of expressed sequence tags (ESTs) available at NCBI. ESTs are derived from sequencing DNA transcriptions of mRNAs derived from cellular samples from various species or tissue types. While these sequences contain some non-coding nucleotides, the bulk of the sequence represents transcribed codons. This DNA sequence database can be searched with protein sequence queries using the TBLASTN program (http://www.ncbi.nlm.nih.gov/blast). NCBI also makes available the sequences of unfinished microbial genomes for TBLASTN searches (http://www.ncbi.nlm.nih.gov/Microb_blast/ unfinishedgenome.html), and a search of these may provide additional sequences related to the target.

It may be that a database search consisting of several rounds of PSI-BLAST will provide one or more sequences of known three-dimensional structure. If this is not the case then more sensitive methods based on fold recognition [48-79] (see Chapter 6 of this volume) or hidden Markov models (see Appendix to this volume) [80-87] of protein superfamilies may identify a suitable template structure. Once a template structure is identified, a sequence database search will provide a list of relatives of the template, analogous to searches for relatives of the target. At this stage it is useful to divide the sequences related to the target into orthologues and paralogues of either the target or the template (or both). The sequence variation within the set of proteins that are orthologous to the target provides information as to what parts of the sequence are most conserved and therefore likely to be most important in the model. Similarly variation in the set of proteins that are orthologous to the template provide a view of the template protein family that can be used to identify features in common or distinct in the template and target families. These features can be used to evaluate and adjust a joint multiple alignment of both families.

If there are multiple structures in the Protein Data Bank that are homologous to the target sequence, then it is necessary to evaluate them to determine which PDB entry will provide the best template structure and whether it will be useful to use more than one structure in the modeling process. In the case of a single sequence that occurs in multiple PDB entries, it is usually a matter of selecting the entry with the highest resolution or the most appropriate ligands (DNA, enzyme inhibitors, metal ions). In other cases, there may be more than one homologue related to the target sequence, and the task is to select the one more closely related to the target or to combine information from more than one template structure to build the model. To do this, a structure alignment of the potential templates can be performed by one of a number of computer programs available on the Internet (Dali [88], CE [89], etc., see Table 5.6). From alignments of the target to the available templates, the location of insertions and deletions can be observed and often it will be clear that one template is better than others. This may not be uniform, however, such that some regions of the target may have no insertions or deletions with respect to one template, but other regions are more easily aligned with the other template. In this case, a hybrid structure may be constructed [13].

Finally, any other experimental information available on the target or template proteins may be very helpful in producing and interpreting a structural model. This can include inhibitor studies, DNA binding and sequence motifs, proteolysis sites, metal binding, mutagenesis data, and so forth. A number of databases are available on the web that summarize information on particular genes, or that collect information on mutations and polymorphisms linked to disease including: the Cancer Genome Anatomy Project [90]; the Online Mendelian Inheritance in Man (OMIM) [91]; YDB (Yeast database), WormDB (C. elegans database), and PombeDB (Schizosac-charomyces pombe database), all available from http://www.proteome.com/ databases/); and the Human Gene Mutation Database [92].

Finally, we note that many methods in homology modeling rely on statistical data on protein conformations. We present a primer on some aspects of protein structure in the Appendix to this Chapter.


Modeling at different levels of complexity

Once an alignment is obtained between the target and a protein of known structure (as described in "Input Data" or in Chapter 6 of this volume), it is possible to build a series of models of increasing sophistication. Simple model

Keep backbone and conserved side chains by renaming and renumbering coordinates in the template structure with the new sequence using the alignment of target and template; rebuild other side chains using a side-chain modeling program (e.g., SCWRL [93, 94]); do not model insertions or deletions; that is, do not build new loops and do not close up gaps. For this purpose, we have made available a Perl program called blast2model that uses BLAST alignments as input and the SCWRL program to build model structures (http://www.fccc.edu/research/labs/dunbrack/software.html). Stepwise model

Borrow core backbone from template structure, minus coil regions with insertions or deletions in the sequence alignment; rebuild core side chains; rebuild coil regions with loop prediction method in conjunction with side-chain prediction method. Core backbone and side chains may or may not be held fixed during loop prediction (e.g., CONGEN [95]). Global model

Build entire protein from spatial restraints drawn from known structure(s) and sequence alignment (e.g., MODELLER [16, 17]). Choosing a model

It is not always the case that more sophisticated models are better than simpler, less complete ones. If elements of secondary structures are allowed to move away from their positions in the template and large changes are made to accommodate insertions and deletions, it may be the case that the model is further away from the target structure (if it were known) than the template structure was to begin with. This is the "added value" problem discussed by John Moult at the CASP meetings [96-98]. We would like methods that change the template structure closer to the target structure, such that they "add value" to a simple model based on an unaltered template structure, perhaps with side chains replaced. Extensive energy minimization or molecular dynamics simulations often bring a model further away from the correct structure than toward it [99].

The simple model is sometimes justified when there are no insertions and deletions between the template and target or when these sequence length changes are far from the active site or binding site of the protein to be modeled. This often occurs in orthologous enzymes that are under strong selective pressure to maintain the geometry of the active site. Even in non-orthologous enzymes, sometimes we are most interested in an accurate prediction of the active site geometry and not in regions of the protein distant from the active site. Unless loop modeling is accurate, it is possible that modeled loops in more sophisticated models may get in the way of side-chain modeling. If uncertain parts of the chain are not modeled, then side chains may find their correct positions and the model may be superior to a more complete version.

A stepwise model is probably the most common method used in homol-ogy modeling, since it is conceptually simpler than the more complex models and since each piece can be constructed and examined in turn. Some programs therefore proceed by taking the sequence-structure alignment, removing all regions where there are insertions and deletions, and reconstructing loops and side chains against the fixed template of the remaining atoms. Some methods may also allow all parts of the template structure to adjust to the changes in sequence and insertions and deletions. This usually takes the form of a Monte Carlo or molecular dynamics simulation [18]. A global model, as described above, rebuilds a structure according to constraints derived from the known template structure or structures. This is in contrast to stepwise models that proceed essentially by replacing parts of the template structure and perhaps perturbing the structure.

Many computer programs for homology modeling are developed to solve a single problem - such as loop or side-chain building, and may not be set up to allow all atoms of the protein to adjust or to model many components simultaneously. In many cases these methods have been tested by using simplified modeling situations with the newly developed software. Such examples include experiments with removing and rebuilding loops onto single protein structures, and stripping and rebuilding all side chains. Therefore in the next sections we review some of the work in these two areas.

Loop modeling Input information

In stepwise construction methods, backbone segments which differ in length between the template and target (according to the sequence alignment) need to be rebuilt. In some situations, even when the sequence length of a coil segment is maintained, it may be necessary to consider alternative conformations to accommodate larger side chains or residues with differing backbone conformational requirements, Gly ^ non-Gly, or Pro ^ non-Pro mutations (see Appendix). Most such loop construction methods have been tested only on native structures from which the loop to be built has been removed. But the reality in homology modeling is more complicated, requiring several choices to be made in building the complete structure. These include how much of the template structure to remove before loop building; whether to model all side-chains of the core before rebuilding the loops; and whether to rebuild multiple loops simultaneously or serially.

Deciding how much of the template structure to remove before loop building depends on examination of the sequence alignment and the template structure. Sequence alignments with insertions and deletions are usually not unambiguous. Most sequence alignment methods ignorant of structure will not juxtapose a gap in one sequence immediately adjacent to a gap in another sequence. That is, they will produce an alignment that looks like this alignment:



rather than like this one:


However, the latter alignment is probably more realistic [100], indicating that a 5 amino acid loop in the first sequence and structure is to be replaced with a 3 amino acid loop in the second sequence. The customary practice is to remove the whole segment between two conserved secondary structures units. Even with this practice, ambiguity remains, since the ends of secondary structures, especially a-helices, are not well determined. If loop building methods were accurate, then removing more of the segment would be a good idea. But long loops (longer than 6 amino acids) are difficult to rebuild accurately, and hence there is cause to preserve as much of the starting structure as possible. Once the backbone has been borrowed from the template in stepwise modeling, one has to decide the order of building the core side chains, the backbone of loops to be built, and their side chains. They may be built sequentially, or allowed to vary simultaneously. Side chains from the core may guide the building of the loop, but at the same time may hinder correct placement. It is certainly the case that in the final structure there must be a reasonably low-energy conformation that can accommodate all loops and side chains simultaneously. Different authors have made different choices, and there has been little attempt to try vary the procedure while keeping the search algorithm and potential energy function used [74, 101-105] fixed. Loop conformational analysis

Loop structure prediction is always based in one way or another on an understanding of loop conformations in experimentally determined structures. Loop conformational analysis has been performed on a number of levels, ranging from classification of loops into a number of distinct types to statistical analysis of backbone dihedral angles. Loop classification schemes have usually been restricted to loops of a particular size range: short loops of 1-4 residues, medium loops of 5-8 residues, and long loops of 9 residues or longer.

Thornton and coworkers have classified ¿-turns, which are short loops of 2-5 amino acids that connect two anti-parallel ¿-sheet strands [106-109]. These loops occur in a limited number of conformations that depend on the sequence of the loop, especially on the presence of glycine and proline residues at specific positions. The backbone conformation can be characterized by the conformations of each amino acid in terms of regions of the Ram-achandran map occupied (usually defined as ¿Sp, ¿E, yR, aL, and yL; see Appendix) [108]. Usually one or more positions in the loops requires an aL conformation and therefore a glycine, asparagine, or aspartic acid residue. One useful aspect of this analysis is that if a residue varies at certain positions or there are short insertions at certain positions, the effect on the loop can be predicted [109] since the number of possibilities for each length class is small. The program BTPRED is available (see Table 5.6) to predict the locations of specific types of ¿-turns from protein sequences and secondary structure predictions derived from other programs [110]. Single amino acid changes tend to maintain the loop conformations, except when Pro residues substitute for residues with ^ > 0° (see Appendix to this Chapter), while insertions change the class of the loop.

In recent years with a larger number of structures available, medium length loops have also been classified [111-117] by their patterns of backbone conformation residue by residue (ap, ¿Sp, etc.). A number of regularly occurring classes have been found, depending on length, type of secondary structure being connected, and sequence. These classes cover many but by no means all of the loops seen in non-S turn contexts. The work of Oliva et al. [114] is probably the most thorough classification to date.

Longer loops (> 8 amino acids) have been investigated by Martin et al. [118] and Ring et al. [119]. Martin et al. found that long loops fall into 2 classes, those that connect spatially adjacent secondary structures and those that connect secondary structures separated by some distance. Ring et al. provided a useful classification of longer loops as either strap (long extended loops), ^ loops (similar to those described by Leszczynski [113]) which resemble the Greek letter, and f loops, which are non-planar and have a zigzag appearance. The different loop types were found to have different distributions of virtual Ca-Ca-Ca-Ca dihedrals to accommodate their shapes.

Swindells et al. [120] have calculated the intrinsic y propensities of the 20 amino acids from the coil regions of 85 protein structures. The distribution for coil regions is quite different than for the regular secondary structure regions, with a large increase in ¿Sp and aL conformations and much more diverse conformations in the ¿E and aR regions. Their results also indicate that the 18 non-Gly, Pro amino acid type are in fact quite different from each other in terms of their Ramachandran distributions, despite the fact that they are usually treated as identically distributed in prediction methods [95, 121]. Their analysis was divided into the main broad regions of the Ramachandran map, ignoring the a.L region. The results are intriguing, in that the probability distributions are distinct enough even when calculated from a relatively small protein dataset. See the Appendix to this Chapter for examples. Loop prediction methods

Loop prediction methods can be analyzed for a number of important factors in determining their usefulness: 1) method of backbone construction; 2) what range of lengths are possible; 3) how widely is the conformational space searched; 4) how are side chains added; 5) how are the conformations scored (i.e., the potential energy function); and 6) how much has the method been tested (length, number, self/non-self). We summarize a large number of published methods in Table 5.2. Only a subset of these methods are available as programs that can be downloaded or as web-servers that can perform calculations on input structures. This is indicated in Table 5.2, and the addresses for available programs are listed in Table 5.6. Database methods

The most common approach to loop modeling involves using "spare parts" from other (unrelated) protein structures [11, 13, 119, 122-133]. These database methods begin by measuring the orientation and separation of the backbone segments flanking the region to be modeled, and then search the PDB for segments of the same length which span a region of similar size and orientation. This work was pioneered by Jones and Thirup [122]. They defined a procedure in which Ca-Ca distances were measured among six residues, three on either side of a backbone segment to be constructed. These 15 Ca-Ca distances were used to search structures in the Protein Databank for segments with similar Ca-Ca distances and the appropriate number of intervening residues. Other authors have used the same method for locating potential database candidates for the loop to be constructed [124, 127, 128, 130].

In recent years, as the size of the PDB has increased, database methods have continued to attract attention. With a larger database, recurring structural motifs have been classified for loop structures [111, 114, 116, 129, 134], including their sequence dependence. Database methods have been applied only for loops of up to 8 residues. Fidelis et al. [126] found that for loops of length greater than 7 there is not likely to be a segment in the PDB that corresponds to the correct loop conformations for the 58 protein they looked at. Some authors report that when the database contains a loop of similar structure to the target to be modeled, their methods perform well, with RMSD values around 1A or better. Otherwise they tend to fail [124, 127, 130].

Although many methods have been published, they have usually only been tested on a small number of loops, and then usually in the context of rebuilding loops onto their own backbones, rather than in the process of homology modeling. These numbers are listed in Table 5.2. One exception to the small numbers of tests is the work of Fechteler et al. [128], who developed a database method (implemented in the program BRAGI) and tested it on 71 insertion and 74 deletion regions of 1-3 amino acids in length. Another is the work of Rufino et al., who used their loop classification scheme predictively on 1785 loops [129]. Unfortunately, they found their database not particularly successful in making predictions. Their database could only be used to make a prediction in 63% of cases, and only 54% of these predictions were considered correct. Construction methods

The main alternative to database methods is construction of loops by random or exhaustive search mechanisms. Moult and James [135] early on used a systematic search to predict loop conformations up to 6 residues long. They pioneered several useful concepts in loop modeling by construction: the use of a limited number of y pairs for construction; construction from each end of the loop simultaneously; discarding conformations of partial loops that cannot span the remaining distance with those residues left to be modeled; using side-chain clashes to reject partial loop conformations; and the use of electrostatic and hydrophobic free energy terms in evaluating predicted loops. Their method successfully predicted the structures of two loops in trypsin. The CONGEN program of Bruccoleri and Karplus [18, 95, 136, 137] is also based on an exhaustive search algorithm, this time based on an evenly spaced grid of backbone conformations (either 15° or 30°) in accessible regions of the Ramachandran map. Gly and Pro are treated as separate classes, while the remaining 18 amino acids are treated identically in terms of their backbone energies and allowed range. Amino acids are built in turn, with partial conformations discarded if it will be impossible to close the loop with the remaining amino acids to be built. Energies are evaluated with the CHARMM polar-hydrogen force field [138]. Side chains are built by exhaustive search on 30° increments of side-chain dihedrals, as each residue is built. Since the method has been tested only on a small number of immunoglobulin CDR loops, it is difficult to tell how generally successful the algorithm is likely to be.

A particularly interesting loop construction algorithm is the scaling-relaxation method of Zheng et al. [139-141]. In this method, a full segment is sampled and its end-to-end distance is measured. If this distance is longer than the segment needs to be, then the segment is scaled in size so that it fits the end-to-end distance of the protein anchors. This results in very short bond distances, and unphysical connections to the anchors. From

Loop methods



Side chains

Potential energy

Abagyan [147, 149, 2921

ICM (S3250 per year)

Searched with Monte Carlo probability function

Molecular mechanics + solvation terms

Bruccoleri, Karplus [95, 136, 201]


Generated on grid of X angles


Carlacci, Chou, Englander [293, 294]

Part of simulation

ECEPP/2; vdW + electrostatic, hydrogen bonds, torsional potential; £ = 2; united-atom + polar hydrogens; harmonic force on N, Ca, C, O atoms of terminal residues on x-ray positions

Construction: Internal coordinate Monte Carlo from biased probability distribution function for dihedrals

Construction: Grid-based search + closure with Go-Scheraga + minimization with CHARMM; unproductive branches are truncated

Construction: MC/ simulated annealing; fixed template; fixed bond lengths and angles; random translations and rotations of modeled segment + random changes in dihedrals in simulated annealing

5 CASP2 deletion loops of 4-5 aa and 2 CASP2 insetion loops (10,13 aa)

12 Ig loops [136]

helix, sheet, loops 5-9 aa of BPTI only

Deletion loops 1.19-2.55 A RMSD;

Insertion loops 8.1, 11.4A RMSD

Ig loops: 0.6-2.6 A for backbone atoms

9 aaloop = 1.87 A backbone rms

Collura, Higo, Robson, Gamier [295-297]

Part of simulation Robson-Platt potential [298]; vdW + electrostatic; torsions on side chains; r-dependent dielectric; united-atom + polar hydrogens

Dudek, Palmer, Can be imple- Added after ECEPP + hydration

Scheraga mented in backbone function

[299-301] ECEPP deformation +

(S195) minimization

Construction: MC/ simulated annealing; increasing weight of diiferent energy terms non-uniformly; anchors fixed; initial conformation is extended peptide with harmonic potential centered on x-ray positions applied to 4 backbone atoms at each end; rigid geometry except dihedrals Construction: Random deformation (in acceptable (fn// regions spec, for each aa) + minimization + add side chains + minimization

8 self-protein loops in Ig, BPTI, trypsin, 6-8 aa

RMS 1A for backbone; more complicated solvent models not necessarily helpful

5 BPTI self-loops 1 BT self-loop, all length 7

Depends on knowing initial structure. Low RMS 0.16, 0.27, 0.91, 1.15, 0.32, 0.89. Deformations are random tfix// pairs for 3 residues and analytic calculation of remaining (fn// to close loop



Side chains

Potential energy

Janardhan, Vajda [102]

Can be imple-menteed in CONGEN

Built with CONGEN

CHARMM + surface-solvation term + side-chain entropy term

Moult, James, Fidelis [126, 135]

Rotamer library, complete search

Electrostatic term, image-charge solvation term, van der Waals, hydrophobic term

Rao, Teeter [148] Can be imple- Part of MD mented in simulation


AMBER £ = 4r; counterions present

Construction: CONGEN: 6 CASP1 loops systematic search

Construction: by 11 (fn// pairs; truncate unproductive branches; build many copies from each end and link; allow closer than normal contacts

Construction: AMBER 50 ps simulation

2SGT loops based on bovine trypsin, 4, 5 residues; 11 homology loops in 4 proteins; 4-6 aa in length One 7-residue loop in purothionin built from the corresponding 7-residue loop in crambin

Evaluating free energies for CAS PI predictor loops + CONGEN side chains; importance of internal energy terms; side-chain entropy in loop predictions

Hydrophobic term very useful; database methods are not adequate even for short loops

Successfully reproduced purothionin loop with MD and simulated annealing

Rapp, Friesner [104]

Can be implemented in MacroModel, ($995)

Added with RSA of Shenkin, rotamer based, after initial generated loops

AMBER, AMBER94 + Born solvation implemented in MacroModel


Isoteric placement + Ponder & Richards rotamers

Simple steric clash check

Shenkin, Fine, Wang, Yarmush, Levinthal [143, 144]

Implemented in None in early steps; DISCOVER,

Insightll as "TWEAK"

added for low energy loops and searched on 10° grid, then minimized with PAKGGRAF


Construction: Initial loops generated randomly with (fn// library. Loops that fit are saved. 10 000 step substructure minimization; then MD and simulated annealing; minimization Construction: genetic algorithm used to search among statistical distribution of tetrapeptide conformers; TWEAK algorithm used to refine model [143, 144] Construction: Random-tweak: starting random structure is constrained by Lagrange multipliers to fit the Jones-Thirup base distances; max change in dihedrals = 10° at each step, until convergence; followed by MD/min or min.

2 self-loops, 8 and 12 aa from ribonuclease A

4 CDR loops in antibody HY-HEL5

CDR's of MCPC603

r-dependent dielectric should fail badly; 12 aa loop at 0.81 A; AMBER94 needed, not AMBER united atom model Better results than other methods on this antibody

Success on short loops; multiple minima for longer CDR; need to add side chains to restrict some of the mimima



Side chains

Potential energy

Sowdhamini, Ramakrishnan, Balaram [145]

Added after search and disulfide bond formation, minimized; Janin rotamers

MODIP force field

Sudarsanam, None Simple steric

Srinivasan checks.

Construction: Uniform Disulfide bonded random tfix// (G, P, non- structures, 15 GP) in allowed regions; aa if can form disulfide, it is made, followed by energy minimization

Construction: Random 7 peptides, 6-20 draws from statistical aa in length distribution of 1 for all 400 possible pairs of aa types, further subdivided by neighboring aa types (resulting in tetramer distributions). 10000 conformers generated and clustered and those that fit anchors used as prediction

Disulfide loop predictions: conotoxin, endothelin, 3-5 A rms; generate 350000 conformations to find loops for 15 aa Various conformations can be generated, including helix and sheet.

Zheng, Delisi, Kyle Can be imple- Included in CHARMM, no

[139-141] mented in simulation solvation; £ = 4r



Bates, Oliva, Summers procedure CHARMM

Sternberg to construct

[114,131,305] similar side chains; otherwise McGregor ss-dep

Fechteler, Lessel, BRAGI Not predicted Contact distances

Schomburg only

Greer [11]

Maximal overlap with old side chain

Hagler potential

Construction: Scaling-relaxation method in CONGEN: reduce random loops to fit anchors and then scale to size gradually with ABNR minimization; random loops from (fn// distribution Database: [114] like Jones-Thirup based on Ca positions; followed by annealing in Quanta

Database: clustered fragments to reduce library size [134]; fit by comparison of internal distance Ca matrices Database: Alignment of multiple structures related to target to define variable regions (VR's); database fragments aligned to Ca steric check, build side chains with overlap, restrained minimization

8 self-loops from 6 proteins, 7 aa each

Lowest E is lowest RMS 7/8 times; Zheng93 0.83-

1.44 A RMS bb Zheng96 has 0.38 A

9 CASP2 loops, 2-9 aa in length

71 insertions, 74 deletions, 1-8 aa

CASP2: best loop may be most probable, not one with lowest energy because of secondary structure shifts Good results on short indel regions

No validation No comparison to known structures

Authors Availability Side chains Potential energy

Koehl, Delarue Homology handled with Tuffery van der Waals only

[127] program library and SCMF


Pellequer, Chen Can be imple- Tuffery rotamer CHARMM + finite

[306] mented in library [165] difference Pois-

CHARMM son Boltzmann electrostatics

Rooman, Wodak, Thornton [115]


Database: Database of 81

8 gaps in BPTI,

Loop closure:

proteins scanned for

each 4 aa;


Jones-Thirup distances;

same 8 based

RMS; 67%

minimized first; mean

on homology;

Xi (as in sc

field theory

6 other non-


self loops in

0.19-0.90 for


homology loops

Database: Placement from

6 CDR loops

Gas phase energy

database followed by

from 1

most discrimi

energy minimization;


native; solva

selection based on free

tion free energy

energy cycle calculation

helped in some

with CHARMM +




Database: Sequence

1464 loops

Sequence alone

pattern to predict loops

in loop prediction is not very



Rufino, Donate, Sloop Not included None. Loops chosen

Canard, by sequence and

Blundell anchor positions

[111,129] in comparison to database loops.

Samudrala, Moult [198, 199]


Side-chain rotamer sampled and evaluated with database potential

Interatomic potential from database

Shepherd, Gorse, BTPRED None

Thornton [110]


Database: by class of loop, 1785 loops 1-8 aa long, based on length, secondary structure, vectors, sequence compared to class [111]

Database: Graph theory and clique finding among backbone segments derived from PDB

Database: Prediction of /?-turn loop type by sequence in comparison to conformational analysis of /? turns in structures with a neural network

22 CASP2 loops in 3 proteins; 4 CDR loops in antibody

3359 P turns

Predictions for loops that can be matched to a class in database: under lax criteria, predictions can be made for 63% of loops, of which 54% are correct.

Database method remains limited; improvement over CASP1

Correct prediction of turn type in 47/59 cases (80%).

Authors Availability Side chains Potential energy

Summers, Karplus Isosteric CHARMM, r-

[124] replacements dependent dielectric van Vlijmen, Karplus [130]

Wilmot and Thornton [107, 108]

Reduced side-chain model of Levitt; CB-only in minimization of initial loops; later bbdep rotamer library None



Database: uses Jones-Thirup 15 Ca-Ca distances of 3 residues on either side of loop; 31 protein database; constrained energy minimization of fragments Database: Jones-Thirup distances + Cfl atoms

18 non-self loops, 0-4 aa in length

18 self-loops

Database: Prediction of 59 loops /?-turn loop type by sequence in comparison to conformational analysis of /? turns in structures

2/3 of loops were well-modeled

8/18 up to 9 aa < 1.07 Â; 17/18 had correct loop in top 3 conformations

Correct prediction of turn type in 47/59 cases (80%).

Wojcik, Mornon, Chomilier

No side chains


Methods for loop prediction are listed. When publicly available programs can be identified, they have been listed by name. Some methods have been implemented in publicly available programs, but no specific software or scripts are available.

Abbreviations used in table: MC - Monte Carlo; aa - amino acid; vdW - van der Waals potential; Ig - immunoglobulin or antibody; CDR - complementarity-determining regions in antibodies; RMS -root-mean-square deviation; r-dependent dielectric - distance-dependent dielectric constant; e - dielectric constant; MD - molecular dynamics simulation; self-loops - prediction of loops performed by removing loops from template structure and predicting their conformation with template sequence; bbdep - backbone-dependent rotamer library; SCMF - self-consistent mean field; PDB - Protein Data Bank; Jones-Thirup distances - interatomic distances of 3 Ca atoms on either side of loop to be modeled.

Database: Classification database of 3-8 aa loops [116]; choice determined by sequence + distance of ends; let anchors rotate tfiy minimize to close loops; minimization with X-PLOR

3-8 aa in length; 0.5-2.8Á back-13 CASP3 bone RMS for loops CASP3 loops there, energy minimization is performed on the loop, slowly relaxing the scaling constant, until the loop is scaled back to full size. The method is efficient, since many loops can be generated which span the anchors, closing the loop with reasonable dihedral angles. Again, this method has been tested on only a few proteins, mostly immunoglobulins. Using the same methodology, Rosenbach and Rosenfeld have addressed the important issue of simultaneously modeling two or more loops in proteins [142]. In one test case, they found it advantageous to model two loops simultaneously, so that each felt the presence of the other in energy calculations

Other methods have built chains by sampling Ramachandran conformations randomly, keeping partial segments as long as they can complete the loop with the remaining residues to be built [143-145] For longer loops, these methods seem to be much more promising, since they spend less time in unlikely conformations searched in the grid method. Many of these methods are based on Monte Carlo or molecular dynamics simulations with simulated annealing to generate many conformations which can then be energy minimized and tested with some energy function to choose the lowest energy conformation for prediction [143, 144, 146-148]. Several authors have developed Monte Carlo methods that draw y values from probability distributions derived from the PDB [149, 150] for other purposes, such as NMR structure refinement and ab initio peptide folding. These methods are promising because they are faster than exhaustive calculations, but have not been designed or tested on loop generation.

One important aspect in the development of a prediction method based on random (or exhaustive) construction of backbone conformations is the free energy function used to discriminate among those conformations which successfully bridge the anchors (see Table 5.2). Janardhan and Vajda [102] have found that a free energy function including a molecular mechanics term for the conformational energy, and a hydrophobic surface free energy term for the solvation is able to discriminate between decoy loops and real loops and to locate the correct conformation. Rapp and Friesner [104] reported recently that the AMBER94 force field and a generalized Born solvation model coupled with molecular dynamics and Monte Carlo simulations was able to regenerate conformations close to the crystal structure for one 8 and one 12 amino acid loop segment of ribonuclease. Their energy function was able to discriminate between good and poor conformations. Available programs

Apart from complete modeling packages (discussed below), only a small number of programs are freely and publicly available from the many methods that have been published (see Table 5.6). Among database methods, this includes only Swiss-PDBViewer [151], and BRAGI which implements the method of Fechteler et al. [128]. Several loop databases are available. Among construction methods, CONGEN is currently available [95]. CODA runs both a database and an ah initio construction method [152].

Side-chain modeling Input information

Side-chain modeling is a crucial step in predicting protein structure by homology, since side-chain identities and conformations determine the specificity differences in enzyme active sites and protein binding sites. The problem has been described as "solved" [153], although new methods [154] or improvements on older ones [94] continue to be published. Some side-chain prediction methods stand on their own, and are meant to be used with a fixed backbone conformation and sequence to be modeled given as input. Other methods have been developed in the context of general homol-ogy modeling methods, including the prediction of insertion-deletion regions. Even when using general modeling procedures, such as MODELLER, it may be worthwhile subsequently to apply a side-chain modeling step with other programs optimized for this purpose. This is especially the case when side-chain conformations may be of great importance to interpretation of the model. It is also often the case that insertion-deletion regions are far away from the site of interest, and loop modeling may be dispensed with. Indeed, significant alterations of the backbone of the template, if they are not closer to the target to be modeled (if it were known) than the template, may in fact result in poorer side-chain modeling than if no loop modeling were performed. As described above, the choice of the template may depend not only on sequence identity but also on the absence of insertions and deletions near the site of interest. If this is successful, side-chain modeling rises in importance in relation to loop prediction.

Side-chain prediction methods described in detail in the literature have a long history, although only a small number of programs are currently publicly available (see Tables 5.3 and 5.6). Nearly all assume a fixed backbone, which may be from a homologous protein of the structure to be modeled, or may be the actual X-ray backbone coordinates of the protein to be modeled. Many methods have in fact only been tested by replacing side chains onto backbones taken from the actual three-dimensional coordinates of the proteins being modeled ("self-backbone predictions"). Nevertheless, these methods can be used for homology modeling by substituting the target sequence onto the template backbone. When a protein is modeled from a known structure, information on the conformation of some side chains may be taken from the template. This is most frequently the case when the template and target residue are identical, in which case the template residue's Cartesian coordinates may be used. These may be kept fixed as the other side chains are placed and optimized, or they may be used only as a starting conformation and optimized with all other side chains. Only a small number of methods use information about non-identical side chains borrowed from the template. For instance, Phe ^ Tyr substitutions only require the building or removal of a hydroxyl group while Asn ^ Asp substitutions require changing one of 3 atoms from NH2 to O or vice versa. Summers and Karplus [155, 156] used a more detailed substitution scheme, where for instance the x1 angle of very different side-chain types (e.g., Lys ^ Phe) might be used in building side chains (see Appendix for definition of side-chain dihedral angles). In the long run, this is probably not advantageous, since the conformational preferences of non-similar side-chain types may be quite different from each other. Rotamers and rotamer libraries

Nearly all side-chain prediction methods depend on the concept of side-chain rotamers. From conformational analysis of organic molecules, it was predicted long ago [157, 158] that protein side chains should attain a limited number of conformations because of steric and dihedral strain within each side chain and between the side chain and the backbone (dihedral strain occurs because of Pauli exclusion between bonding molecular orbitals in eclipsed positions) [159]. For sp3-sp3 hybridized bonds, the energy minima for the dihedral are at the staggered positions that minimize dihedral strain at approximately 60°, 180°, and —60°. For sp3-sp2 bonds, the minima are usually narrowly distributed around +90° or —90° for aromatics and widely distributed around 0° or 180° for carboxylates and amides (e.g., Asn/Asp x2 and Glu/Gln /3).

As crystal structures of proteins have been solved in increasing numbers, a variety of rotamer libraries have been compiled with increasing amounts of detail and greater statistical soundness; that is, with more structures at higher resolution [160-169]. The earliest rotamer libraries were based on a small number of structures [160-163]. Even the widely used Ponder & Richards library was based on only 19 structures, including only 16 methionines [163]. The most recent libraries are based on over 600 structures with resolution of 1.8 AA or better and mutual sequence identity less than 50% between any two chains used.

Most rotamer libraries are backbone-conformation-independent. In these libraries, the dihedral angles for side chains are averaged over all side chains of a given type and rotamer class, regardless of the local backbone conformation or secondary structure. These libraries include two in common use in side-chain conformation prediction methods, that of Ponder & Richards and that of Tuffery et al. [165]. It should be noted that the Ponder-Richards library is based on a very small sample of proteins and should not be used for conformation prediction (which was not its intended use anyway). The Tuffery library is based on 53 structures, which is also a very small sample compared to the PDB now available. Kono and Doi also published a rotamer library based on a cluster analysis of 103 structures [170]. Richardson et al. have compiled rotamers recently from proteins with resolution better than 1.7A (compared to the more common 2.0A resolution cutoff), while excluding residues whose conformation is likely to be poorly determined, for example, partially disordered residues (high X-ray temperature factors) and those with unfavorable steric overlaps. This winnowing process leaves a smaller sample but one with more tightly clustered dihedral values [171].

Several libraries have been proposed that are dependent on the conformation of the local backbone [164, 166-169]. McGregor et al. [164] and Schrauber et al. [167] compiled rotamer probabilities and dihedral angle averages in different secondary structures. To my knowledge these libraries are not used in any available side-chain conformation prediction programs. We h

Continue reading here: Protein Structure Prediction

Was this article helpful?

0 0