Tertiary structurebased methods

Structure-based methods are reviewed in [185] and can be classified into two types: profile methods based on dynamic programming (see Chapter 2 of this volume) for computing optimal alignments of structural environments and threading approaches using full pair-interaction potentials [30, 125, 138, 186] (see Section 6.2). The latter methods use optimal exhaustive search procedures [175], branch and bound approximation algorithms [187, 188] or other heuristic approaches for the optimization [189, 190]. Reviews and comparison of threading methods can be found in [186, 191, 192].

Profile threading methods As described in Section 6.2, algorithms that optimize threading profiles are the same as for pairwise sequence alignment and are based on dynamic programming. These methods have been introduced by Eisenberg et al. [136, 193, 194].

Several variants using multiple sequence information and sequence profiles generated for the query sequence for the threading process have been described and demonstrate to improve fold recognition performance [109, 195-203] similarly as has been observed for secondary structure prediction and sequence methods.

Combining secondary structure prediction with profile threading [165, 204] has been demonstrated to improve threading performance as well.

As methods based on dynamic programming cannot account for pair-wise-interaction potentials the so-called frozen approximation approach [189] has been proposed. This method performs several iterations of profile environments. In the first iteration, the chemical environment is defined via the contact partners of the template. In subsequent rounds the aligned residues from the previous iteration replace the residues of the template. The idea is that target and template structure are similar enough such that the iterative process converges towards the optimal assignment.

A method employing 3D structural profiles for detecting remote homologies has been proposed by Sternberg et al. [205]. This method, called 3D-PSSM (three dimensional position specific scoring matrix), uses sequence profiles to search databases. The profiles are computed from multiple sequence alignments using the standard method used for generating PSI-BLAST profiles [9, 206]. In order to use structural information, the following approach has been proposed (see Figure 6.4): for any protein f in the fold library, determine the set S of related structures in the superfamily. For any sequence x in S compute the PSI-BLAST profile via a sequence representative chain of superfamily

SCOP superfamily

SCOP superfamily representative chain of superfamily

ID PSSM of Query sequence

PSI-Blast profile of query sequence query sequence

Overview of the 3D-PSSM approach for protein structure prediction: For any protein (superfamily) PSI-BLAST searches and profiles are computed and using the structural superposition of the (structurally related) family members. On the basis of

ID PSSM of Query sequence

PSI-Blast profile of query sequence query sequence

Overview of the 3D-PSSM approach for protein structure prediction: For any protein (superfamily) PSI-BLAST searches and profiles are computed and using the structural superposition of the (structurally related) family members. On the basis of the supersposition a structural profile representing the family is compiled. Given a query sequence, a corresponding sequence profile is computed, again using PSI-BLAST, and both profiles are aligned using a dynamic programming based profile threading method.

search of x against an appropriate database yielding a profile Px. The profiles for the sequences in S are then combined using the structural alignments for the superfamily and the implied equivalences of residues. This is possible, despite the fact that the proteins in S do not necessarily show a reliable sequence similarity. The reason is that these proteins are structurally superposable and thus reliable residue equivalences can be determined. The idea behind the approach is to use all the information available for a group of structurally similar proteins in order to do the sequence structure (threading) search. The fold library and the associated profiles can be computed beforehand such that the search for a query protein can efficiently be done via computing a sequence profile and matching this profile against all 3D-PSSMs from the library using standard dynamic programming methods. During this search, in addition, solvent accessibility and secondary structure preferences are evaluated. The secondary structure prediction is performed with the program PsiPRED [156]. Thus, the approach successfully combines evolutionary and structural information available in the sequence and structure databases using the most advanced search (PsiPRED) and prediction methods (PsiPRED) as well as established profile building methods (PSI-BLAST) and structure classifications (SCOP). The major disadvantage of the method appears to be that there is probably not one single profile representing the properties of any individual member of a protein super-family. Nevertheless, the performance of the method has been shown to improve on sequence search alone on a benchmark of 136 homologous protein pairs that cannot be detected with PSI-BLAST (see [205] for details of the benchmark construction and the results obtained). Here, 3D-PSSM was able to detect 18% of the pairs. The program is available as a web service at the Biomolecular Modeling Laboratory of the Imperial Cancer Research Fund [205]. In the latest CASP-4 and CAFASP-1 experiments both this approach and the associated automated prediction server were among the most successful groups.

Optimal threading with full pairwise/interaction potentials via exhaustive enumeration Usually, algorithms aiming at optimal solutions address the fragment threading problem: here, the structure is partitioned into a number k of secondary structure elements, which are considered fixed fragments of a definite length where no gaps are allowed in any threading alignment. The problem reduces to finding a mapping of sequence positions onto the first positions of all the fixed fragments of the structure. This mapping has to observe the linear ordering of the sequence residues, the structure fragments and the loop length constraints. The basic assumption behind this model for sequence-structure relation is that the structure is composed of important and irrelevant parts, the important parts are the fragments, which have to be aligned to constitute the essential parts of the fold and the irrelevant part are loop regions, which are usually not conserved even among similar folds and, thus, need not be aligned and the alignment score is not taken into account for the overall alignment score.

The fragment threading problem has a much smaller search space of size exponential in n and k instead of n and m for protein threading. In some cases, for small proteins or proteins with a small number of fragments an exhaustive enumeration of all mappings of sequence regions onto fragments is possible. Therefore, the optimal threading can be found for almost any conceivable scoring function, i.e. scoring functions that depend on the complete alignment. Nevertheless, to find meaningful threadings capturing shared biological and/or structural features in a true sequence-structure relationship between the two proteins, the scoring function has to model the respective features very accurately. As of yet, no scoring function appears to be accurate enough to discriminate between similar alignments and to select the 'correct' alignment from a large number of alternative alignments. For a broad class of empirical pair interaction potentials it has actually been demonstrated that almost always there exist alignments in the vicinity of the structurally closest alignments, that score better than the standard-of-truth according to the scoring function used. Such alignments are easily found via a greedy search of the alignment space or some simple perturbations of the correct alignment using a genetic algorithm [207].

Branch-and-bound algorithms for fragment threading An algorithm for finding the global optimum of gapped threadings between a query sequence and a structural core of consecutive fragments (fragment threading [208]) has been described by Lathrop and Smith [175]. The fragment threading formalization is contrasted to the full threading approach as addressed by recursive dynamic programming (RDP, see below). The method can deal with quite general scoring functions including full pair interaction potentials and imposes only some (length) restrictions on the loop regions. The method deals with sets of large numbers of threadings at a time using an efficient representation and estimates appropriate bounds for the maximum achievable score of all the alignments in the sets. Thus, large numbers of potential threadings can be eliminated by applying one of several bounding conditions. Alignments in the fragment threading formalization can be represented by defining sequence positions to be mapped onto the first structural position of any core element, i.e. a vector {t1y..., tn) for n Core elements C1,..., Cn. The amino acid chain and loop restrictions define additional lower and upper bounds for the amino acid to be placed onto the start of a structural segment. A set of alignments is than represented as a vector of lower (hi) and upper bounds (di) for the segments Ci and contains all threadings compatible with the constraint that the amino acid aligned to Q(l) is in the range (hi, di). The algorithm enumerates all possible threadings by selecting such a set and a segment and a so called split point t in (hi, ti) via recursively producing three subsets (hi, t - l), (t), and (t + l, di) and computing the associated bounds on the score of the subsets. If the set of alignments cannot improve the currently best alignment score known, the whole set can be discarded. The algorithm guarantees to come up with the optimal threading alignment [175], given enough time. If stopped early, the algorithm will present the best scoring alignment found so far. Even after an optimal alignment has been found, the algorithm can be continued in order to produce alternative optimal alignments solutions or present the proof that the alignment is optimal, which can only be stated for certain, once the whole search space has been covered. Typically, it takes about ten times as long to provide the proof as compared to computing the optimal alignment.

The performance of the algorithm heavily depends on the stringency of the bounds of eliminated large sets of poorly performing solutions before actually evaluating them individually.

Considering the potentially huge search space, Lathrop and Smith report implicit evaluation of up to 6.8 * 1028 alignments per second [175] (Table 6.1), which compares with about 140 explicit evaluations of the scoring function as observed by Bryant [209].

Despite the fact that, for a set of threading instances, the optimal alignment could be produced for a set of five commonly used scoring potentials, the relevance of globally optimal solution remains doubtful from a biological point of view, due to apparent deficiencies in the used scoring systems. Nevertheless, the approach demonstrated that optimal solutions can be produced for biologically relevant problem instances with, in some cases, structurally meaningful alignments that exhibit small deviations from the standard-of-truth alignment. Whether the algorithmic optimization or the tuning of the scoring functions is the most decisive factor for obtaining high quality alignments with a reasonable amount of computing resources remains to be seen.

Recently, Xu and Xu proposed a new branch-and-bound algorithm called PROSPECT [210-212] for the full threading problem which guarantees to find the globally optimal alignment for a scoring function incorporating pairwise contact terms (in addition to sequence terms, environment profile scores and gap penalties). The algorithm can avoid enumerating the complete search space by using appropriate bi-partitionings of the sequence and structure, such that topological complexity (TC) is minimized, i.e. the parts of the partitions exhibit a minimum number of inter-partition links (contacts). As all legal combinations of link assignments have to be considered the TC is an important factor determining the running time of the algorithm. In biological examples, the TC has been observed to be small (<8) in practice, and the algorithm has a time complexity of 0(nTC ■ nTC!2), where n is the target length and N the maximum allowed loop length (in the current version N < 20). Again, including predicted secondary structure information improves the recognition performance of the approach. The PROSPECT method allows exploiting constraints on the target protein such as known disulfide bonds, active site information and long-range (NOE/NMR) distance constraints. The method has been evaluated on benchmark sets [210]

Continue reading here: Tab

Was this article helpful?

0 0