# Sequence alignment

A sequence alignment [38] is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common

Dot plot comparing two hemoglobin sequences.

evolutionary origin. If the same letter occurs in both sequences then this position has been conserved in evolution (or, coincidentally, mutations from another ancestral residue has given rise to the same letter twice). If the letters differ it is assumed that the two derive from an ancestral letter, which could be one of the two or neither. Homologous sequences may have different length, though, which is generally explained through insertions or deletions in sequences. Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion. Since an insertion in one sequence can always be seen as a deletion in the other one sometimes uses the term 'indel', or simply 'gap'.

In such a simple evolutionarily motivated scheme, an alignment mediates the definition of a distance for two sequences. One generally assigns 0 to a match, some positive number to a mismatch and a larger positive number to an indel. By adding these values along an alignment one obtains a score for this alignment. A distance function for two sequences can be defined by looking for the alignment which yields the minimum score.

Naively, the alignment that realizes the minimal distance between two sequences could be identified by testing all possible alignments. This number, however, is prohibitively large but luckily, using dynamic programming, the minimization can be effected without explicitly enumerating all possible alignments of two sequences. To describe this algorithm [39], denote the two sequences by s = s1,..., sn and t = t1,..., tm. The key to the dynamic programming algorithm is the realization that for the construction of an optimal alignment between two stretches of sequence s1,..., s; and t1,..., tj if suffices to inspect the following three alternatives:

(i) the optimal alignment of s1,..., s;_1 with t1,..., tj—1, extended by the match between s; and tj;

(ii) the optimal alignment of s1,..., s;_1 with t1,..., tj, extended by matching s; with a gap character '—';

(iii) the optimal alignment of s1,..., s; with t1,..., tJ—1, extended by matching a gap character '—' with tj.

Each of these cases also defines a score for the resulting alignment. This score is made up of the score of the alignment of the so far unaligned sequences that is used plus the cost of extending this alignment. In case: (i) this cost is determined by whether or not the two letters are identical and in cases (ii) and (iii) the cost of extension is the penalty assigned to a gap. The winning alternative will be the one with the best score (Figure 2.3).

To implement this computation one fills in a matrix the axes of which are annotated with the two sequences s and t. It is helpful to use north, south, west, and east to denote the sides of the matrix. Let the first sequence extend from west to east on the north side of the matrix. The second sequence extends from north to south on the west side of the matrix. We want to fill the matrix starting in the north-western corner, working our way southward

Schematic representation of the edit matrix comparing two sequences. The arrows indicate how an alignment may end according to the three cases described in the text.

Schematic representation of the edit matrix comparing two sequences. The arrows indicate how an alignment may end according to the three cases described in the text.

row by row, filling each row from west to east. To start one initializes the northern and western margin of the matrix, typically with gap penalty values. After this initialization the above rules can be applied. A cell (i, j) that is already filled contains the score of the optimal alignment of the sequence Si,..., Si with t1,..., tj. The score of each such cell can be determined by inspecting the cell immediately north-west of it (case (i)), the cell west (case (ii)), and the one north (case (iii)) of it and deciding for the best scoring option. When the procedure reaches the south-eastern corner that last cell contains the score of the best alignment. The alignment itself can be recovered as one backtracks from this cell to the beginning, each time selecting the path that had given rise to the best option.

The idea of assigning a score to an alignment and then minimizing or maximizing over all alignments is at the heart of all biological sequence alignment. However, many more considerations have influenced the definition of the scores and made sequence alignment applicable to a wide range of biological settings. Firstly, note that one may either define a distance or a similarity function to an alignment. The difference lies in the interpretation of the values. A distance function will define positive values for mismatches or gaps and then aim at minimizing this distance. A similarity function will give high values to matches and low values to gaps and then maximize the resulting score. The basic structure of the algorithm is the same for both cases. In 1981, Smith and Waterman [40] showed that for global alignment, i.e., when a score is computed over the entire length of both sequences, the two concepts are in fact equivalent. Thus, it is now customary to choose the setting that gives more freedom in appropriately modeling the biological question that one is interested in.

In the similarity framework one can easily distinguish among the different possible mismatches and also among different kinds of matches. For example, a match between two Tryptophanes is usually seen to be more important than a match between two Alanines. Likewise, the pairing of two hydrophobic amino acids like Leucine and Isoleucine is preferable to the pairing of a hydrophobic with a hydrophilic residue. Scores are used to de scribe these similarities and are ususally represented in the form of a symmetric 20 by 20 matrix, assigning a similarity score to each pair of amino acids. Although easy to understand from the physical characteristics of the amino acids, the values in such a matrix are usually derived based on an evolutionary model that allows one to estimate whether particular substitutions are preferred or avoided. This approach has been pioneered by M. Dayhoff [41] who computed a series of amino acid similarity matrices. Each matrix in this series corresponds to a particular evolutionary distance among sequences. This distance is measured in an arbitrary unit called 1 PAM, for 1 Accepted Point Mutation (in 100 positions). The matrices carry names like PAM120 or PAM250 and are supposed to be characteristic for evolutionary distances of 120 or 250 PAM, respectively. Other more recent series of matrices are the BLOSUM matrices [42] or Gonnet's [43] matrices. For every matrix one needs to find appropriate penalties for gaps.

The treatment of gaps deserves special care. The famous algorithm by Needleman and Wunsch [44] did not impose any restrictions on the penalty assigned to a gap of a certain length. For reasons of computational speed this was later specified to assigning a cost function linear in the number of deleted (inserted) residues [45]. This amounts to penalizing every single indel. However, since a single indel tends to be penalized such that it is considerably inferior to a mismatch, this choice resulted in longer gaps being quite expensive and thus unrealistically rare. As a remedy, one mostly uses a gap penalty function which charges a gap open penalty for every gap that is introduced and penalizes the length with a gap extension penalty which is charged for every inserted or deleted letter in that gap. Clearly, this results in an affine linear function in the gap length, frequently written as g(k) = a + b * k [46].

With the variant of the dynamic programming algorithm first published by Gotoh [47] it became possible to compute optimal alignments with affine linear gap penalties in time proportional to the product of the lengths of the two sequences to be aligned. This afforded a speed-up by one order of magnitude compared to a naive algorithm using the more general gap function. A further breakthrough in alignment algorithms development was provided by an algorithm that could compute an optimal alignment using computer memory only proportional to the length of one sequence instead of their product. This algorithm by Myers and Miller [48] is based on work by Hirschberg [49].

Depending on the biological setting several kinds of alignment are in use. When sequences are expected to share similarity extending from the beginning of the sequences to their ends they are aligned globally. This means that each residue of either sequence is part either of a residue pair or a gap. In particular it implies that gaps at the ends are charged like any other gap. This, however, is a particularly unrealistic feature of a global alignment. While sequences may very well share similarity over their entire length (see the example dot plot of two hemoglobin chains in Figure 2.2), their respec tive N- and C-termini usually are difficult to match up and differences in length at the ends are more of a rule than an exception. Consequently, one prefers to leave gaps at the ends of the sequences un-penalized. This variant is easy to implement in the dynamic programming algorithm. Two modifications are required. Firstly the initialization of the matrix needs to reflect the gap cost of 0 in the margin of the matrix. Secondly, upon backtracking, one does not necessarily start in the corner of the matrix but much rather searches the margins for the maximum from which to start. Variants of this algorithm that penalize only particular end-gaps are easy to derive and can be used, e.g., to fit one sequence into another or to overlap the end of one sequence with the start of another.

In many cases, however, sequences share only a limited region of similarity. This may be a common domain or simply a short region of recognizable similarity. This case is dealt with by so-called local alignment in an algorithm due to Smith and Waterman [50]. Local alignment aims at identifying the best pair of regions, one from each sequence, such that the optimal (global) alignment of these two regions is the best possible. This relies on a scoring scheme that maximizes a similarity score because otherwise an empty alignment would always yield the smallest distance. Naively, the algorithm to compute a local alignment would need to inspect every pair of regions and apply a global alignment algorithm to it. The decisive idea of Smith and Waterman was to offer the maximization in each cell of the matrix a fourth alternative: a zero to signify the beginning of a new alignment. After filling the dynamic programming matrix according to this scheme, backtracking starts from the cell in the matrix that contains the largest value.

Upon comparing a dot plot and a local alignment one might notice regions of similarity visible in the dot plot but missing in the alignment. While in many cases there exist gap penalty settings that would include all interesting matching regions in the alignment, generally it requires the comparison with the dot plot to notice possible misses. This problem is remedied by an algorithm due to Waterman and Eggert [51] which computes suboptimal, local, and non-overlapping alignments. It starts with the application of the Smith-Waterman algorithm, i.e., a dynamic programming matrix is filled and backtracking from the matrix cell with the largest entry yields the best local alignment. Then the algorithm proceeds to delineate a second-best local alignment. Note that this cannot be obtained by backtracking from the second-best matrix cell. Such an approach would yield an alignment largely overlapping the first one and thus containing little new information. Instead, those cells in the dynamic programming matrix are set to zero from where backtracking would lead into the prior alignment. This can be seen as "resetting" the dynamic programming matrix after having deleted the first alignment. Then the second best alignment is identified by looking for the maximal cell in the new matrix and starting backtracking from there. Iteration of this procedure yields one alternative,

2.4 Database searching I: single sequence heuristic algorithms | 39

non-overlapping alignment after the other in order of descending quality. Application of this algorithm avoids possibly missing matching regions because even under strong gap penalties the procedure will eventually show all matching regions.

There is an interesting interplay between parameters, in particular the gap penalty, and the algorithmic variant used. Consider a pair of sequences whose similar regions can in principle be strung together into an alignment (as opposed to sequences containing repeats which cannot all be seen in a single alignment). Under a weak gap penalty the Smith-Waterman algorithm has a chance to identify this entire alignment. On the other hand, not knowing about the similarity between the sequences ahead of time, a weak gap penalty might also yield all kinds of spurious aligned regions. The Waterman-Eggert algorithm is a valid alternative. The gap penalty can be chosen fairly stringently. The first (i.e., the Smith-Waterman) alignment will then identify only the best matching region out of all the similar regions. By iterating the procedure, though, this algorithm will successively identify the other similar regions as well. For a detailed discussion of these issues see Vingron and Waterman [52].

Continue reading here: Database searching I single sequence heuristic algorithms