Thomas Lengauer

Glossary of algorithmic terms in bioinformatics

A* algorithm

A heuristic procedure for navigating large search spaces that are formally described by graphs. The algorithm originated in the field of Artificial Intelligence in the 1960s (Hart et al., 1968). The algorithm finds a cheapest path in a graph from a start vertex to one among a selected set of goal vertices. As usual with path search algorithms, the path is extended away from the start vertex step by step. At each step, the most promising next vertex v to proceed to is selected. The selection is based on a scoring function that comprises two terms:

• a term amounting to the cost of the initial path segment from the start vertex to v

• a term estimating the cost of the optimal remaining path segment from v to the closest goal vertex.

If we require that the estimate underestimates the actual costs of the remaining path segment, then we can prove that the A* algorithm always finds a cheapest path. If the estimator is always zero, the path search amounts to a uniform-cost (Dijkstra-type) traversal of the graph; for unit edge costs this is a normal breadth-first traversal (Russell & Norvig, 1995).
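The two-term selection scheme can be sketched in a few lines of Python; the toy graph, edge costs, and heuristic below are made up for illustration and are not taken from this book:

```python
import heapq

def a_star(graph, start, goals, h):
    """A* search: `graph` maps a vertex to (neighbor, edge_cost) pairs,
    `h` is an admissible heuristic (it never overestimates the remaining cost)."""
    # Priority queue ordered by f = g (cost of the initial path segment)
    #                            + h (estimate for the remaining segment).
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, v, path = heapq.heappop(frontier)
        if v in goals:                      # cheapest path to a goal found
            return g, path
        for w, cost in graph.get(v, []):
            g2 = g + cost
            if g2 < best_g.get(w, float("inf")):
                best_g[w] = g2
                heapq.heappush(frontier, (g2 + h(w), g2, w, path + [w]))
    return None

# Toy graph: with h == 0 the search degenerates to Dijkstra-style search.
graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("g", 5)], "b": [("g", 1)]}
cost, path = a_star(graph, "s", {"g"}, lambda v: 0)
```

Any underestimating heuristic returns the same cheapest path while typically expanding fewer vertices than the zero heuristic.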

In bioinformatics, the A* algorithm has been used in conformational search in docking a (rigid) ligand into a flexible protein structure ((Leach, 1994), see Chapter 7 of Volume I). The path assigns rotameric states to side chains in the protein binding pocket, one after the other. A goal vertex is one that corresponds to a complete assignment of rotameric states. The start vertex is the one where no rotameric states have been assigned.

The estimation of the cost of the optimal remaining path segment is performed by generalizing a process called dead-end elimination (Desmet et al., 1992). This process precomputes for each rotamer j of a side chain whether there is another rotamer k of the same side chain that leads to an energetically better conformation, independent of the rotameric choices for other side chains. If so, rotamer j is excluded from the search. Leach (Leach, 1994) generalizes this process from just excluding alternatives to computing a lower bound on the path cost by considering optimum rotamers independently for each side chain yet to be placed and adding up the corresponding cost terms. This extension, which Leach also adapted to the placement of side chains in homology-based modeling (Leach & Lemon, 1998), is described in detail in Chapter 5 of Volume I.

Bioinformatics - From Genomes to Drugs. Volume I: Basic Technologies. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2

Branch & bound

Search method for tree-structured search spaces (Cormen et al., 1990). The original problem instance corresponds to the root of the tree. Internal nodes of the tree are generated iteratively, beginning with the root, by decomposing a problem instance into alternatives. Each alternative becomes a child of the node representing the subdivided problem. The solution of any one of the alternatives (children) solves the original problem. Problem instances that are not decomposed further correspond to leaf nodes of the tree; they can be solved directly without any further subdivision. The solution of the original problem then amounts to completing any path from the root to a leaf node. Each interior node represents a partial problem solution, given by the set of alternatives already decided upon, which are represented by the nodes along the path.

In general the tree is very large but not very deep. The depth of the tree (the length of the longest path from the root to a leaf) corresponds to the number of decisions to be taken to assemble the solution, which is limited in general (linear or a small-degree polynomial in the problem size). In contrast, the number of leaves of the tree, i.e. the number of alternative solutions, is usually huge, often exponential in the problem size. Thus traversing the tree efficiently becomes a major problem.

Often, problem solutions (complete or partial) are labeled with costs. Then an optimal solution is represented by a path from the root to a leaf with optimal cost. In general, in order to find an optimal solution, the whole tree has to be enumerated. Complete enumeration can be avoided if we can provide, for each node v inside the tree, a lower bound (for minimization; an upper bound for maximization) on the cost of all leaves contained in the subtree rooted at v. If this bound exceeds the cost of the best solution found so far in the tree traversal, the whole subtree rooted at v can be disregarded. The sharper the lower bound, the larger the regions of the tree that can be eliminated in this branch & bound procedure.
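As an illustration of the bounding step, the following Python sketch applies the scheme to a small 0/1 knapsack instance (a standard textbook example, not one from this book). Each tree level decides about one item, and the bound optimistically adds the values of all items not yet decided:

```python
def knapsack_bb(items, capacity):
    """Branch & bound for 0/1 knapsack (maximization). `items` is a list of
    (weight, value) pairs; the bound at level i adds all undecided values."""
    best = {"value": 0, "choice": []}
    suffix = [0] * (len(items) + 1)
    for i in range(len(items) - 1, -1, -1):       # optimistic upper bound
        suffix[i] = suffix[i + 1] + items[i][1]

    def search(i, weight, value, choice):
        if value > best["value"]:
            best["value"], best["choice"] = value, choice[:]
        if i == len(items) or value + suffix[i] <= best["value"]:
            return                                # prune: bound cannot beat best
        w, v = items[i]
        if weight + w <= capacity:                # branch 1: take item i
            choice.append(i)
            search(i + 1, weight + w, value + v, choice)
            choice.pop()
        search(i + 1, weight, value, choice)      # branch 2: skip item i

    search(0, 0, 0, [])
    return best["value"], best["choice"]

items = [(2, 3), (3, 4), (4, 5), (5, 8)]   # hypothetical (weight, value) pairs
value, chosen = knapsack_bb(items, capacity=7)
```

The sharper the bound (here: the plain sum of undecided values), the earlier subtrees are pruned; replacing it with a fractional-knapsack bound would prune more.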

When traversing the whole tree, a branch & bound algorithm is certain to produce an optimal solution. Even when stopped early, the algorithm generally has already found some solution that it can return. In addition, the lower bounds computed so far can be used to generate a lower bound on the optimal solution of the problem and thus a quality certificate for the solution returned. The longer the algorithm runs, the better the solution it returns and the closer the lower bound comes to the cost of the optimal solution. In typical cases, a branch & bound algorithm that runs to completion spends about 10% to 20% of its time finding the optimal solution. The remaining 80% to 90% is spent on completing the tree traversal in order to make sure that there is no better solution.

A branch & bound algorithm for protein threading was introduced by Lathrop and Smith (Lathrop & Smith, 1996). Here, the alternatives amount to different alignments of secondary structure blocks in proteins, and the cost function estimates the energy of the resulting protein structure. The algorithm is described in detail in Chapter 6 of Volume I. A branch & bound algorithm for docking ligands into protein structures with flexible side chains is described in (Althaus et al., 2000); see also Chapter 8 of Volume I. The ^ A* algorithm with dead-end elimination is a variant of a branch & bound algorithm, as described in Chapter 5 of Volume I (for side chain placement).

Cluster analysis

A basic technique for data classification. Cluster analysis aims at grouping the items in a set into clusters. The items in each cluster should share commonalities that justify grouping them. The similarity or difference between items is usually measured by a function on pairs of items. If we measure difference (similarity), the values decrease (increase) with increasing similarity between a pair of items.

Clustering procedures can be distinguished with respect to the structure of the item set. In bioinformatics, mostly, the items are points in Euclidean space of some dimension or else they are nodes in a graph or hypergraph.

Hierarchical clustering procedures iteratively partition the item set into disjoint subsets. There are top-down and bottom-up techniques. The top-down techniques partition the item set into two or more subsets, where the number of subsets can be fixed or variable. The aim is to maximize the similarity of the items within each subset or to maximize the difference between the items in different subsets. The bottom-up techniques work the other way around and build a hierarchy by iteratively assembling larger clusters from smaller ones until the whole item set is contained in a single cluster. A popular hierarchical technique is nearest-neighbor clustering, which works bottom up by iteratively joining the two most similar clusters into a new cluster.

Algorithms for computing phylogenies are closely related to hierarchical clustering methods (Gusfield, 1997).

Nonhierarchical techniques do not build up a tree structure of clusters.

Probably the most popular nonhierarchical clustering technique groups the items - in this case points in Euclidean space - around a fixed number k of cluster centers. The clusters are first determined randomly, and then the following two steps are iterated: first the center of each cluster is computed, then each item is reassigned to the cluster whose center is closest. The iteration proceeds until the clusters no longer change. If we extend this method so as to recalculate the cluster centers every time an item is reassigned to a new cluster, we obtain the widely used k-means clustering algorithm (Kaufman & Rousseeuw, 1990).
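The batch variant described first (recompute the centers once per pass) can be sketched as follows; the example points are invented, two well-separated groups in the plane:

```python
import random

def k_center_clustering(points, k, seed=0):
    """Center-based clustering of 2-D points: assign each point to the nearest
    center, recompute each center as its cluster mean, repeat until the
    assignment stabilizes. (The online variant updating centers after every
    single reassignment is the classical k-means algorithm.)"""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))      # random initial centers
    assignment = None
    while True:
        new_assignment = [min(range(k),
                              key=lambda c: (p[0] - centers[c][0]) ** 2
                                          + (p[1] - centers[c][1]) ** 2)
                          for p in points]
        if new_assignment == assignment:       # clusters no longer change
            return centers, assignment
        assignment = new_assignment
        for c in range(k):                     # recompute each center as mean
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_center_clustering(points, k=2)
```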

In bioinformatics, clustering is used in the analysis of protein families - sequence families and multiple alignment (Chapter 2 of Volume I), structure families (Chapter 6 of Volume I), or families of proteins with the same function. Furthermore, gene expression data are clustered with respect to similarity of the expression profiles (Chapter 5 of Volume II), and clustering is used for grouping similar molecular conformations in protein modeling (Chapter 5 of Volume I) or molecular structures and interaction geometries in docking (Chapters 7 and 8 of Volume I). In proteomics, gel images are clustered (Chapter 4 of Volume II). Clustering is also a major approach to analyzing molecular similarity (Chapter 6 of Volume II).

Divide & conquer

One of the basic algorithmic techniques. It solves a problem by subdividing it into a usually small set of subproblems, solving those subproblems, and then assembling their solutions into a solution of the original problem (Cormen et al., 1990). The subproblems are solved in the same fashion, unless they can be solved directly without further subdivision. In contrast to the tree search described above under ^ branch & bound, in divide & conquer algorithms we need to solve not just one but all subproblems of a problem in order to solve that problem. The procedure is most effective if the subdivision is balanced, in the sense that all subproblems of the same problem have about equal size.
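Merge sort is perhaps the simplest textbook illustration of the scheme: a balanced subdivision into two halves, recursive solution of both subproblems, and an assembly (merge) step:

```python
def merge_sort(seq):
    """Divide & conquer: split into (balanced) halves, solve both subproblems
    recursively, then assemble by merging the two sorted halves."""
    if len(seq) <= 1:                      # directly solvable leaf case
        return list(seq)
    mid = len(seq) // 2
    left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]   # append the leftover tail
```

Because the subdivision is balanced, the recursion depth is logarithmic and the total runtime is O(n log n).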

Divide & conquer algorithms for sequence alignment are described in Chapter 2 of Volume I and Chapter 2 of Volume II; applications to the analysis of molecular similarity are described in Chapter 6 of Volume II.

Dynamic programming

Tabular method of discrete optimization. The basis of dynamic programming is the principle of prefix optimality. This principle states that the optimal solution of the optimization problem can be composed of optimal solutions of a limited number of smaller instances of the same type of problem. The tabular method incrementally solves subinstances of increasing size. The solutions of larger problem instances are assembled from previously computed solutions of smaller problem instances. The method stops when the original problem instance has been solved. The runtime for solving each subinstance tends to be constant or linear in the problem size. The number of subinstances can often be limited to a low-degree polynomial in the size of the problem instance (Cormen et al., 1990). For hard optimization problems, however, this number may grow exponentially.

Sequence alignment (Chapter 2 of Volume I) is a prime example of a problem in bioinformatics that can be solved efficiently with dynamic programming. Dynamic programming for sequence alignment is also discussed extensively in (Gusfield, 1997; Waterman, 1995). Other examples of dynamic programming algorithms occur in Chapters 3 and 6 of Volume I and Chapter 2 of Volume II.
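A minimal sketch of the tabular method for global sequence alignment (Needleman-Wunsch style scoring; the match, mismatch, and gap values below are arbitrary choices, not parameters from this book):

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming: D[i][j] is the optimal
    score of aligning the prefixes a[:i] and b[:j]; each entry is assembled
    from three previously computed, smaller subinstances."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):         # aligning a prefix against gaps
        D[i][0] = i * gap
    for j in range(1, len(b) + 1):
        D[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = D[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            D[i][j] = max(diag,            # substitute or match
                          D[i - 1][j] + gap,   # gap in b
                          D[i][j - 1] + gap)   # gap in a
    return D[len(a)][len(b)]
```

The table has (|a|+1)(|b|+1) entries and each takes constant time, giving the familiar quadratic runtime; the optimal alignment itself is recovered by tracing back through the table.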

Energy minimization

Optimization method that aims at minimizing the energy of a molecular system, which is usually described in the coordinates of the participating atoms. Since we usually deal with large molecules, the dimension of the space over which the energy is to be minimized is large (it can run into many thousands). Usually the energy function has a large number of local minima whose energy is close to or even equal to that of the global minimum. In this context, finding the global minimum is very difficult (the multi-minima problem). Depending on the type of energy function, different optimization methods are used. If we use local optimization methods that descend to the next local minimum, we have to choose the starting points for the optimization judiciously. Still, in the presence of many local minima, with local optimization we cannot hope to sample the solution space adequately to find the global minimum. Thus Monte Carlo methods such as ^ simulated annealing are employed that manage to escape from local minima.

Often the energy minimization is hampered by the fact that the energy function used is inaccurate. In addition, nature actually minimizes the free energy as opposed to the internal energy. Accurate computational models for free energy scoring require sampling of many conformations and are hence burdened by either substantial statistical error or extensive computing resource requirements. A particular problem is the incorporation of water, which makes large entropic contributions to the free energy. If water is modeled as a large set of discrete molecules, large computing requirements result. The alternative is to model water as a continuum, which, however, introduces another approximation.

Energy minimization is a voluminous field with dominating physico-chemical and thermodynamic aspects (Christensen & Jorgensen, 1997; Church & Shalloway, 2001; Trosset & Scheraga, 1998).

Fourier transform

A very important mathematical operation that transforms functions (continuous Fourier transform) or vectors (discrete Fourier transform) into the frequency domain (Bracewell, 1986). This means that the function is described in terms of the contributions of basic (sine) waves with different frequencies that compose the function. Besides its very fundamental physical relevance, the Fourier transform, especially its discrete version, has computational significance (Cormen et al., 1990). The reason is twofold. First, the Fourier transform of a vector of length n can be computed quite quickly, namely in time O(n log n), by the so-called fast Fourier transform (FFT) algorithm (Cooley & Tukey, 1965). Second, the important but quite complex operation on vectors that is called convolution can be reduced to the much easier pointwise multiplication when transformed into the frequency domain. The convolution of two vectors of length n requires time O(n²) when performed in a straightforward fashion. If we instead first Fourier transform the vectors, then pointwise multiply the results and then inverse Fourier transform, we spend O(n log n) time doing the transforms and O(n) time for the multiplications, which amounts to a total of O(n log n) time. The convolution operation is related to polynomial and integer multiplication and therefore these are operations that can be sped up by using the Fourier transform. In protein-protein docking (Chapter 8 of Volume I), the convolution operation comes up in the analysis of the steric complementarity of two rigid protein structures. Thus Fourier transform helps there, as well.
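The convolution-by-FFT route can be sketched with a textbook radix-2 Cooley-Tukey FFT (an illustrative implementation, not tuned code):

```python
import cmath

def fft(v, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(v) must be a power of two.
    With invert=True it computes n times the inverse transform."""
    n = len(v)
    if n == 1:
        return list(v)
    sign = 1 if invert else -1
    even = fft(v[0::2], invert)
    odd = fft(v[1::2], invert)
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def convolve(a, b):
    """Convolution via the frequency domain: transform, multiply pointwise,
    transform back -- O(n log n) instead of the straightforward O(n^2)."""
    n = 1
    while n < len(a) + len(b) - 1:       # pad to a power of two
        n *= 2
    fa = fft(list(a) + [0] * (n - len(a)))
    fb = fft(list(b) + [0] * (n - len(b)))
    prod = [x * y for x, y in zip(fa, fb)]
    inv = fft(prod, invert=True)         # still scaled by n
    return [round((z / n).real) for z in inv[:len(a) + len(b) - 1]]
```

For integer input the rounding recovers exact coefficients, e.g. convolving the coefficient vectors of two polynomials multiplies the polynomials.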

Genetic algorithm

A class of algorithms that mimics nature's way of optimization by variation and selection (Goldberg, 1989). Genetic algorithms work on populations of solutions to an optimization problem. Each solution (phenotype) is encoded in string form, called the chromosome (genotype). Chromosomes are varied either by mutation only (evolutionary algorithm (Fogel et al., 1966)) or by mutation and cross-over, i.e. merging parts of two chromosomes in analogy to the recombination process taking place during meiosis (Holland, 1973). A selection function is applied to the resulting phenotypes. This process is iterated through a number of generations. Typical population sizes range from 10 to a few hundred. Generally, a few dozen generations are computed. The key ingredients of the development of an evolutionary algorithm are the suitable choice of the chromosomal representation of the solutions, the mutation and cross-over operators, and the diverse parameters governing the execution of the algorithm.
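A toy genetic algorithm for the "OneMax" problem (maximize the number of 1-bits in a bit-string chromosome); the population size, operators, and parameters below are arbitrary illustrative choices:

```python
import random

def genetic_algorithm(fitness, length, pop_size=30, generations=60, seed=1):
    """Toy genetic algorithm on bit-string chromosomes: binary tournament
    selection, one-point cross-over, and a single point mutation per child."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():                        # binary tournament selection
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)            # one-point cross-over
            child = p1[:cut] + p2[cut:]
            i = rng.randrange(length)                 # point mutation
            child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum, length=20)   # "OneMax": count the 1-bits
```

Even this crude setup drives the population close to the all-ones optimum within a few dozen generations; real applications put most of the effort into the encoding and the operators.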

In bioinformatics, genetic algorithms are popular as a tool for fast prototyping. For this reason genetic algorithms can be found in practically all facets of bioinformatics. In contrast, exploiting all of the power that lies in the approach of genetic algorithms requires an expert, and is an involved task.

This book mentions genetic algorithms in Chapters 6 and 7 of Volume I. An overview of genetic algorithms in molecular recognition is given in (Willett, 1995). In addition, we mention here recent application of genetic algorithms to protein structure superposition (Szustakowski & Weng, 2000), protein folding (Schulze-Kremer, 2000), protein-protein docking (Gardiner et al., 2001), and drug selection from combinatorial libraries (Illgen et al., 2000).

Greedy algorithm

A heuristic optimization strategy that operates on the same kind of tree-structured search space as the ^ branch & bound algorithm. Whereas branch & bound algorithms aim at covering the whole tree, greedy algorithms generate only one path from the root of the tree to a (hopefully) close-to-optimal leaf node. The path is extended from the root towards the leaf by selecting one alternative to proceed with at each step. This selection is usually made heuristically; therefore, we also speak of greedy heuristics. The notion "greedy" originates from the fact that each decision for an alternative, once taken, is never reconsidered or taken back.

Since greedy algorithms generate only one path in the tree, they are generally much faster than branch & bound algorithms. Therefore greedy algorithms are an alternative to branch & bound algorithms whenever runtime efficiency is of prime importance or no effective lower bounding procedures are available.

On the other hand, greedy algorithms usually provide solutions of significantly lower quality than branch & bound algorithms, and no quality certificates can be given for solutions computed by greedy algorithms. Generally, greedy algorithms are used in an early stage of algorithm development. There is a subclass of optimization problems for which greedy algorithms exist that are guaranteed to provide the optimal solution. These problems are called matroid problems (Cormen et al., 1990).
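Coin change gives a compact textbook illustration of both points: the greedy choice (always take the largest coin that still fits) is fast and never reconsidered, but for non-canonical coin systems it is not optimal:

```python
def greedy_coin_change(coins, amount):
    """Greedy heuristic: at each step take the largest coin that still fits.
    Each decision, once taken, is never reconsidered or taken back."""
    chosen = []
    for c in sorted(coins, reverse=True):
        while amount >= c:
            chosen.append(c)
            amount -= c
    return chosen if amount == 0 else None   # None: greedy failed to pay exactly

# Greedy is fast but not always optimal: for coins {1, 3, 4} and amount 6 it
# returns [4, 1, 1] (three coins), while the optimum is [3, 3] (two coins).
```

For canonical coin systems such as {1, 2, 5} the greedy result happens to be optimal; proving when a greedy strategy is exact is what the matroid framework formalizes.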

Greedy algorithms are explicitly or implicitly mentioned in many chapters of this book.

Gibbs sampler

A specific, in some sense the simplest, Markov Chain Monte Carlo algorithm (^ Monte Carlo methods). In Gibbs sampling one samples the variables of a solution space one after the other. Each variable is drawn from its distribution conditional on the values of the other variables that were previously chosen. The conditional distributions must be accessible. The order in which the variables are sampled, one by one or in groups, differs among different Gibbs sampling methods (Geman et al., 1984; Lawrence et al., 1993).
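When the conditional distributions are known in closed form, the sampler is only a few lines. The standard bivariate normal with correlation rho is a convenient textbook case (not an example from this book): its conditionals are one-dimensional normals.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=0):
    """Gibbs sampling from a standard bivariate normal with correlation rho.
    Each variable is drawn in turn from its conditional given the other:
    x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x."""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho * rho)
    x = y = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, sd)       # draw x conditional on the current y
        y = rng.gauss(rho * x, sd)       # draw y conditional on the new x
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
corr_est = sum(x * y for x, y in samples) / len(samples)
```

The empirical correlation of the samples approaches the target value rho, illustrating that the chain converges to the desired joint distribution.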

Hidden Markov models

Hidden Markov models (HMMs) are a popular variant of stochastic models for analyzing biomolecular sequences. HMMs are used for characterizing protein families (Baldi & Chauvin, 1994), (multiply) aligning protein sequences (Krogh et al., 1994), detecting remote protein homologies (Karplus et al., 1998), protein threading (Karplus et al., 1997), predicting signal peptides (Nielsen & Krogh, 1998) and transmembrane domains (Sonnhammer et al., 1998) in proteins, finding secondary structures in proteins (Asai et al., 1993), finding and characterizing promoters in DNA sequences (Pedersen et al., 1996), and gene finding (Henderson et al., 1997; Kulp et al., 1996).

An HMM is essentially a Markov chain (^ Monte Carlo methods). Each state inside the Markov chain can produce a letter, and it does so by a stochastic process that chooses one of a finite number of letters in an alphabet. Each letter has a characteristic probability of being chosen; that probability depends on the state and on the letter. After the state produces a letter, the Markov chain moves to another state, according to a transition probability that depends on the prior and the succeeding state.
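The generative process just described can be sketched directly; the tiny two-state model below, including all its transition and emission probabilities, is invented for illustration:

```python
import random

def generate_sequence(start, trans, emit, end, rng=random.Random(0)):
    """Generate a letter sequence from an HMM: in each state, emit a letter
    according to the state's emission distribution, then move to the next
    state according to the transition probabilities."""
    def draw(dist):                      # sample an outcome from {outcome: p}
        r, total = rng.random(), 0.0
        for outcome, p in dist.items():
            total += p
            if r < total:
                return outcome
        return outcome                   # guard against rounding at the tail
    state, letters = start, []
    while state != end:
        letters.append(draw(emit[state]))
        state = draw(trans[state])
    return "".join(letters)

# Hypothetical two-state model: m1 emits mostly A, m2 emits G or T and may end.
trans = {"m1": {"m2": 1.0}, "m2": {"m2": 0.5, "end": 0.5}}
emit = {"m1": {"A": 0.9, "C": 0.1}, "m2": {"G": 0.5, "T": 0.5}}
seq = generate_sequence("m1", trans, emit, end="end")
```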

We exemplify HMMs with Figure 1, which is taken from (Eddy, 1998). Figure 1 depicts a hidden Markov model for protein sequence alignment. Sometimes this specific topology of a hidden Markov model is called a profile hidden Markov model. There is a specific start state b and a specific final state e. The Match states (m1 to m3) represent sequence positions. The Insert (i0 to i3) and Delete (d1 to d3) states allow for inserting or jumping over sequence positions. Above each match state and below each insert state is a histogram of the probabilities with which each letter from the alphabet of amino acids is generated by the model. The model has been trained on the five three-letter sequences shown on the left.

Figure 1: A profile hidden Markov model (with permission of Oxford University Press).

The length of the hidden Markov model is the number of match states. A profile HMM stands for a family of - in this case protein - sequences. Basically there are three algorithmic problems associated with an HMM.

1. Training: Set the probabilities of an HMM such that the protein sequences you want the protein family modeled by the HMM to belong to are generated with the highest possible probability (Baum et al., 1970; Rabiner, 1989). Special measures are taken to avoid overfitting (see Regularization below). Training is by far the most complex of the three problems. It amounts to a minimization in a high-dimensional space with an irregular cost function with many minima. General solution methods boil down to local optimization procedures combined with a judicious choice of a limited set of starting solutions. Much progress can still be made on this problem.

2. Classification: Given a trained model and a protein sequence, evaluate the probability with which the model generates the sequence. If this probability is above a suitable threshold, predict that the sequence belongs to the family modeled by the HMM; otherwise predict the opposite. For profile HMMs and other special cases of HMMs this problem can be solved in linear time in the length of the model with ^ dynamic programming methods. In general the algorithm requires time O(n²t), where n is the number of states in the HMM and t is the length of the sequence to be generated by the HMM (Rabiner, 1989).

3. Alignment: Given a trained model and a protein sequence, determine which path through the model is the most probable one to generate the sequence. In the case of profile HMMs this path represents an alignment of the sequence to the model. The alignment can also be done in linear time in the length of the model with ^ dynamic programming methods (Forney, 1973; Rabiner, 1989). Thus, by aligning many sequences to the model one after the other, we can generate a multiple alignment in time linear in the length of the model times the number of sequences. This is in contrast to many other methods of multiple sequence alignment, whose runtime grows exponentially in the number of sequences. Of course, HMMs transfer much of this complexity into the training phase.
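Problem 2 (classification) is solved by the forward algorithm, a ^ dynamic programming method; the sketch below uses a hypothetical two-state HMM over the alphabet {A, B} with all probabilities made up for illustration:

```python
def forward_probability(seq, states, start_p, trans_p, emit_p):
    """Forward algorithm: f[s] holds the probability of generating the prefix
    seen so far and ending up in state s. Each update considers all state
    pairs, giving the general O(n^2 * t) runtime."""
    f = {s: start_p[s] * emit_p[s][seq[0]] for s in states}
    for letter in seq[1:]:
        f = {s: sum(f[r] * trans_p[r][s] for r in states) * emit_p[s][letter]
             for s in states}
    return sum(f.values())               # total probability of the sequence

# Hypothetical two-state HMM over the alphabet {A, B}
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"A": 0.9, "B": 0.1}, "s2": {"A": 0.2, "B": 0.8}}
p = forward_probability("AB", states, start_p, trans_p, emit_p)
```

Replacing the sum over predecessor states by a maximum (and keeping back-pointers) turns this into the Viterbi algorithm, which solves problem 3.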

A very nice general tutorial on HMMs is given in (Rabiner, 1989). An extension of HMMs from finite automata (the deterministic counterpart of Markov chains) to context-free languages leads to the concept of a stochastic context-free grammar. Such grammars have been used in an analogous way to predict RNA secondary structures (Grate et al., 1994).

Molecular dynamics

In molecular dynamics methods the equations of motion of a molecular assembly are integrated numerically (Sprik, 1996). Thus we obtain not only a molecular structure but a trajectory of molecular motion. Due to high-frequency atomic oscillations the time step for the numerical integration must be very small (about 10⁻¹⁵ sec). This results in extraordinarily high computing requirements even for computing trajectories that span a few nanoseconds. In contrast, proteins fold in a few seconds. This difference in time scale cannot be overcome by sheer computing power. Thus methods are under investigation that can smooth out the high-frequency oscillations without invalidating the final outcome (Elber et al., 1999).

Molecular dynamics methods are primarily used for the refinement of structural models (Li et al., 1997) or the analysis of molecular interactions (Cappelli et al., 1996; Kothekar et al., 1998). In both cases the time scales to be simulated are within range of current computing technology. Another application area is the study of allosteric movements of proteins (Tanner et al., 1993). Molecular Dynamics approaches to protein-ligand docking are described in Chapter 7 of Volume I.

Monte Carlo methods

General class of algorithmic methods that involve a stochastic element, i.e., that let the computer make random decisions (Binder & Landau, 2000). An important subclass, the so-called Markov Chain Monte Carlo (MCMC) methods, can be understood as acting on Markov chains. A Markov chain is a stochastic finite automaton that consists of states and transitions between states (Feller, 1968). At any point in time we reside in a certain state of the Markov chain. At discrete time steps we leave this state and move to another state of the chain. This transition is taken with a certain probability characteristic for the given Markov chain; the probability depends only on the two states involved in the transition. Often, from a given state one can move only to a state within a small set of states. This set is called the neighborhood of the state from which the move originates.

A Markov chain is ergodic if it eventually reaches every state. If, in addition, a certain symmetry condition - the so-called criterion of detailed balance or microscopic reversibility - is fulfilled, then, as we throw dice to decide which state transition to take one after the other, the chain converges to the same stationary probability distribution over its states, no matter in which state we start. Thus, traversing the Markov chain affords an effective way of approximating its stationary probability distribution (Baldi & Brunak, 1998).
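The stationary distribution of a small ergodic chain can be approximated exactly as described, by throwing dice for the transitions and counting state visits. The two-state chain below is made up; by direct calculation its stationary distribution is (2/3, 1/3):

```python
import random

def estimate_stationary(trans, steps=200000, seed=0):
    """Estimate the stationary distribution of an ergodic Markov chain by
    traversing it and counting how often each state is visited."""
    rng = random.Random(seed)
    states = list(trans)
    state = states[0]
    counts = {s: 0 for s in states}
    for _ in range(steps):
        r, total = rng.random(), 0.0
        for nxt, p in trans[state].items():   # throw dice for the transition
            total += p
            if r < total:
                state = nxt
                break
        counts[state] += 1
    return {s: c / steps for s, c in counts.items()}

# Two-state chain; balance gives pi = (2/3, 1/3)
trans = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
pi = estimate_stationary(trans)
```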

Monte Carlo algorithms pervade bioinformatics. Especially the chapters dealing with molecular structures mention this class of algorithms frequently, for instance, to generate diverse ensembles of molecular structures or to perform energy minimization. ^ Simulated annealing is a special instance of a Monte Carlo algorithm. Chapter 7 of Volume I contains a detailed overview of Monte Carlo approaches to protein-ligand docking.

Figure 2: Neuron j in a neural network.

Neural network

A computational connection pattern that is formed in (simplified) analogy to biological (cortical) neural networks and used successfully for the classification of unstructured data. Neural networks have a broad range of applications in biology and chemistry (Schneider & Wrede, 1998).

In general, a neural network is a directed graph whose nodes are called neurons (see Figure 2). There may be a directed edge (i, j) from a neuron i to a neuron j. The directed edges (i, j) of the network are labeled with numbers w_{i,j} called weights. Each neuron is equipped with an activation function a. Neural networks are distinguished by the type of activation function they use. The more powerful classes of neural networks use nonlinear, e.g., sigmoid, activation functions such as a(x) = 1/(1 + e^(-x)).

The neural network operates as follows. It is presented numbers x_1, ..., x_n at distinguished input nodes. Each neuron applies its activation function to combine the numbers presented on its incoming edges and propagates the result along its outgoing edges. Specifically, a neuron j with, say, k incoming edges from neurons i_1, ..., i_k computes the value x_j = a(w_{i1,j} x_{i1} + ... + w_{ik,j} x_{ik}). The numbers produced at the outputs of the network are the result of the computation of the neural network. Data classification can be performed, e.g., by thresholding the single output of a neural network, or by providing a separate output for each class and comparing the numbers produced at a set of distinguished output nodes.
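The computation of a single neuron and of a layered feed-forward network can be sketched as follows; the 2-2-1 topology and the weights are hand-picked for illustration, not learned:

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """One neuron j: weighted sum of the incoming values, passed through the
    sigmoid activation a(x) = 1 / (1 + e^(-x))."""
    x = sum(w * v for w, v in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(inputs, layers):
    """Layered feed-forward network: the outputs of one layer are the inputs
    of the next; `layers` is a list of weight matrices (one row per neuron)."""
    values = inputs
    for layer in layers:
        values = [neuron_output(values, weights) for weights in layer]
    return values

# Hypothetical 2-2-1 network with hand-picked weights
layers = [[[4.0, -4.0], [-4.0, 4.0]],   # hidden layer, two neurons
          [[6.0, 6.0]]]                 # single output neuron
out = feed_forward([1.0, 0.0], layers)[0]
```

Thresholding `out` at, say, 0.5 turns the network into a two-class classifier; training would adjust the weights rather than hand-picking them.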

Neural networks can be distinguished by their topology. Layered feedforward networks, the most popular type, are acyclic and organized in layers of neurons. They always produce outputs. The class of functions they can compute increases with the number of layers of neurons. In contrast, neural networks with cycles are more complicated. They can become unstable or oscillate or exhibit chaotic behavior.

The training procedure for neural networks optimizes the weights along the arcs of the network in order to optimally reproduce the function to be learned. This function is implicitly represented by the training data. Here we have to take care to avoid overfitting (^ Regularization). The training problem is hard (^ hidden Markov models) and the training procedure approaches the problem heuristically, usually by performing some type of local optimization (Russell & Norvig, 1995).

The application of feed-forward neural networks has proved effective in protein secondary structure prediction (see Chapter 6 of Volume I, (Rost et al., 1994)). Other applications of neural networks in bioinformatics include the recognition of protein sequence features such as transmembrane domains (Jacoboni et al., 2001) and the recognition of features in DNA sequences (promoters (Matis et al., 1996; Nair et al., 1995), coding regions (Cai & Chen, 1995), etc.), as well as the interpretation of NMR spectra for protein structure resolution (Hare & Prestegard, 1994) and the discrimination between drug-like compounds and those that are not (Sadowski & Kubinyi, 1998).

A particularly interesting kind of neural network for bioinformatics is the Kohonen map (Kohonen, 1990). This is a two-dimensional field of artificial neurons that affords the projection of adjacencies in high-dimensional Euclidean spaces down to two dimensions. Among other things, Kohonen maps are used to classify sets of organic compounds (Anzali et al., 1996; Cai et al., 1998; Kirew et al., 1998), to distinguish between different protein binding sites (Stahl et al., 2000), to investigate protein families (Ferran et al., 1994) or aspects of protein structure (Ding & Dubchak, 2001; Schuchhardt et al., 1996), to analyze the codon usage in different bacterial genomes (Wang et al., 2001) and to cluster gene expression data (Tamayo et al., 1999).


Regularization

There is a general problem with methods that learn structure from limited amounts of training data, such as the ^ hidden Markov models, ^ neural networks, or ^ support vector machines mentioned in this glossary. The problem is called overfitting. If we carry the training procedures mentioned in the respective sections to their extreme, we will memorize everything we have seen in the training data, including the regularities that generalize to other data for which we want to make predictions in the future, and the irregularities that are specific to the training data and carry no information on data seen in the future. In order to avoid overfitting, the training data have to be supplemented with information that accounts for our lack of knowledge of the data yet to be seen. In some sense, the classification function to be learned is smoothed out. It takes the training data into account, but also the lack of information on what is yet to come. Regularization is an involved concept that can be approached with theory (Vapnik, 2000). In practice, regularization is often performed heuristically, sometimes even in an ad hoc fashion.

The regularization procedure takes a different form for each method of statistical learning. When training hidden Markov models, the derived probabilities not only take the observed sequences into account but also use so-called prior distributions that formulate some hypothesis on the occurrence of output characters in the case that we have no additional information (such as the observed sequences). An expected background distribution of amino acids or nucleotides is a natural starting point for such a prior distribution.

For ^ neural networks, a common implementation of regularization is not to carry the training procedure to the end but to stop some time before. ^ Support vector machines are automatically regularized by maximizing the margin of separation between the two classes of training data. The amount of regularization can be controlled directly by the parameter that governs the tradeoff between correct classification of the training data and the size of the margin.

Simulated annealing

Another Markov Chain Monte Carlo (MCMC) algorithm. Here each state of the Markov chain is labeled with a "cost" or "energy". Moves to neighboring states are accepted if they involve a decrease in cost. If they involve an increase in cost, they are only accepted with a certain probability. This probability is determined by the Metropolis criterion e^(-c/kT), where c is the increase in cost, T is a temperature parameter and k is the Boltzmann constant (Metropolis et al., 1953). T starts out at a finite initial temperature and is then decreased according to a so-called cooling schedule until it approaches zero. This means that, as the algorithm proceeds, increases in cost become less and less likely. Partly because of its paradigmatic relationship with statistical physics, simulated annealing is a popular version of an MCMC algorithm. The algorithm has the facility of escaping from local minima reached during execution of the algorithm. It is frequently used in global optimization (Kirkpatrick, 1984). Simulated annealing cooling schedules are also very popular with Molecular Dynamics, e.g. for structural refinements using experimental constraints (Brunger et al., 1997) or for generating and refining molecular conformations in homology-based protein modeling (Chapter 5 of Volume I). Simulated annealing approaches for protein-ligand docking are described in Chapter 7 of Volume I.
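A sketch of the procedure for a one-dimensional double-well "energy"; the function, step size, and geometric cooling schedule are arbitrary illustrative choices, and the Boltzmann constant k is folded into the temperature units:

```python
import math
import random

def simulated_annealing(f, x0, t0=2.0, cooling=0.999, steps=20000, seed=0):
    """Simulated annealing for a 1-D function: propose a random move, always
    accept improvements, accept a cost increase c with probability e^(-c/T),
    and let the temperature T decay by a geometric cooling schedule."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best_x, best_f = x, fx
    for _ in range(steps):
        x_new = x + rng.uniform(-0.5, 0.5)
        f_new = f(x_new)
        c = f_new - fx
        if c <= 0 or rng.random() < math.exp(-c / t):   # Metropolis criterion
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling                                    # cool down
    return best_x, best_f

# Double-well energy: local minimum near x = +2, global minimum near x = -2
f = lambda x: (x * x - 4) ** 2 + x
best_x, best_f = simulated_annealing(f, x0=0.0)
```

While the temperature is high the chain crosses the barrier between the wells freely; as it cools, uphill moves die out and the search settles into one of the minima.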

Support vector machine

Support vector machines are a technology developed by Vapnik and others for classifying unstructured data (Burges, 1998; Christianini & Shawe-Taylor, 2000) that has gained much attention in bioinformatics lately. Support vector machines classify points in multi-dimensional Euclidean space into two classes. They extend linear classifiers (see Figure 3a) that classify the points by placing a hyperplane into the space, separating it into two half-spaces. Training the classifier amounts to selecting the hyperplane that best fits a set of preclassified points, in the sense that the smallest number of preclassified points comes to lie in the wrong half-space. In fact, this is not quite the case, since we want to avoid overfitting to the training data (^ Regularization).

Figure 3: Support vector machines: (a) linear classifier, (b) nonlinear classifier, (c) nonlinear classifier mapped to a high-dimensional space, where the classification becomes linear. Crosses and circles denote pre-classified points on which the support vector machine is trained.
