actln, H=0


SOD, H=0.991


NFM, H=0.986

GRalpha4, H=1.585


| 2 ■ JM-

■2 3 1 2

> M U


123456789 Tims

123456789 Time

123456789 Tims

123456789 Tims

Shannon entropy applied to gene expression time series. The equation for Shannon entropy (H) is shown above. For each gene, entropy accounts for the probability, p (or frequency), of a level of gene expression, i. Data from [8] (see Figure 5.1) are binned into three expression levels. SOD, superoxide dismutase; NFM, neurofilament medium; GRalpha4, GABA receptor alpha 4 subunit. Actin has zero entropy (zero information content) because its expression is invariant over the nine time points. GRalpha4 has the highest entropy because it exhibits all three levels of expression, spread out as evenly as possible over the time series (three at level 1, three at level 2, and three at level 3). A bias toward any level will reduce the entropy, as in the case of SOD (biased toward level 2) and NFM (biased toward level 1). Entropy accounts only for the frequencies of expression values, not their positions in the time series; therefore, the shapes of these patterns have no effect on entropy.

Shannon entropy applied to gene expression time series. The equation for Shannon entropy (H) is shown above. For each gene, entropy accounts for the probability, p (or frequency), of a level of gene expression, i. Data from [8] (see Figure 5.1) are binned into three expression levels. SOD, superoxide dismutase; NFM, neurofilament medium; GRalpha4, GABA receptor alpha 4 subunit. Actin has zero entropy (zero information content) because its expression is invariant over the nine time points. GRalpha4 has the highest entropy because it exhibits all three levels of expression, spread out as evenly as possible over the time series (three at level 1, three at level 2, and three at level 3). A bias toward any level will reduce the entropy, as in the case of SOD (biased toward level 2) and NFM (biased toward level 1). Entropy accounts only for the frequencies of expression values, not their positions in the time series; therefore, the shapes of these patterns have no effect on entropy.

entropy can then be considered as possible drug targets for the condition being studied. In some cases, entropy may assign a low score to genes that could be considered good drug target candidates. For example, entropy will be low for genes that show a spike in expression at the beginning of a disease process, with invariant expression thereafter. There is also the possibility of false high entropy values from binning artifacts; these can be eliminated by visual inspection of expression patterns once a short list of putative drug targets has been selected. Most important, however, is that entropy provides a way of extracting essential information from an overwhelming number of gene expression profiles, allowing limited drug development resources to be focused on a small number of genes.

In neuroscience, we generally consider neurotransmitter receptors to be ideal drug targets, not only because of their accessibility, but also because of their functional specificity. The results of our entropy analysis [12] of the spinal cord data [8], are consistent with this assumption. Of the various functional categories of genes, the ionotropic neurotransmitter receptors are three-fold over-represented at the highest entropy level. At the second highest entropy level, the metabotropic neurotransmitter receptors are twofold over-represented. Growth factors (peptides) and their receptors (receptor tyrosine kinases), on the other hand, are over-represented at several low entropy values. This suggests that neurotransmitter receptors are the major participants in spinal cord development.

Entropy can also be used to select out toxicity or diagnostic markers. In Fuhrman et al. [12], using DNA microarray data we demonstrated that certain genes are affected by toxic doses of several compounds. As in the analysis of spinal cord data, certain functional categories of genes were over-represented at high entropy. Other genes from these high entropy functional categories can be used as a focus of further toxicological research.

Another way of applying entropy is in the analysis of anatomical data. This is particularly relevant in neuroscience given the complexity of central nervous system anatomy. With the help of microarrays and other large-scale assay techniques, it is now possible to assay thousands of genes in multiple brain regions. Entropy can be applied to these data by substituting anatomical regions for time points, providing a measure of anatomical complexity that will vary from one gene to another. For example, for nine brain regions and three levels of gene expression, a gene would have the highest possible entropy if it occurred at all three levels, with each level occurring in three regions (three 1s, three 2s, and three 3s). As with time series, the shape of the distribution makes no difference. In contrast, a gene that showed no variation in expression across the nine regions would have zero entropy. The high entropy gene could be considered a major player in determining the complexity of brain anatomy. However, even in the case of anatomical studies, the time component must be considered. A gene with low entropy in its anatomical distribution could have high entropy at a later time if the brain is changing as a result of disease, injury, or normal development. Anatomical studies are therefore best approached as time series, such as development.

The use of Shannon entropy for analyzing gene expression data is a departure from traditional approaches. First, it is intended for time series, not individual data points. Further, entropy ignores a point traditionally regarded as important in the interpretation of gene expression data: the n-fold increase or decrease. Entropy accounts for the complexity of a time series without regard to the distance between expression levels. For example, a gene that exhibits two levels of expression could go from level one to level three, while another could go from level two to level three, yet both can have the same entropy. In regard to this, it could be argued that absolute differences in mRNA concentrations are important biochemically, but do not necessarily reflect the relative importance of genes within the system as a whole. It is reasonable to assume that each gene functions normally at its respective concentration(s), and that differences in the quantities of mRNAs or proteins reflect differences in biochemical efficiency.

Methods such as Shannon entropy will be needed in future studies of degenerating and regenerating tissue. Entropy may be used to select out the most likely drug target candidates from data on animal models of Alzheimer's and other degenerative diseases. Similarly, we can study the regeneration of animal tissues that fail to regenerate in humans (e.g., spinal cord), and use entropy to determine which genes are the major effectors of regeneration. Some of these genes will undoubtedly be familiar - receptors for which drugs are already available - and could be tested with experimental therapeutic drugs in animal models.


Clustering is a method for grouping patterns by similarities in their shapes. This method is now being applied to large-scale gene expression patterns from eukaryotic cell cultures [15, 16, 17] and tissues [8, 17]. This technique can be used to suggest functions for newly discovered genes that have no known homologs, or to assign new functions to known genes. For example, if a new gene clusters with (has a temporal expression pattern similar to) a group of peptide receptors, the new gene may share a function with those receptors. In Figure 3, genes of unknown function are shown clustered in a dendrogram with familiar genes expressed during rat spinal cord development (also compare to wave 1 in Figure 5.1).

Before gene expression patterns can be clustered, the "distance" between each gene and every other assayed gene must be determined. This distance is based on the difference in expression level at each time point. So, as with Shannon entropy, the entire time series is taken into account. In Figures 5.1 and 5.3, the euclidean distance was used. Euclidean distance for a pair of genes is calculated by squaring the differences in expression levels between the two genes at all time points, adding the squares, and taking the square root of the sum, D = — bi)2. The completed calculations result in a

"mileage chart" for all the genes assayed, and serve as the inputs for a clustering algorithm.

Clustering is based on "distance," placing genes with highly similar patterns adjacent to each other in a dendrogram. By displaying the results in a dendrogram, one can quickly assess whether two or more genes have similar temporal expression patterns without having to search through large sets of graphs. Displays of this type become more important with very large data sets containing thousands of genes. Large cluster trees could be examined on a computer screen so that it will be possible to zoom in on a selected tree branch.

Euclidean distance is essentially a measure of positive, linear correlations; however, other similarity measures may be used for clustering. For example, mutual information, an information theoretic measure, may be used to capture positive, negative, and non-linear correlations all at the same time. A pictorial explanation of the concept of mutual information along with instructions on doing calculations can be found in [18]. Mutual information is based on Shannon entropy (H = —Lpi log2 p¡; see above explanation of entropy) and is calculated as follows: M(X, Y) = H(X) + H(Y) — H(X, Y),

Clustering of novel with known genes [8]. The ESTs (expressed sequence tags), SC6 and SC7 cluster with neurotransmitter-related genes. Upper right panel: SC6 has a temporal expression pattern similar to that of nAChRd (nicotinic acetylcholine receptor delta subunit). Lower left panel:

Clustering of novel with known genes [8]. The ESTs (expressed sequence tags), SC6 and SC7 cluster with neurotransmitter-related genes. Upper right panel: SC6 has a temporal expression pattern similar to that of nAChRd (nicotinic acetylcholine receptor delta subunit). Lower left panel:

SC7 shares a pattern with nestin and two splice variants of the neurotransmitter-metabolizing enzyme, GAD (glutamatic acid decarboxylase). Middle panel: These genes are shown in a branch of the cluster tree for 112 genes expressed in the developing rat spinal cord.

where H is Shannon entropy, and X and Y are the two gene expression patterns being compared. Let's say that gene X has a binary expression pattern 0111111000 (at each time point the gene is either on or off), and that gene Y has an expression pattern 0001100111. Because X has 40% 0s and 60% 1s, H(X) = -0.4 log(0.4) - 0.6 log(0.6) = 0.97. Y has 50% 0s and 50% 1s, so H(Y) = -0.5 log(0.5) - 0.5 log(0.5) = 1.00. H(X, Y) refers to the frequency at which each 0 or 1 in X corresponds to a 0 or 1 in Y at the same time points in the series. Zero corresponds to zero in 10% of the series, 1 in X corresponds to zero in Y in 40% of the series, zero in X corresponds to 1 in Y 30% of the time, and 1 corresponds to 1 20% of the time, so H(X, Y) = -0.1 log(0.1) -0.4 log(0.4) -0.3 log(0.3)-0.2 log(0.2) = 1.85. Finally, M(X, Y) = 0.97 + 1.00 - 1.85 = 0.12. Mutual information would be at a maximum if H were at a maximum for both X and Y (if both X and Y were 50% 0s and 50% 1s), and if there were a total consistency of matching between the two patterns (e.g., if X were 0011100110 and Y 1100011001).

The benefit of mutual information is that it accounts for the possibility that two or more genes may respond in different ways to the same genetic regulatory input. In other words, one gene may be up-regulated and another down-regulated even though both are receiving the same input. These two regulated genes may be considered functionally related despite their divergent responses. A comparison of agglomerative clustering using a mutual information and Euclidean distance measure is presented in [10]. In conclusion, the results obtained by clustering are largely determined by the distance measure used.

There are many algorithms for clustering. The most popular methods for gene expression analysis are agglomerative clustering and the k-means algorithm. In the agglomerative clustering method, one first defines the similarity measure between two gene expression profiles, which is usually the Euclidean distance of normalized gene expression. The algorithm groups genes most similar to each other in a pairwise fashion. First, two genes closest to each other among all pairs of genes are joined together to form a new node (the distance between the genes determines the length of the line joining them in the dendrogram). The distance between the new node and all other genes are updated. Next, the two genes that are closest to each other are selected and joined together. This process continues, each time the closest pair is chosen, until all the genes are exhausted. This process results in a tree with branches that provide a hierarchical classification for the gene expression data.

Another popular algorithm is the k-means method, which partitions genes into n clusters. Each gene is initially assigned to a cluster. This assignment is then adjusted by reassigning a gene to the cluster whose center is closest to it. Since the center of the cluster also changes, the process is iterative. One can determine the right number of clusters by performing this procedure with different numbers of clusters until a certain clustering quality criterion is optimized.

A criticism sometimes heard about this approach is that cluster boundaries are arbitrary. Naturally, someone must decide where on a dendrogram one cluster ends and another begins. These decisions are based on non-computational judgements, as are decisions about whether a particular clustering method provides a result that an observer would have produced by visual inspection alone. In that context, we have found that by including in our calculations the slopes between data points, the clustering results "improve" - the results are more similar to what one would expect based on visual examination of the expression patterns (unpublished observations).

One method for assessing the quality of a clustering method is to establish the correlations between expression clusters and known functional families, gene sequence families, or gene clusters from other experiments (Figure 5.1). If the clusters capture genes known to be related according to other functional criteria, one may conclude that the clustering algorithm has identified meaningful biological groupings. However, different clustering methods may provide differential matchings with alternative systems of classifying genes, e.g. sequence families, functional families, shared promoter elements (unpublished observations). The truth is that computational analysis of any kind is only a guide for interpreting data, with human judgment as the final arbiter.

Clustering allows us to form new hypotheses about gene function by grouping genes not previously associated. However, this technique alone is not sufficient for identifying new therapeutic drug target candidates. In that context, it may be useful to combine clustering with other analytical methods, such as Shannon entropy.

Combining analytical methods in the development of experimental therapies

The development of new therapeutic drugs is no easy task given the amount of testing required to bring a drug to market. According to the Pharmaceutical Research and Manufacturers of America (1996), only about 1 in 1000 drug candidates ever makes it past the preclinical stage of testing. Considering the effort required to discover and develop new drugs, the importance of using highly efficient methods for drug target identification becomes clear. This is especially true in the development of combinatorial therapies, in which two or more drugs are administered simultaneously. Large-scale gene expression surveys provide a wealth of possible drug targets, any of which might be used in combination therapies. The result of all these possibilities would be a combinatorial explosion - too many combinations to deal with in the laboratory.

As we explained previously, Shannon entropy is a first step in narrowing the range of drug target candidates from hundreds or thousands of genes expressed in parallel down to a more manageable number for laboratory research. With this method, there is no need to assume that any gene perturbed by a disease or condition is a good drug target. Entropy narrows the field to those genes with the most complex expression patterns over the entire course of a disease, with high entropy genes being the biggest participants in the disease process. A previous study demonstrated that the number of high entropy genes in a perturbed system is quite small (about 5-8%) (12).

At this point, clustering may be applied to the high entropy short list. In the case of a 10 000-gene microarray assay, this list might contain approximately 500 genes. Clustering would identify groups of high entropy (drug target candidate) genes that may function in parallel. In that context, it is reasonable to assume that genes that cluster together have a high probability of functioning together as drug targets. The coupling of Shannon entropy with clustering may therefore be useful in identifying drug targets for use in combinatorial therapies.

How these methods were selected

In selecting analytical methods, the most important consideration is whether the mathematics fits the biological problem. In the case of applying Shannon entropy to drug target discovery, the primary consideration is the biological definition of a drug target. Shannon entropy has a similar definition, and is suitable for application to time series. Similarly, in the case of clustering, the biological assumption that genes that fluctuate in parallel may share a function is the basis for using a method that groups genes by expression similarities. The continued generation of large-scale data sets will provide biologists and computer scientists with ample opportunities to collaborate on the application of these and other techniques.

Welcome to the future: reverse engineering of genetic networks

Is it possible to solve the "ultimate question'' of inferring the major connections of the genetic network of molecular interactions based on well-designed experiments using large-scale gene expression data? Such a global reverse engineering strategy would revolutionize the elucidation of disease pathways and concomitant drug target discovery. Current research in this area is very encouraging: it has been demonstrated that this strategy is theoretically feasible [18], and results from initial experimental studies appear plausible [19].

Beginning from first principles, genomic sequence information is the evolutionary memory of an organism or species that allowed it to execute the functions required to persist in its environment. Essentially, this "static" information is expanded into the "dynamic" information of the molecular (genetic) network underlying development and many of its behavioral responses. While only the complex "hardware" of the cell is able to correctly interpret the genetic programs encoded in sequence information, we may consider an inverse approach of inferring the connections and dynamic rules of molecular interactions through their activity patterns. This defines the term "reverse engineering."

Reverse engineering has been demonstrated to work in principle for model genetic networks of binary genes connected through logical rules [18]. A key issue is the data requirement necessary to provide sufficient information to capture the complexity of the molecular network. In model networks it has been shown that only a tiny subset of all possible behaviors need to be known in order to infer network architecture with accuracy [18], provided that the network exhibits significant constraints (biomolecular networks are far removed from randomly connected networks) [20].

Reverse engineering has been successfully applied to relatively simple biomolecular systems. Using a combination of cross-correlation analysis and multidimensional scaling, the glycolytic pathway was reconstructed from metabolite activity data from an in vitro enzymatic reactor system [21]. A complete spatio-temporal model of developmental gene expression in Drosophila was constructed for a small gene set based on models of differential equations and protein expression data [22].

There are too many unknowns to currently assess whether reverse engineering will work with certainty on large-scale gene expression data that is just becoming accessible with current technologies. In the empirical tradition, this should by no means preclude earnest attempts. One such study employed a simple linear model of gene interaction to supply a sufficient number of constraints that make parameter fitting possible given the present limitation on data complexity. Data complexity not only involves the numbers of genes assayed, but the diversity of conditions under which these activities are measured. In this example (Figure 5.4), temporal expression data from CNS development (two central nervous system areas) and CNS injury (one brain area) covering 27 time points was used to fit a linear gene interaction matrix using multiple regression [19]. This strategy proved successful in that the original data were accurately reproduced by the model (Figure 5.4a), and that major functional gene interactions could be proposed (Figure 5.4b). Interestingly, the conclusion of a positive feedback interaction for two of the genes studied (GAD65 and preGAD67) was reached independently in another work on the developmental modeling of these genes [23]. On the whole, these predictions provide important clues for further experimental investigation. Improving the reverse engineering strategy through acquisition of more informative data and testing of different models holds much promise for a systematic discovery of gene interactions, therapeutic targets and drug function. Overall, the implications of

Reverse engineering of a CNS genetic network using a linear model. Experimental gene expression data (circles; development and injury), and simulation using a linear model (lines). Dotted line: spinal cord, starting at day —11 (embryonic day 11). Solid line: hippocampus development, starting at day

—7 (embryonic day 15). Dashed line: hippocampus kainate injury, starting at day 25. b) Gene interaction diagram for the GABA signaling family inferred from developmental gene expression data (spinal cord and hippocampus data). Solid lines correspond to positive interactions, broken lines suggest inhibitory relationships. Reprinted with permission from [19].

the reverse engineering strategy for a systematic and efficient therapeutic design could be revolutionary.

Genomes and proteomes

Gene expression (mRNA) data may be limited in that we generally lack the corresponding protein expression or protein activity data, and few systematic investigations exist regarding whether RNA expression and protein activity are well-correlated [24]. This will undoubtedly have to wait for the development of better technologies for large-scale protein assays. For the present, we may make the parsimonious assumption that there is a positive correlation between gene and protein expression. As technologies advance for the measure of protein expression and activity, it will be possible to apply the same analytical methods used for genomics data.

Concluding remarks

We now have the technological capacity to conduct experiments that address whole physiological processes, rather than mere snapshots of biological events. However, taking full advantage of this technology requires the proper experimental designs and data analysis methods. While it is possible to use large-scale gene expression assay methods for purposes far below their capacity, this would be a waste of an opportunity to study biomedical problems that were unapproachable prior to the genome projects.

Making efficient use of large-scale gene expression surveys for drug target identification is an exercise in selecting out the essentials of the resulting data. This requires computational methods that address time courses as well as large numbers of genes. Collaborations between biologists and computer scientists will be essential for the development of future data analysis methods.


1 Forsythe E. and Ritzline P. D. (1998) An overview of the etiology, diagnosis, and treatment of Alzheimer Disease. Phys Ther 78(12), 1325-1331.

2 Flint A. J., van Reekum R. (1998) The pharmacologic treatment of Alzheimer's disease: a guide for the general psychiatrist. Can J Psychiatry Sep; 43(7), 689-97.

3 Grundman M., Corey-Bloom J., Thal L. J. (1998) Perspectives in clinical Alzheimer's disease research and the development of antidementia drugs. J Neural Transm Suppl. 53, 255-75.

4 Farlow M. R., Evans R. M. Pharmacologic treatment of cognition in Alzheimer's dementia. Neurology 1998 Jul; 51(1 Suppl 1), S36-44; discussion S65-7.

5 Knopman D. S. Current pharmacotherapies for Alzheimer's disease. Geriatrics (1998) Sep; 53 Suppl 1, S31-4.

6 Shalon D., Smith S. J., and Brown P. O. (1996) A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome_Res 6, 639645.

7 Velculescu V. E., Zhang L., Vogelstein B., and Kinzler K. W. (1995) Serial analysis of gene expression. Science 270, 484-487.

8 Wen X., Fuhrman S., Michaels G. S., Carr D. B., Smith S., Barker J. L., and Somogyi R. (1998) Large-scale temporal gene expression mapping of CNS development. Proc Natl Acad Sci USA 95, 334-339.

9 Carr D. B., Somogyi R., Michaels G. (1997) Templates for looking at gene expression clustering. Statistical Computing and Graphics Newsletter 8(1), 20-29.

10 Michaels G., Carr D. B., Wen X., Fuhrman S., Askenazi M., Somogyi R. (1998) Cluster Analysis and Data Visualization of Large-Scale Gene Expression Data. Pacific Symposium on Biocomputing 3, 42-53.

11 Shannon C. E. and Weaver W. (1963) The Mathematical Theory of Communication. Univ. of Illinois Press, Chicago, IL.

12 Fuhrman S., Cunningham M. J., Wen X., Zweiger G., Seilhamer J. J., and Somogyi R. (2000) The application of Shannon entropy in the identification of putative drug targets. Biosystems 55(1-3), 5-14.

13 Altman J. and Bayer S. A. (1994) Atlas of Prenatal Rat Brain Development. CRC Press.

14 Fuhrman S., Cunningham M. J., Liang S., Wen X., and Somogyi R. (2000) Making sense of large-scale gene expression data with simple computational techniques. American Biotechnology Laboratory, 18, 68-70.

15 Eisen M. B., Spellman P. T., Brown P. O., and Botstein D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-8.

16 Ferea T. L., Botstein D., Brown P. O., and Rosenzweig R. F. (1999) Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc Natl Acad Sci USA 96, 9721-6.

17 Perou C. M., Jeffrey S. S., Van de Rijn M., Rees C. A., Eisen M. B., Ross D. T., Pergamenschikov A., Williams,

C. F., Zhu S. X., Lee J. C. F., Lashkari D., Shalon D., Brown P. O., and Botstein D. (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 96, 9212-7.

18 Liang S., Fuhrman S., and Somogyi R. (1998) REVEAL, A general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 3, 18-29.

19 D'haeseleer P., Wen X., Fuhrman S., and Somogyi R. (1999) Linear Modeling of mRNA Expression Levels During CNS Development and Injury. Pacific Symposium on Biocomputing 4, 41-52 http://www.smi.stanford.edu/projects/ helix/psb99/Dhaeseleer.pdf

20 Somogyi R. & Sniegoski C. (1996) Modeling the Complexity of Genetic Networks, Complexity 1(6), 45-63.

21 Arkin, A., Shen, P., Ross, J. (1997) A Test Case of Correlation Metric Construction of a Reaction Pathway from Measurements, Science 277, 1275-1279.

22 Reinitz, J., Mjolsness, E., and Sharp, D. H. (1995) Model for cooperative control of positional information in Drosophila by bicoid and maternal hunchback, J. Exp. Zool 271, 47-56.

23 Somogyi R., Wen X., Ma W., Barker J. L. (1995) Developmental kinetics of GAD family mRNAs parallel neurogenesis in the rat spinal cord, J Neurosci 15(4), 25752591.

24 Anderson L., Seilhamer J. (1997) A comparison of selected mRNA and protein abundancies in human liver. Electrophoresis 18, 533-537.

Screening of Drug Databases

Martin Stahl, Matthias Rarey and Gerhard Klebe 6.1


The discovery of new and innovative lead compounds is one of the key elements in a drug development project. In consequence much effort has been devoted to various techniques to support and improve this procedure. At present, the establishment of compound collections at pharmaceutical companies, augmented by parallel and combinatorial chemistry, provides large libraries of molecules that can be searched for leads1'. High-throughput facilities for compound screening can generate biological activity data for many targets in very short time [1]. In the early stage of high-throughput screening this new technique has been greeted with much enthusiasm and the end of any rational and knowledge-based approaches was predicted [2]. Today, several years later, a more realistic view has been accepted. Despite the introduction of highly sophisticated robotics and advanced engineering techniques to automate high-throughput screening, such biological testing is not without problems. False positives and unspecific binding of test candidates together with low solubility, non-specific reactions with the protein material resulting in surface adhesion or protein precipitation are only some of the problems that puzzle scientists. Success rates reported for the translation of apparent hits from HTS into leads to be used for lead optimization are quite depressing. At the 1999 Cambridge Healthtech Institute meeting on high-throughput technologies, no pharmaceutical company could report a successful lead candidate from an HTS hit [3]. From a knowledge-based point of view, however, it is most disappointing that high-throughput screening can hardly contribute to our understanding of why and how the detected hits act upon the target. Any increase in the under-

1) A lead compound is a structure that marks the first step in the development of a drug. Lead finding is the process of selecting a lead compound from a compound collection or library. Lead optimization is the process of modifying the lead compound in order to enhance properties such as specificity, bioavailability, and others.

Bioinformatics - From Genomes to Drugs. Volume II: Applications. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2

standing of drug action on a molecular level is produced only once experimental structural biology or molecular modeling come into play to detect structural similarity or possible common binding modes among the obtained actives. Such analyses are prerequisite for a rational approach to lead finding and for any further optimization of leads to potential drugs.

As an alternative and complementary approach to experimental high-throughput screening virtual screening has emerged. This term comprises a large variety of computational tools to select potentially active and bio-available compounds from libraries [4, 5]. These tools range from fast filter criteria and crude molecular similarity measures to sophisticated molecular docking and scoring techniques. Which of these tools are applied in a given project depends on the knowledge assembled about a given target and the binding properties of its ligands. Furthermore, the usage of individual tools is restricted by the size of the databases to be screened.

While experimental high-throughput screening already relies on an enormous stock of compounds available to automated test robots, virtual screening could possibly screen any imaginable molecule. The universe of possible drug-like molecules is a tremendously large compound space -there may be around lO100 drug-size organic molecules [5] - from which virtual libraries of any imaginable size could be generated. For reasons of fast substance acquisition and easy validation of the initial computer hits, however, libraries of existing or easily accessible molecules are screened first. A prominent example is the Available Chemicals Directory (ACD) [6], a collection of more than 200000 commercially available compounds. Other databases, for example the World Drug Index (WDI) [7] or the MDL Drug Data Report (MDDR) [8], were compiled from compounds synthesized and studied for their action upon a biological target. The Beilstein database [9] contains a collection of several million entries of organic molecules for which the chemical synthesis has been described. Another fruitful application of virtual screening tools is the selection of building blocks of combinatorial libraries. Each combinatorial library offers fast synthetic access to a large variety of compounds out of a limited set of building blocks. Computers afford generating, handling and recording the vast variety of possible members of combinatorial libraries. Accordingly, the design and selection of a targeted combinatorial library can ideally be combined with criteria derived from virtual screening experiments.

As a first stage in a virtual screening experiment it is reasonable to apply fast filter criteria to eliminate compounds with undesired physicochemical properties. Many drug development projects have failed and still fail in a later state due to insufficient pharmacokinetic properties determining drug absorption, distribution, metabolism, excretion, and toxicity (ADMET parameters) of the developed drug candidates [10]. At present, the various criteria determining these properties are still poorly understood. However, an increasing body of knowledge is collected and allows for the definition of structural restraints that molecules have to meet for being a potential drug. Often, the structure-activity relationships corresponding to the various ADMET parameters are orthogonal to the ones responsible for drug binding to the target under consideration. Accordingly, criteria to select and filter compound libraries to focus on candidate molecules corresponding to acceptable ADMET parameters should be applied at an early stage of the computer screening process. Examples of simple filters are molecular weight thresholds, the total number of donor and acceptor groups [11], or neural networks that have been trained to recognize structural patterns increasing the general "drug-likeness" of compounds [12].

Subsequent steps of compound selection in virtual screening are based on knowledge about the criteria responsible for binding to a particular target and involve concepts of molecular similarity or complementarity. Methods directly involving the complementarity of receptor and ligand are referred to as structure-based methods, because they require that the three-dimensional structure of the target be known, for example, by the determination of its crystal structure, by NMR experiments or via homology modeling. If this is the case, the geometry of the binding-site can be used as a "mold" for calculating possible binding modes of putative candidate molecules and to select molecules based on their fit to the binding pocket. Usually, this is done by means of fast flexible docking algorithms (see Chapter 7 of Volume 1). If no 3-D structure of the target is available, the bioactive conformation of a ligand can likewise be used to search for active compounds by means of algorithms that produce relevant superpositions with respect to a predefined pharmacophore. Both docking and superpositioning methods require fast approaches to detect the features that are responsible for topologically diverse candidate molecules being similar with respect to their recognition in the same portion of a binding pocket. At each step of this process a reliable score of the achieved placement of a candidate molecule into the binding site or onto a reference ligand has to be computed. Accordingly, an essential prerequisite for the success of any virtual screening approach is a robust scoring function to rank computer-generated binding modes or to estimate the degree of similarity obtained by a mutual superposition of a candidate molecule onto a reference ligand.

Molecular docking and structural alignment methods are based on three-dimensional structures of candidate molecules that can be generated by rule-based methods such as CORINA [13, 14] or CONCORD [15]. However, methods based on three-dimensional structures are computationally quite demanding and cannot routinely be applied to databases of hundreds of thousands of compounds today. For this reason, a number of fast methods to determine molecular similarity have been developed that operate solely on connectivity and atom types of molecules. Such tools allow rapid pre-screening of very large databases and avoid a conformational analysis of each candidate molecule.

In this chapter we give an overview of currently used virtual screening methods. The performance of some of these tools is then illustrated by means of test libraries consisting of a subset of the WDI and collections of compounds active at several pharmaceutical^ relevant targets. In a first test scenario, a single active compound per target is used to retrieve similar compounds from a database. By this strategy, the performance of a number of fast similarity algorithms is compared. In the second scenario, it is assumed that only the receptor structure and its binding site are known. Test databases are docked into the corresponding binding sites and ranked according to the score computed for each candidate molecule. The results of such database rankings give an impression of strengths and weaknesses of current docking and scoring tools.

Methods for virtual screening

In this section we give an overview of tools frequently used for virtual screening of drug databases. Based on whether the three-dimensional structure of the target protein is available or not, protein structure-based and ligand similarity-based screening methods are distinguished. In similarity-based methods compounds are selected by their similarity to a reference molecule known to bind to the target protein. In contrast, in structure-based methods compounds are selected according to their fit to the active site of the target protein (see Chapter 7 of Volume 1 for an introduction to docking techniques).

Ligand similarity-based virtual screening

The experimental determination of protein structures is a difficult task and is not routinely possible. Especially membrane-bound proteins are difficult to crystallize [16]. Therefore, the situation that the target protein structure is unknown is common at an early stage of a pharmaceutical project. In such cases, only the structure of molecules already known to bind to the target protein, for example an endogenous ligand, can be used for virtual screening.

When a molecule binds to the protein, it adopts a conformation that sterically and electrostatically fits into the active site. In order to detect compounds possibly binding into the same active site, one can search for molecules that can adopt conformations similar to the given active molecule. This similarity is defined with respect to size, shape and spatial distribution of functional groups and can be regarded as "chemical similarity" [17-19]. In principle, however, we are interested in describing "biological similarity" in the sense that molecules exhibiting the same biological response will be considered as similar. Biological similarity, however, depends on many unknown factors such as details of the protein structure and, therefore, cannot be computed. Chemical similarity can be regarded as a necessary but insufficient criterion for similar biological response of compounds. Fortunately, searching for chemically similar compounds has been demonstrated to be highly successful in many applications [20-22].

Methods for similarity-based virtual screening can be classified by the kind of information used to define similarity. While topological descriptors are based on the 2-D structure (connectivity, bond orders and atom types) only, 3-D descriptors take into account the spatial arrangement of atoms or their associated properties with respect to each other. Topological (2-D) descriptors

The 2-D structure of a molecule can be used to compute bit strings describing the topology of a molecule in a piecewise manner. Originally introduced to solve the problem of fast substructure searching [19], today, bit strings are the most widely applied descriptors for fast similarity searching. Two approaches to create bit strings can be distinguished: structural keys and fingerprints. In a structural key, each bit represents a molecular fragment that is set to 1 if the fragment is present in a molecule [23]. Although this method performs very well in practice, it has the limitation that a reasonable set of fragments has to be predefined. In order to avoid this lack of generality, molecular fingerprints were introduced [24, 25]. The patterns for a molecule's fingerprint are generated from the molecule itself. Each bit represents a sequence of connected atoms having specified atom types, called a path. The fingerprinting algorithm generates a list of all paths up to a specified length. For each path, a hash function calculates the location of the bit that has to be set. This means that a particular bit represents not just one specific bond path, but several paths. The length of the fingerprint can be reduced by a process called folding. The shorter the length of the fingerprint, the more different paths are represented by the same bit and the less specific information is encoded. Examples for programs based on bit string descriptors are structural keys by MDL [26], Daylight fingerprints [25], and UNITY 2-D screens by Tripos [27]. The latter two are used in the application study presented in this paper. Therefore they will be described in more detail.

Daylight fingerprints (DF) [25] are bit strings generated from bond paths of length zero to seven. The length of the fingerprints can be folded until a specified density of set bits is reached. Alternatively, the fingerprint length can be set to a fixed value. The similarity index used is the Tanimoto coefficient, which is the number of bit positions set to 1 in both strings divided by the number of bit positions set to 1 in at least one of the strings. If a set bit is considered as a feature present in the molecule, the Tanimoto coefficient is a measure of the number of common features in two molecules [28, 29].

UNITY 2-D screens (UF) [27] are a mixture of fingerprints and structural keys. Unity allows the user to specify the individual components of the bit string, which can be paths within a specific length range or a variety of substructures. For this study, the default 988-bit strings were used. About 94% of the bits in the string are reserved for encoding paths of lengths 2 to 6 using atom type and bond type information. Another 3% of the bits in the string are reserved for atom counts of frequently occurring hetero atom types and the remaining 3% are used for important fragments such as benzene rings and heterocycles. The Tanimoto coefficient is used as a similarity index.

Several other descriptors based on the topology of molecules were developed and have been successfully applied. Reviews on descriptor technology can be found in [19, 20, 22, 30]. The following discussion focuses on three recently developed descriptors used in the application study in section 3. The selection of descriptors was based on availability and speed of the software.

Correlation vectors (CV) [31] are created by first assigning a generalized atom type to each atom. In the implementation applied here [32], five atom types were used: hydrogen-bond acceptor (A), hydrogen-bond donor (D), positively charged (P), negatively charged (N) and lipophilic (L). In a second step, each pair of atoms in the molecule is classified by their assigned generalized atom types and the number of bonds by which they are separated. The correlation vector, a string of integers, is a histogram representation of how often each possible pair of atom types is found at each distance. For bond paths with up to n bonds this results in correlation vectors of n times 15 integers. In order to achieve a size-independent description, each integer is divided by the total number of non-hydrogen atoms in the molecule. The Euclidean distance between vectors is used as a similarity index. Values of n — 6 through n — 10 were used in this work and did not give significantly different results. Therefore, n — 6 was chosen for comparison with other methods.

The methods described so far have in common that they create a one-dimensional representation of the molecule from its 2-D structure, either a bit string or an integer vector. The advantage of this representation is that, in order to calculate a similarity index, there is no need to assign parts of two molecules or molecule representations to each other. This results in simple and extremely fast comparison algorithms. However, since molecules are three-dimensional objects, substantial information is neglected and, in consequence, some accuracy is sacrificed.

The following two descriptors are not based on a one-dimensional representation and, therefore, need an assignment procedure for the similarity calculation. The first descriptor is based on a matrix representation, the second is based on a tree representation. The two methods also have in common that they read a single 3-D conformation of each input molecule.

Topological pharmacophores (TP) [33] are based on data generated by the force field MAB [34, 35]. Hydrogen-bond strengths are assigned to atoms on a continuous scale. The atoms are then classified as hydrogen-bond donors, hydrogen-bond acceptors or hydrophobic. Hydrophobic atoms are grouped together according to a fixed maximum inter-atom distance criterion. A molecule is described by a set of donor, acceptor and hydrophobic centers and a matrix containing the through-space distances between these centers. Topo-logical pharmacophores thus depend on the input conformation to some extent. To arrive at a similarity index for two molecules, centers of the same type are matched onto each other, leading to several alternative pairings for all centers. For each of these pairings, a similarity score between zero and one is calculated based on hydrogen-bond strength and the differences in mutual distances of the matched centers. The final similarity index is the highest similarity score calculated for one of the pairings.

Feature trees (FT) [36] describe the overall topology of a molecule with a tree structure. They are independent from the input conformation of the molecule. Starting from the molecule graph, the corresponding feature tree is created by shrinking rings as well as terminal atoms with their neighbors to single nodes (see Figure 6.1 for an example). A steric and a chemical feature are assigned to each node, describing the part of the molecule represented by the node. The approximated van der Waals volume is used as a steric feature. The chemical feature is an interaction profile summarizing which interactions can be formed by the molecule part. In order to calculate a similarity index for two feature trees, nodes of the trees have to be assigned to each other. This is done by the so-called split-search algorithm, which divides and assigns subtrees in a recursive fashion. In this way, the algorithm calculates an assignment between nodes by optimizing the similarity index. Three-dimensional descriptors

While the previously discussed topological descriptors are based on the molecular graph only, a three-dimensional descriptor uses information about the spatial arrangement of the atoms or functional groups of a molecule. Since the process of complex formation between a receptor and a ligand is mainly driven by steric and electrostatic surface complementarity, methods based on 3-D information seem to be more appropriate than topological features, at first sight. However, in practice it has often been observed that topological descriptors outperform 3-D descriptors [20, 30, 37]. A general problem with 3-D descriptors is handling conformational space. In principle, all low-energy conformers of a molecule would have to

A molecule and its corresponding feature tree representation. Feature tree nodes are colored by the interaction profile. Red:

hydrogen bond acceptor, blue: hydrogen bond donor, green: hydrophobic. Mixed colors correspond to mixed profiles.

A molecule and its corresponding feature tree representation. Feature tree nodes are colored by the interaction profile. Red:

hydrogen bond acceptor, blue: hydrogen bond donor, green: hydrophobic. Mixed colors correspond to mixed profiles.

be taken into account, but conformational flexibility is often neglected and the descriptor is calculated for a single conformation only. On the other hand, it is likely that most data sets used for benchmarking similarity measures are biased towards methods based on 2-D features, because inhibitors are often designed and synthesized in terms of series comprising common structural elements.

As in the 2-D case, there is a wide range of different descriptors (see [3840] for reviews). Most of them are one-dimensional representations, i.e. they project 3-D information into a bit- or integer string. In most cases the analysis of the 3-D structure is reduced to so-called pharmacophore elements or "hot spots" in the ligand structure that are most relevant for specificity and activity. Hydrogen bond donors, acceptors, and hydrophobic centers as well as their mutual distances are commonly used to address a bit in the string. This pair approach can be extended to triplets, where a characteristic value of the formed triangle (circumference, wing area, etc.) instead of the distance is used for addressing [41]. The approach can also be applied to conformation ensembles by disjunction of bit strings created from different conformations [42]. As a logical extension of such 3-point pharmacophores, 4-point pharmacophores have been used [43].

In contrast to the previous approaches, some other descriptors are based on superimposing 3-D structures [44-46], leading to more time-consuming comparison algorithms. An example are so-called field descriptors. They are derived from the analysis of superimposed molecules on a rectangular grid, whose elements contain local properties of each molecule [47]. Other similarity calculations are based on fragment superposition [48, 49]. An interesting new approach for similarity calculations is the determination of affinity fingerprints [50, 51]. Here, the affinity to a reference panel of proteins is determined for each compound in the database - either experimentally or by means of docking. Similarity indices are then calculated by comparing the affinity vectors. The underlying idea is that compounds showing similar binding properties to the reference panel of proteins will also display similarity with respect to unknown targets.

Structure-based virtual screening

In structure-based virtual screening, compounds are selected based on their fit into the active site of a protein. Two components are necessary for structure-based virtual screening, a docking algorithm and a scoring function. Fast docking methods

Docking algorithms require the 3-D coordinates of the target protein and of a candidate molecule from the database (the ligand) as input and predict the geometry of the formed molecular complex. The algorithm has to determine several degrees of freedom, namely the relative orientation and the conformation of ligand and protein. It is assumed that the protein conformation does not change significantly upon binding, since the protein is large and the active site has a concave surface. Furthermore, the structure of a ligand-bound complex is often used as a starting point. Experience shows that conformational properties are conserved much better among ligand-bound protein structures than among the apo (unbound) form and any ligand-bound case. However, the ligand conformation can undergo drastic changes compared to the solvated state. Algorithms considering the ligand conformation in addition to the relative orientation of the molecules are classified as flexible docking algorithms. Some algorithms handle protein flexibility simultaneously, but they have not yet been used for screening purposes.

A wide range of different methods has been applied to solve the docking problem. A review on protein-ligand docking algorithms is given in Chapter 7 of Volume 1 or elsewhere [52-58]. Here only the FlexX method used in the application study is outlined very briefly. FlexX [59-65] is based on an incremental construction algorithm. The method is based on the assumption that conformational flexibility of a ligand can be handled by dividing the ligand into smaller fragments at its rotatable bonds and reassembling the fragments within the active site. The algorithm consists of three phases. First, a set of so-called base fragments is selected. A base fragment is a part of the ligand which is not too flexible and is able to bind to the protein via directed interactions like salt bridges, hydrogen bonds, or directed hydrophobic interactions. The base fragments are then placed into the active site of the protein. For this step a pattern recognition algorithm is used to directly create energetically favorable placements. Finally, the remaining fragments of the ligand are successively connected to the base fragments.

Each time a new fragment is added, it is rotated around the single bond generated in the previous addition step to explore its conformational space. Then a set of high-scoring partial placements is selected for the next buildup step. Scoring protein-ligand complexes

The second integral part of structure-based virtual screening is the scoring function. It operates on the coordinates of the protein-ligand complex and determines a score based on the generated ligand binding conformation and orientation. In an optimal case the score shows a high correlation to the observed binding affinity of the ligand to the protein. Unfortunately, the objective to reliably predict free energies of binding can hardly be achieved. The main reason for this limitation is the fact that important contributions to the binding affinity such as solvation effects or changes in entropy-determining ordering parameters upon complex formation are only poorly understood [66]. Nevertheless, scoring functions available today allow for achieving a good separation of true actives and inactive compounds when they are applied in virtual screening. Overviews on scoring functions can be found in the literature [67-72]. Here only the two scoring functions used in the application study in Section 6.3 will be described.

For scoring protein-ligand complexes in FlexX, a slightly modified version of the Böhm function [73] is used. The function belongs to the class of regression-based empirical scoring functions [71]. Empirical scoring functions are linear combinations of terms, each of which is a simple geometrical function derived from the ligand and receptor coordinates. The terms are chosen such that they intuitively cover important contributions to the total binding free energy. The modified Böhm function is a linear combination of seven terms, namely the entropy loss due to immobilizing tor-sional degrees of freedom in the ligand upon complex formation, neutral and charge-assisted hydrogen bonds, interactions of aromatic ring systems with other groups, lipophilic and ambiguous contact surfaces:

AG = A G0 + AGrotNrot + AGhb E f (AR) f A)

neutral H bonds

ionic interactions

aromatic interactions lipo atom contacts

ambig. atom contacts

Each term sums individual contributions scaled by geometric penalty functions f. For example, the contribution of a hydrogen bond falls into ranges from AGhb to zero, depending on its deviation from idealized hydrogen-bond geometry (length and angle at the hydrogen atom). The relative weights of the individual contributions (the AG values) have been calibrated on protein-ligand complexes with known structure and binding affinity.

An alternative approach is realized in the DrugScore function [74], which belongs to the class of knowledge-based scoring functions. The basic idea of knowledge-based scoring is that interatomic contacts occurring more frequently than in a mean reference state are energetically favorable. Based on this assumption, pair potentials can be generated from occurrence frequencies counting the distribution of atom-type pairs as a function of their inter-atom distances. The DrugScore function consists of two terms, the inter-atomic potentials and a solvent accessible surface (SAS) term:

AW = y^Yl AWtype(ji), type(i) (m)

J2 A<ge(k) (S°k - %) ) ^-

k I

y: adjustable parameter AWtype,type: pair potential

Sfoom: SAS of an atom in complexed or uncomplexed form AWtype: SAS single atom potential i: index for ligand atoms i: index for protein atoms

The first term is the actual pair potential yielding a distance-dependent score contribution for each protein-ligand atom pair within a given distance interval. The SAS-term contains single atom contributions covering solvent effects. In the application described in the following, the DrugScore function was not used in the incremental construction phase, but only to rescore the 500 best scored FlexX solutions for each protein-ligand complex. In addition, the FlexX entropy penalty term was added to the score for each ligand. AGrot was set to 14000 to adjust to the order of magnitude covered by the knowledge-based scores.

Practical virtual screening

As outlined m the introduction, virtual screening m pharmaceutkal mdus-try ;s a tool to find new lead compounds for a target. In a reahstk situation, one or several ligands of the target may be known - from screening, from the literature or from patents - and the goal is to find different compounds with similar inhibitory properties. Alternatively, instead of or in addition to known inhibitors, the target 3-D structure may be available. The similarity measures or docking methods introduced in the previous section can then be used to rank order compound databases according to the likelihood that an individual compound will be an inhibitor. Since a complete and unambiguous separation of active and inactive compounds is not achievable, the term enrichment is used here. A database search can be considered successful if the top ranks of the ordered database are enriched with active compounds, i.e. more densely populated with active compounds than can be expected from a random distribution of hits. In the next two sections of this chapter, the ability of individual methods to enrich active compounds is assessed by means of test databases comprising known active and inactive compounds.

First test scenario: application of fast similarity searching algorithms

In this test scenario, a single active compound per target is used to retrieve similar compounds from a database. Five fast algorithms are used for this purpose. Results are compared and possible ways to combine the results of searches with different algorithms are discussed.

Library generation

Diverse sets of inhibitors were compiled manually for eleven well-established pharmaceutical targets. The main - subjective - selection criterion was to cover as many different compound classes as possible for each target. Compounds were retrieved from the WDI and from available complex crystal structures in the PDB [75]. Where patent literature was consulted in addition, references are given in the following list. The final data set consisted of:

• 43 angiotensin II antagonists [77]

• 33 aldose reductase inhibitors [78, 79]

• 32 cyclooxygenase inhibitors [80-82]

• 55 HI antihistamines [83]

• 41 HIV protease inhibitors [84]

• 21 HIV reverse transcriptase inhibitors [85]

• 19 monoamine oxidase A inhibitors [86, 87]

• 23 gelatinase A inhibitors [88]

• 36 thrombin inhibitors [89, 90] and

• 37 topoisomerase II inhibitors [91].

All compounds were stored as SMILES and converted to single 3-D conformations in Sybyl mol2 format2' by means of CORINA [13, 14]. A C routine was used to generate the molecular structure most likely to be the dominant species at neutral pH under physiological conditions. Hydrogen atoms were removed from acidic groups to generate the anionic form, and were added to aliphatic amines and acyclic amidine nitrogens to generate the charged cationic form.

A subset WDI database was prepared by stepwise removal of

• compounds containing one of the mechanism keywords "ACE", "aldose-reductase", "ampa", "angiotensin", "antihistamine", "beta-lactamase", "cyclooxygenase", "hiv", "mao", "thrombin", "toposiomerase" (in order to ensure complementarity to the individual sets of active compounds in this and other studies),

• compounds with molecular weight greater than 800 or less than 200 (because drug-like molecules should fall in this molecular weight range)

• compounds with saturated carbon chains longer than seven carbon atoms (in order to avoid non-drug-like alcohols and fatty acids)

• compounds without at least one oxygen (very few drugs contain no oxygen atom)

• compounds without at least one hydrogen (to exclude

Continue reading here: Info

Was this article helpful?

0 0