(Box-plot figure: one box per method CV, DF, TP, FT and UF; whiskers extend to 1.5·IQR, outliers are marked individually.) Box plots showing the variation of the percentage of active compounds contained in the top 2% of the ranked database. Results are shown for the COX2 and thrombin libraries only. In the legend on the right, IQR is the inter-quartile range, i.e. the interval of a data set that is centered around the median and contains 50% of the data points.

For both thrombin and COX2, the percentage of active compounds within the top 2% of the database varies between zero and very high values. For thrombin, the average values shown in Figure 6.6 seem to indicate that the order of performance for the five methods is TP, DF, UF > FT > CV, but from the box plots it is apparent that the difference between CV and FT is actually very small. In fact, it is insignificant at an error level of 5%, as can be deduced from statistical analysis of the data by means of the Tukey test [99]. Similarly, the plots for COX2 in Figure 6.3 suggest the order FT > UF, DF > TP > CV. Consulting the box plots, one must conclude that the median percentage of actives retrieved is almost as low for DF as it is for CV and TP. According to the Tukey test, at an error level of 5% the only robust statement about the COX2 data is that FT performs better than CV and TP.

From the above, it can be concluded that it is difficult to predict which of the methods will work best for a particular query molecule. To increase the success rate, it is therefore advisable to perform searches with two or more different methods. But how should the results be combined? Two possible strategies to select compounds from two searches employing methods A and B are:

1. Selection of all molecules that receive high ranks with either method A or method B. For each compound occurring in both the A and B rank lists, the better of its two ranks is retained, and the combined list is sorted again according to rank. In cases of conflict (equal ranks), compounds are placed in arbitrary order. For this union strategy to be successful, methods A and B should emphasize different aspects of similarity and should therefore retrieve different active molecules at top ranks. It is then likely that more hits will be found in the union rank list than in any individual rank list.

2. Selection of all molecules that receive high ranks with both method A and method B. Such a procedure is successful if a common set of active molecules is ranked high by both methods A and B, while the top-ranking inactive compounds differ. This intersection strategy will be followed if it is important to reduce the number of false positives - even at the cost of reducing the number of true positives.
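The two combination schemes can be sketched in a few lines of code. The following Python functions are an illustrative sketch only (the study does not publish code); compounds are assumed to be identified by arbitrary hashable ids, with rank 1 being the best.

```python
def union_ranking(ranks_a, ranks_b):
    """Union strategy: keep the better of a compound's two ranks and
    re-sort the combined list; tied compounds end up in arbitrary
    (here: insertion) order."""
    best = {}
    for ranks in (ranks_a, ranks_b):          # ranks: {compound: rank}, 1 = best
        for cpd, r in ranks.items():
            best[cpd] = min(r, best.get(cpd, r))
    return sorted(best, key=lambda c: best[c])

def intersection_ranking(ranks_a, ranks_b, cutoff):
    """Intersection strategy: keep only compounds ranked within
    `cutoff` by BOTH methods (e.g. cutoff = top 2% of the database)."""
    top_a = {c for c, r in ranks_a.items() if r <= cutoff}
    top_b = {c for c, r in ranks_b.items() if r <= cutoff}
    return top_a & top_b
```

For example, with ranks_a = {'x': 1, 'y': 2, 'z': 3} and ranks_b = {'y': 1, 'w': 2, 'z': 5}, the union list is ['x', 'y', 'w', 'z'] and the intersection at cutoff 2 is {'y'}.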

For the present test libraries, the union strategy was tested by counting the number of actives among the top 2% of the individual rank lists in comparison to those among the top 2% of the union rank list. On average, the union strategy leads to significantly more hits than the weaker individual method, but it is rarely superior to the better-performing method. Table 6.1 illustrates this observation for the ACE sublibrary. The union strategy thus does not retrieve more actives than the better individual method, but, since it is difficult to estimate in advance which method will perform better, it leads to more robust results. A generally valid recommendation is to combine one of the fingerprint-based methods with any of the other three algorithms.

An example of the application of the intersection strategy is given in Table 6.2, with the antihistamine 7 (Fig. 6.7) as a query molecule. Ctot is the

Tab. 6.1 Results of the union strategy for the set of ACE inhibitors. Table entries are the average numbers of active compounds contained in the top 2% (158 compounds) of the union rank lists. Values on the diagonal (combinations of each method with itself) are the average number of actives in the top 2% of the individual rank lists.

      FT   TP   CV   DF   UF
FT    19
TP    17    9
CV    18   12   10
DF    22   17   18   21
UF    27   23   24   28   32

Tab. 6.2 Results of the intersection strategy for query molecule 7 in Figure 6.7. Ctot is the total number of common molecules in the top 2% of both rank lists; Cact is the number of H1 antagonists contained therein. Iact is the number of active molecules contained in the Ctot top-ranking molecules of the library ranked according to the method given in the leftmost column. Diagonal elements of this table give the number of active molecules within the top 2% of the ranked database for each method (where Cact = Iact and Ctot = 158).

       FT               TP               CV               DF               UF
     Ctot Cact Iact   Ctot Cact Iact   Ctot Cact Iact   Ctot Cact Iact   Ctot Cact Iact
FT    158  13   13     19   4    5      8   3    3      12   6    4      16   7    5
TP     19   4    2    158   8    8      3   1    0       9   4    1      12   3    1
CV      8   3    0      3   1    0    158   5    5      11   3    0      13   3    0
DF     12   6    2      9   4    1     11   3    2     158   8    8      90   7    6
UF     16   7    2     12   3    2     13   3    2      90   7    5     158   8    8

number of molecules occurring in the top 2% of both rank lists; and Cact is the number of active molecules contained therein. From Table 6.2 it can be seen that the proportion of active molecules in the intersection lists is relatively high compared to the top 2% of the individual rank lists (values on the diagonal of Table 6.2). This comparison is not fair, however, since most of the active molecules should be found on the highest ranks of the individual rank lists. Considering only a single rank file, one would select only very few top-ranked compounds in order to keep the number of false positives to a minimum. Therefore, Cact is best compared to the number of actives among the Ctot top-ranked compounds of each individual rank list. This number is given as Iact in Table 6.2.
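The three quantities Ctot, Cact and Iact can be computed as follows. This is an illustrative sketch (not code from the study), assuming each rank list is given best-first and the set of actives is known.

```python
def intersection_stats(list_a, list_b, actives, frac=0.02):
    """Ctot, Cact and Iact for the intersection of two rank lists.

    list_a, list_b: full rank lists (best compound first).
    actives: set of ids of known active compounds.
    frac: fraction of the database defining the top cut (2% here).
    """
    n = round(len(list_a) * frac)             # e.g. 158 compounds here
    common = set(list_a[:n]) & set(list_b[:n])
    c_tot = len(common)                       # compounds in both top lists
    c_act = len(common & actives)             # actives among them
    # Iact: actives among the Ctot top-ranked compounds of list_a alone
    i_act = len(set(list_a[:c_tot]) & actives)
    return c_tot, c_act, i_act
```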

In the example in Table 6.2, Cact is larger than Iact for most combinations, which means that the intersection strategy identifies more active molecules than any individual algorithm. The efficiency of the intersection strategy is smallest for the combination of the two fingerprint methods DF and UF, because the two rank lists overlap significantly. For all other combinations, between 1/4 and 1/3 of the Ctot compounds are active. Figure 6.7 shows results of the combination of UF and FT. Seven active compounds (8-14) are contained within the intersection list. Compounds 13 and 14 are among the top 16 compounds of both the UF and the FT rank lists. In addition, the top 16 compounds of the FT rank list also contain compounds 12, 15 and 16. The latter two compounds are not on the intersection list, since they are on rank positions beyond 158 on the UF rank list. The example presented in Table 6.2 is representative for the data used in this study. On average, the intersection strategy significantly reduces the number of false positives. Best results are achieved if two methods are combined that perform well separately. The combination of FT and one of the fingerprint methods can be generally recommended.

Fig. 6.7 Results of the intersection strategy for the H1 receptor antagonist 7 as a query molecule. Compounds 8-14 are contained within the 16-membered intersection list of the UF and FT rank lists. Compounds 13 and 14, encircled by a dashed line, are within the top 16 molecules of both the FT and the UF rank lists, meaning that they are regarded as highly similar to the query molecule by both methods. The top 16 molecules of the FT list also include 12, 15 and 16.

Second test scenario: docking as a virtual screening tool

In this scenario, it is assumed that only the receptor structure and its binding site are known. Test databases are docked into the corresponding binding sites and ranked according to the score computed for each candidate molecule. The enrichment of active compounds on top ranks is analyzed and reasons for performance differences between the test targets are discussed.

Library generation

Five of the 11 sets of inhibitors described in Section 6.4 were used again in the second scenario, namely the thrombin, gelatinase A, HIV protease, HIV reverse transcriptase, and COX2 data sets. A set of 29 p38 MAP kinase inhibitors was assembled from PDB crystal structures and Ref. [100]. A set of 40 estrogen receptor ligands was taken from the Protherics web site [101]. These two data sets were converted to Sybyl mol2 format and protonation states were adjusted as described above.

Docking procedure

For each of the six targets, protein structures were retrieved from the PDB or the Roche in-house collection of protein structures. Binding pockets were defined manually using the interactive modeling program MOLOC developed at Roche. For thrombin the PDB complex 1dwd was selected and the water molecule adjacent to Tyr 228 in the S1 pocket included as part of the binding site. For gelatinase A, the only available X-ray structure was that of a proenzyme mutant (1ck7). Since only minor structural changes can be expected upon complexation and the active site region can be well superimposed onto other MMP structures, the structure was used in an "in silico activated" form. The N-terminal propeptide was removed and residue 404 mutated to Glu as in the wild-type enzyme. For HIV protease, the PDB complex 9hvp was selected. The flap water bound to Ile50A and Ile50B was optimized by means of the MAB force field [102] and hydrogen atoms were added by MOLOC. Docking was performed both neglecting and including this water molecule as part of the binding site. PDB structures were also selected for reverse transcriptase (1klm) and the estrogen receptor (1ere and 1err). For p38 MAP kinase as well as COX2, Roche in-house crystal structures were used.

For each target, the corresponding set of inhibitors was combined with the WDI subset and the resulting library docked into the binding site. All docking calculations were performed with the program FlexX using standard parameter settings, except where noted otherwise. Calculations were run in parallel on 10 SGI R12k processors, typical runs for 7600 compounds taking 10-15 hours of CPU time. The structure with the lowest score was saved for each compound and its score was used for database ranking.
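The ranking step described above (lowest score wins; compounds without a docking solution appended at the end, as noted below Table 6.3) can be sketched as follows. This is a minimal illustration, not the actual FlexX tooling.

```python
def rank_by_docking_score(scores):
    """Rank a docked library by its best (lowest) score per compound.

    scores: {compound id: lowest docking score, or None if no docking
    solution was obtained}. Failed compounds are appended to the end
    of the rank list in arbitrary order.
    """
    docked = [c for c, s in scores.items() if s is not None]
    failed = [c for c, s in scores.items() if s is None]
    return sorted(docked, key=lambda c: scores[c]) + failed
```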

6 Screening of Drug Databases

Tab. 6.3

Results of database ranking by means of docking for six targets. Table entries are the percentage of all active compounds retrieved within the top 2%, 5% and 10% of the ranked databases consisting of the WDI subset and one collection of active compounds.

target                                % actives within top
                                      2%      5%      10%
thrombin                              57      81      89
gelatinase A                          21      42      61
p38 MAP kinase                        17      24      55
COX2, standard                         0       0      13
COX2, lipo scoring                    13      19      25
HIV protease, no flap water            8       8      17
HIV protease, flap water              14      28      41
estrogen rec. (1err), FlexX           35      38      43
estrogen rec. (1err), DrugScore       35      45      63
HIV reverse transcriptase              5      10      14

Compounds for which no docking solution was obtained were appended to the sorted rank list in arbitrary order.

Discussion of docking results

Results of the docking calculations are summarized in Table 6.3, which contains the percentage of active compounds within the top 2, 5, and 10 percent of the ranked databases. Docking results for p38 MAP kinase are also graphically depicted in Figure 6.8. For thrombin, p38 MAP kinase and gelatinase A, excellent enrichment of active compounds is obtained, while for HIV protease results are moderate and for COX2 (standard scoring) and HIV reverse transcriptase the distribution of inhibitors within the rank lists does not deviate significantly from a random selection. In the following, reasons for the obvious performance differences are discussed in more detail.
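The enrichment figure tabulated in Table 6.3 (the percentage of all actives recovered within a given top fraction of the ranked database) can be computed with a few lines of Python; this is an illustrative sketch, not code from the study.

```python
def percent_actives_in_top(ranked, actives, top_frac):
    """Percentage of ALL active compounds found within the top
    `top_frac` of the ranked database (the quantity in Table 6.3)."""
    n_top = round(len(ranked) * top_frac)
    found = sum(1 for c in ranked[:n_top] if c in actives)
    return 100.0 * found / len(actives)
```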

For thrombin, an almost quantitative separation between known active inhibitors and the remaining database is achieved. Visual inspection of the structures reveals that all top-ranked molecules in the library contain a benzamidine group binding to Asp 189 at the bottom of the S1 pocket, which contributes significantly to the calculated score. The dominant influence of this charged interaction in achieving significant enrichment of thrombin inhibitors is easily demonstrated by artificially removing contributions of interactions to Asp 189 from the final score. The percentage of top-ranked

Fig. 6.8 Accumulated percentage of p38 MAP kinase inhibitors plotted versus percentage of the ranked database subset and a bar diagram showing the rank of each inhibitor as a horizontal line.

inhibitors then drops dramatically to 8.3% at 2% of the database (22.2% at 5%, 38.9% at 10%). This observation appears to be generally valid: wherever specific interactions (for example hydrogen bonds or metal contacts) are of importance for affinity and whenever these interactions are detected by the docking algorithm, significant enrichment of active compounds can be obtained. It is important to note, however, that for thrombin charged ligand groups are not necessarily required in the S1 pocket. Potent compounds with lipophilic S1 substituents, such as the Eli Lilly compound 17 and the Merck compound 18 shown in Figure 6.9 (S1 moieties inside the box), have been discovered. They are placed on relatively low ranks (between ranks 1200 and 1500) in the present docking study.

Specific interactions also play an important role in the case of gelatinase A and p38 MAP kinase. The binding site of gelatinase A is valley-shaped with a deep S1' pocket, which can accommodate lipophilic groups as large as biphenyl moieties. The active site is flanked on either side by beta strands allowing for hydrogen bonds to the ligand. These interactions and contacts of inhibitor carboxylate or hydroxamate groups with the catalytic zinc atom result in high ranks once the ligands are correctly placed.

Fig. 6.9 Two examples for thrombin inhibitors that do not possess a basic S1 moiety.

ATP binding sites of kinases are lipophilic clefts formed mainly by aliphatic amino-acid side chains. In the adenine-binding region, binding sites are narrow and can accommodate only planar conjugated systems. A special feature of the p38 MAP kinase is an extended, deeply buried lipophilic cavity adjacent to the adenine-binding region. The confined nature of this binding pocket allows FlexX to discard many WDI compounds solely based on shape criteria. One rim of the adenine binding pocket is marked by a beta strand orienting a backbone NH group into the binding site. Kinase inhibitors invariably form a hydrogen bond to this NH group and frequently another bond to one of the two flanking carbonyl groups. Again, the dominant contribution of the hydrogen-bond score to the total score is obvious. Inhibitor 19 (Figure 6.10), forming three hydrogen bonds to the protein, is placed on rank 20, although FlexX generates a binding mode contrary to crystallographic evidence: in the FlexX-generated mode the fluorinated phenyl ring is oriented towards the solvent instead of being located in the buried lipophilic pocket. Conversely, inhibitor 20 is placed on a very low rank (> 3000), since it forms only one hydrogen bond to the receptor.

Fig. 6.10 Schematic representation of inhibitor-protein hydrogen bonds in two classes of p38 MAP kinase inhibitors.

The COX2 binding site is a narrow, completely buried lipophilic cavity. At its entrance the side chains of Arg 120 and Glu 524 form a mobile ion pair. Some inhibitors form hydrogen bonds to the guanidinium group of Arg 120, others place lipophilic groups in the same region. Hydrogen bond formation is thus not a predominant feature of inhibitor binding. Accordingly, many WDI compounds with docking solutions involving several spurious hydrogen bonds with the protein are ranked better than the known COX2 inhibitors. When the FlexX solutions for the WDI + COX2 database are re-scored with a modified scoring function in which the weight of lipophilic interactions has been arbitrarily doubled (ΔGlipo = −0.34 kJ/mol Å) and charged hydrogen bonds receive no extra weight (charge factor = 1), an enrichment increase is observed ("lipo scoring" entry in Table 6.3).
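Re-scoring existing docking solutions with modified term weights, as done for the "lipo scoring" entry in Table 6.3, can be sketched as below. The term decomposition and names are illustrative assumptions, not the actual FlexX score implementation.

```python
def rescore(terms, lipo_weight=2.0, charge_factor=1.0):
    """Recombine per-solution score contributions with new weights.

    terms: hypothetical per-solution contributions in kJ/mol, e.g.
    'hbond' (neutral hydrogen bonds), 'ionic' (charged interactions),
    'lipo' (lipophilic contacts), 'rot' (rotatable-bond penalty).
    lipo_weight=2.0 doubles the lipophilic term; charge_factor=1.0
    gives charged hydrogen bonds no extra weight.
    """
    return (terms.get('hbond', 0.0)
            + charge_factor * terms.get('ionic', 0.0)
            + lipo_weight * terms.get('lipo', 0.0)
            + terms.get('rot', 0.0))
```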

The estrogen receptor binding site is a large lipophilic cavity with acceptor groups at either end that can form hydrogen bonds with ligand hydroxyl groups, as present, for example, in the agonists 21 and 22 or the antagonists 23 and 24 (Figure 6.11). For both agonists and antagonists, lipophilic interactions largely determine the binding energy. The majority of antagonists, however, differ from the agonists in an additional side chain bearing a tertiary amino group. This difference is reflected in the bound structures of the receptor. In the agonist-bound state the binding pocket is not accessible to solvent, whereas in the antagonist-bound state it opens up and allows the positively charged antagonist side chain to form a salt bridge with the carboxylate group of Glu 351. The WDI subset and the library of estrogen ligands were docked into both the agonist (1ere) and the antagonist (1err) forms of the receptor. FlexX scores obtained from both structures are plotted against each other in Figure 6.11. The narrow binding pocket and the relatively rigid nature of the ligands restrain possible binding modes significantly. Accordingly, it can be assumed that FlexX is capable of generating reasonable solutions likely to be in agreement with experiment. Therefore we can expect the present example to represent a valuable test for scoring functions. It is observed that antagonists are clearly separated from the WDI compounds, whereas the agonists receive moderately high ranks in the agonist structure and are ranked among the bulk of the WDI entries in the antagonist structure. The reason for this finding is the large contribution of the surface-exposed salt bridge formed with Glu 351 to the total score. In the antagonist form, many WDI molecules also form such a salt bridge, resulting in better scores for them; as in the thrombin and p38 MAP kinase examples discussed above, ligands not forming an equivalent salt bridge are not ranked as favorably.
Almost the same result as with the FlexX score is obtained with the ChemScore function [103, 104] by Protherics [101]. The new DrugScore function developed by Gohlke et al. [74] performs better in this situation. Using this scoring function, applied as described in Subsection 6.2.2.2, the results shown in Figure 6.11 are obtained. The agonists are separated significantly better from the WDI subset when docked into the agonist structure. More importantly, about half of the

Fig. 6.11 Plots of docking score in the antagonist form of the estrogen receptor (PDB code 1err) against docking score in the agonist form of the same receptor (PDB code 1ere) for 20 agonists, 20 antagonists and a WDI subset; one plot for the FlexX score and one for DrugScore. Compounds 21 and 22 are typical agonists, compounds 23 and 24 are typical antagonists of the estrogen receptor (for details see text).

agonists are also among the 10% top ranked molecules in the database when they are docked into the open, antagonist structure.

Finally, HIV protease and HIV reverse transcriptase are difficult targets for quite different reasons. Most of the HIV protease inhibitors are relatively large molecules that must be placed accurately into the tunnel-shaped active site to prevent clashes with protein atoms. The FlexX dihedral angle increments do not always permit such an accurate placement. Therefore, several HIV protease inhibitors were docked to regions outside the presumed active site. Furthermore, one must distinguish inhibitors binding to the structural water molecule in the flap region from others replacing it. Of the 41 inhibitors considered here, 12 replace the flap water and 29 do not. Enrichment data given in Table 6.3 were calculated separately for these two groups. In the case of HIV reverse transcriptase, considerable induced fit is observed for most of the inhibitors. This is reflected by the observation that the native ligand corresponding to the crystal structure used receives a rather high rank (11), while all other active compounds are distributed almost randomly among the ordered database. Furthermore, many inhibitors form water-mediated hydrogen bonds to the protein. Even if the position of essential water molecules can be predicted correctly, the question of how to score these contacts has yet to be solved. FlexX has an option to place water molecules during the construction of the ligand [63]. This feature was tested for the structure 1klm, but no improved enrichment was observed.

Conclusions and outlook

The first test scenario presented in this paper allows for the conclusion that fast similarity measures perform remarkably well when clustering compounds with similar biological properties. Performance increases once the results of searches based on different algorithms emphasizing different aspects of similarity are combined. However, a certain degree of chemical similarity expressed in common skeleton topology must be present for satisfactory performance. Compounds rated as diverse that still bind to a receptor in a similar mode can probably only be identified by means of 3-D superpositioning algorithms including full conformational analyses of each candidate, which is prohibitive for searches in large databases.

The identification of active compounds by means of docking, addressed in the second test scenario, is routinely successful for those targets and inhibitor classes forming a significant number of hydrogen bonds to the receptor. These directed interactions facilitate the generation of correct inhibitor binding modes and result in high ranks for active compounds. In cases where hydrogen bonding plays a minor role, scoring becomes difficult. Presumably, lipophilic contacts have to receive higher weight, and a more detailed description of lipophilic interactions in the scoring function should be advantageous. As demonstrated for the example of the estrogen receptor ligands, this appears to be reflected better in the recently developed DrugScore function [74]. More detailed investigations will have to show whether DrugScore and other knowledge-based scoring functions [105, 106] are generally superior to traditional empirical ones. A common observation among all screening runs described here is that top ranks are contaminated by structures showing geometrically unreasonable binding modes in the binding pocket. For example, these involve unsatisfied buried donor groups, multiple hydrogen bonds to solvent-exposed groups of the protein, or unoccupied lipophilic cavities at the receptor-ligand interface. One possible approach to eliminate such unlikely solutions is the development of additional penalty terms and further filters [107]. An interesting approach is the combination of several functions into a consensus scoring function to eliminate unreasonable solutions [108]. This strategy is very similar to the intersection approach described in Section 6.3.3. The placement and scoring of water-mediated interactions as well as an appropriate treatment of induced-fit phenomena remain on the list of unresolved problems. In addition, the uncertainty about protonation states has to be mentioned.
Depending on the original pKa value of a functional group in aqueous solution, significant pKa shifts can occur upon ligand binding, either for the ligand's functional groups or for the protein residues. This finding can be attributed to local changes of the dielectric conditions inside the binding pocket. With respect to docking, the change of a donor to an acceptor functionality due to a modified protonation state has important consequences. Accordingly, improved docking and scoring algorithms will have to use a more detailed and flexible description of protonation states.
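The consensus idea mentioned above (combining several scoring functions to weed out unreasonable solutions [108]) admits many variants. One of the simplest is rank averaging, sketched here as an assumption-laden illustration rather than the specific scheme of Ref. [108].

```python
def consensus_rank(rank_lists):
    """Average each compound's rank over several scoring functions and
    re-sort. Compounds missing from a list get a penalty rank one worse
    than the longest list; ties are broken by compound id."""
    penalty = max(len(r) for r in rank_lists) + 1
    positions = [{c: i + 1 for i, c in enumerate(r)} for r in rank_lists]
    compounds = set().union(*rank_lists)
    def avg(c):
        return sum(p.get(c, penalty) for p in positions) / len(positions)
    return sorted(compounds, key=lambda c: (avg(c), c))
```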

It has to be kept in mind that the groups of active molecules used in this study mostly comprise high-affinity ligands. However, at the outset of a new drug development project, one can optimistically expect only weakly binding ligands. Furthermore, whatever virtual screening tools are applied, they can only identify active compounds for a new target if they are contained in the libraries that are to be screened. Intelligent assembly of diverse virtual libraries is therefore an important prerequisite of screening.

Acknowledgments

We wish to thank Paul Gerber, Man-Ling Lee, Chiara Taroni, and Gisbert Schneider at Roche, who participated in the testing of the similarity search algorithms and Holger Gohlke for discussions and valuable help with the DrugScore function.

References

1 H. Kubinyi, Pharmazie 1995, 50, 647-662.

2 J. Kuhlmann, Int. J. Clin. Pharmacol. Ther. 1997, 35, 541-552.

3 R. Lahana, Drug Discovery Today 1999, 4, 447-448.

4 J. H. Van Drie, M. S. Lajiness, Drug Discovery Today 1998, 3, 274-283.

5 W. P. Walters, M. T. Stahl, M. A. Murcko, Drug Discovery Today 1998, 3, 160-178.

6 Available Chemicals Directory, MDL Information Systems Inc., San Leandro, California, USA.

7 World Drug Index, 2/96, Derwent Information 1996.

8 MDDR, MACCS Drug Data Report, MDL Information Systems Inc., San Leandro, California, USA.

9 Beilstein Database, Beilstein Informationssysteme GmbH, Frankfurt, Germany.

10 J. Caldwell, I. Gardner, N. Swales, Toxicol. Pathol. 1995, 23, 102-114.

11 C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, Adv. Drug Delivery Rev. 1997, 23, 3-25.

12 D. E. Clark, S. D. Pickett, Drug Discovery Today 2000, 5, 49-58.

13 J. Sadowski, C. Rudolph, J. Gasteiger, Tetrahedron Comput. Methodol. 1990, 3, 537-547.

14 Corina 2.1, Molecular Networks GmbH Computerchemie, Erlangen 1998.

15 Concord, Tripos Associates Inc., St. Louis, Missouri, USA.

16 E. M. Landau, Biomed. Health Res. 1998, 20, 23-38.

17 M. A. Johnson, G. M. Maggiora, Concepts and Applications of Molecular Similarity, John Wiley & Sons, New York 1990.

18 P. M. Dean, Molecular similarity in drug design, Chapman & Hall, London 1995.

19 G. M. Downs, P. Willett, in K. B. Lipkowitz, D. B. Boyd (Eds.): Reviews in Computational Chemistry, Vol. 7, VCH Publishers, New York 1996, p. 1-57.

20 R. D. Brown, Y. C. Martin, J. Chem. Inf. Comput. Sci. 1996, 36, 572-584.

21 T. Pötter, H. Matter, J. Med. Chem. 1998, 41, 478-488.

22 H. Matter, M. Rarey, in G. Jung (Ed.): Combinatorial Organic Chemistry, Wiley-VCH, New York 1999, p. 409-439.

23 M. F. Lynch, Screening large chemical data files, Ellis Horwood, Chichester 1975.

24 R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan, J. Chem. Inf. Comput. Sci. 1987, 27, 82-85.

25 Daylight Software Manual, Daylight Inc., Mission Viejo, California, USA.

26 MACCS II, MDL Information Systems Inc., San Leandro, California, USA.

27 Unity Chemical Information Software, Version 4.0, Tripos Associates Inc., St. Louis, Missouri, USA.

28 D. R. Flower, J. Chem. Inf. Comput. Sci. 1998, 38, 379-386.

29 P. Willett, J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.

30 R. D. Brown, Y. C. Martin, J. Chem. Inf. Comput. Sci. 1997,

31 G. Moreau, P. Broto, Nouv. J. Chimie 1980, 4, 359-360.

32 G. Schneider, W. Neidhart, T. Giller, G. Schmid, Angew. Chem. Int. Ed. 1999, 38, 2894-2896.

33 P. R. Gerber, Proceedings of IV Girona Seminar on Molecular Similarity, Girona, Spain, July 5-7, 1999.

34 P. R. Gerber, K. Müller, J. Comput.-Aided Mol. Design 1994, 9, 251-268.

35 P. R. Gerber, J. Comput.-Aided Mol. Design 1998, 12, 37-51.

36 M. Rarey, J. S. Dixon, J. Comput.-Aided Mol. Design 1998, 12, 471-490.

37 Y. C. Martin, M. G. Bures, R. D. Brown, Pharm. Pharmacol. Commun. 1998, 4, 147-152.

38 P. A. Bath, A. R. Poirrette, P. Willett, J. Chem. Inf. Comput. Sci. 1994, 34, 141-147.

39 Y. C. Martin, M. G. Bures, P. Willett, in K. B. Lipkowitz, D. B. Boyd (Eds.): Reviews in Computational Chemistry, Vol. 1, VCH Publishers, New York 1990, p. 213-263.

40 A. C. Good, J. S. Mason, in K. B. Lipkowitz, D. B. Boyd (Eds.): Reviews in Computational Chemistry, Vol. 7, VCH Publishers, New York 1996, p. 67-117.

41 A. C. Good, T. J. A. Ewing, D. A. Gschwend, I. D. Kuntz, J. Comput.-Aided Mol. Design 1995, 9, 1-12.

42 S. D. Pickett, J. S. Mason, I. M. McLay, J. Chem. Inf. Comput. Sci. 1996, 36, 1214-1223.

43 J. S. Mason, I. Morize, P. Menard, D. L. Cheney, D. L. Hulme, R. F. Labaudiniere, J. Med. Chem. 1999, 42, 3251-3264.

44 S. K. Kearsley, G. M. Smith, Tetrahedron Comput. Methodol. 1990, 3, 615-630.

45 M. D. Miller, R. P. Sheridan, S. K. Kearsley, J. Med. Chem. 1999, 42, 1505-1514.

46 C. Lemmen, T. Lengauer, G. Klebe, J. Med. Chem. 1998, 41, 4502-4520.

47 D. A. Thorner, D. J. Wild, P. Willett, P. M. Wright, in H. Kubinyi, G. Folkers, Y. C. McMartin (Eds.): 3D QSAR in drug design: ligand protein interactions and molecular similarity, Vol. 9/10/11, Kluwer/Escom, Dordrecht 1998, p. 301-320.

48 C. Lemmen, T. Lengauer, in K. Gundertofte, F. S. Jorgensen (Eds.): Molecular modelling and prediction of bioactivity, Proceedings of the 12th European symposium on quantitative structure-activity relationships, Plenum Press, New York 1999, 169-174.

49 C. Lemmen, T. Lengauer, J. Comput.-Aided Mol. Design 2000, 14, 215-232.

50 L. M. Kauvar, D. L. Higgins, H. O. Villar, J. R. Sportsman, A. Engvist-Goldstein, R. Bukar, K. E. Bauer, H. Dilley,

51 H. Briem, I. D. Kuntz, J. Med. Chem. 1996, 39, 3401-3408.

52 J. M. Blaney, J. S. Dixon, Perspectives in Drug Discovery and Design, Vol. 1, Kluwer/Escom, Dordrecht 1993, p. 301-319.

53 I. D. Kuntz, E. C. Meng, B. K. Shoichet, Acc. Chem. Res. 1994, 27, 117-123.

54 P. M. Colman, Curr. Op. Struct. Biol. 1994, 4, 868-874.

55 T. P. Lybrand, Curr. Op. Struct. Biol. 1995, 5, 224-228.

56 G. Jones, P. Willett, Curr. Op. Biotechnology 1995, 6, 652-656.

57 T. Lengauer, M. Rarey, Curr. Op. Struct. Biol. 1996, 6, 402-406.

58 H. Kubinyi, Curr. Op. Drug Disc. Dev. 1998, 1, 16-27.

59 M. Rarey, S. Wefing, T. Lengauer, J. Comput.-Aided Mol. Design 1996, 10, 41-54.

60 M. Rarey, B. Kramer, T. Lengauer, G. Klebe, J. Mol. Biol. 1996, 261, 470-489.

61 M. Rarey, B. Kramer, T. Lengauer, J. Comput.-Aided Mol. Design 1997, 11, 369-384.

62 B. Kramer, M. Rarey, T. Lengauer, Proteins 1997, Suppl. 1, 221-225.

63 M. Rarey, B. Kramer, T. Lengauer, Proteins 1999, 34, 17-28.

64 M. Rarey, B. Kramer, T. Lengauer, Bioinformatics 1999, 15, 243-250.

65 B. Kramer, M. Rarey, T. Lengauer, Proteins 1999, 37, 228-241.

66 G. Klebe, F. Dullweber, H.-J. Bohm, in R. B. Baffa (Ed.): Thermodynamics of the Drug-Receptor Interaction, John Wiley & Sons, Inc., New York, in press.

67 Ajay, M. A. Murcko, J. Med. Chem. 1995, 38, 4953-4967.

68 T. I. Oprea, G. R. Marshall, in H. Kubinyi, G. Folkers, Y. C. McMartin (Eds.): 3D QSAR in drug design: ligand protein interactions and molecular similarity, Vol. 9/10/11, Kluwer/ Escom, Dordrecht 1998, p. 3-17.

69 R. M. A. Knegtel, P. D. J. Grootenhuis, in H. Kubinyi, G. Folkers, Y. C. Martin (Eds.): 3D QSAR in drug design: ligand protein interactions and molecular similarity, Vol. 9/10/11, Kluwer/Escom, Dordrecht 1998, p. 99-114.

70 J. D. Hirst, Curr. Op. Drug Disc. Dev. 1998, 1, 28-33.

71 H.-J. Bohm, M. Stahl, Med. Chem. Res. 1999, 9, 445-462.

72 J. R. H. Tame, J. Comput.-Aided Mol. Design 1999, 13, 99-108.

73 H.-J. Bohm, J. Comput.-Aided Mol. Design 1994, 8, 243-256.

74 H. Gohlke, M. Hendlich, G. Klebe, J. Mol. Biol. 2000, 295, 337-356.

75 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi, J. Mol. Biol. 1977, 112, 535-542.

76 C. A. Fink, Exp. Opin. Ther. Patents 1996, 6, 1147-1164.

77 P. K. Chakravarty, Exp. Opin. Ther. Patents 1995, 5, 431-458.

78 L. Costantino, G. Rastelli, G. Cignarella, P. Vianello, D. Barlocco, Exp. Opin. Ther. Patents 1997, 7, 843-858.

79 N. Ashizawa, T. Aotsuka, Drugs of the Future 1998, 5, 521-529.

80 J. S. Carter, Exp. Opin. Ther. Patents 1997, 8, 21-29.

81 R. W. Friesen, C. Brideau, C. C. Chan, S. Charleson, D. Deschenes, D. Dube, D. Ethier, R. Fortin, J. Y. Gauthier, Y. Girard, R. Gordon, G. M. Greig, D. Riendeau, C. Savoie, Z. Wang, E. Wong, D. Visco, L. J. Xu, R. N. Young, Bioorg. Med. Chem. Lett. 1998, 8, 2777-2782.

82 A. S. Kalgutkar, Exp. Opin. Ther. Patents 1999, 9, 831-849.

83 R. Aslanian, J. J. Piwinski, Exp. Opin. Ther. Patents 1997, 7, 201-207.

84 R. A. Crusciel, K. R. Romines, Exp. Opin. Ther. Patents 1997,

85 J. R. Proudfoot, Exp. Opin. Ther. Patents 1998, 8, 971-982.

86 C. H. Gleiter, H.-P. Volz, Exp. Opin. Invest. Drugs 1996, 5, 409-419.

87 S. Jegham, P. George, Exp. Opin. Ther. Patents 1998, 8, 1143-1150.

88 R. P. Beckett, M. Whittaker, Exp. Opin. Ther. Patents 1998,

89 M. R. Wiley, M. J. Fisher, Exp. Opin. Ther. Patents 1997, 7, 1265-1282.

90 P. E. J. Sanderson, A. M. Naylor-Olsen, Curr. Med. Chem. 1998, 5, 289-304.

91 A. K. Chakraborty, H. K. Majumder, C. P. Hodgson, Exp. Opin. Ther. Patents 1994, 4, 655-668.

92 D. Weininger, J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.

93 M. Clark, R. D. Cramer III, N. Van Opdenbosch, J. Comp. Chem. 1989, 10, 982-1012.

94 R. A. Jarvis, E. A. Patrick, IEEE Transactions on Computers 1973, C22, 1025-1034.

95 Daylight Chemical Information Systems, San Leandro, California, USA.

96 M. Clark, R. D. Cramer III, N. Van Opdenbosch, J. Comp. Chem. 1989, 10, 982-1012.

97 Sybyl molecular modeling software, Version 6.2, Tripos Associates Inc., St. Louis, Missouri, USA.

98 R. R. Sokal, F. J. Rohlf, Biometry: The principles and practice of statistics in biological research, W. H. Freeman and Company, New York 1995.

99 J. C. Hsu, Multiple comparisons, theory and methods, Chapman & Hall, London 1996.

100 G. J. Hanson, Exp. Opin. Ther. Patents 1997, 7, 729-733.

101 http://www.protherics.com/crunch/

102 P. R. Gerber, K. Müller, J. Comput.-Aided Mol. Design 1995, 9, 251-268.

103 M. D. Eldridge, C. W. Murray, T. R. Auton, G. V. Paolini, R. P. Mee, J. Comput.-Aided Mol. Design 1997, 11, 425-445.

104 C. A. Baxter, C. W. Murray, D. E. Clark, D. R. Westhead, M. D. Eldridge, Proteins 1998, 33, 367-382.

105 I. Muegge, Y. C. Martin, J. Med. Chem. 1999, 42, 791-804.

106 J. B. O. Mitchell, R. A. Laskowski, A. Alex, J. M. Thornton, J. Comp. Chem. 1999, 20, 1165-1177.

107 M. Stahl, H.-J. Bohm, J. Mol. Graphics Mod. 1998, 16, 121-132.

108 P. S. Charifson, J. J. Corkery, M. A. Murcko, W. P. Walters, J. Med. Chem. 1999, 42, 5100-5109.

Future Trends

Thomas Lengauer

The previous Chapters of this two-volume series have gone into substantial technical detail on many aspects and sub-problems of bioinformatics, especially as they pertain to pharmaceutical applications. This concluding Chapter addresses the following questions:

1. How does the progress in genomics and bioinformatics change our view on biology and medicine?

2. What are the main coming challenges for bioinformatics?

3. What are the well-founded expectations and intrinsic limitations of bioinformatics?

We will address each question in turn and, in line with the purpose of this book, confine our remarks to the medical and pharmaceutical field.

How does the progress in genomics and bioinformatics change our view on biology and medicine?

The ability to sequence genomes has revolutionized our view of biological systems. Eric Lander [1] drew the rather fitting comparison with the discovery and consolidation of the periodic table of the elements in chemistry about 130 years ago. That process reduced the seemingly limitless variety of forms of matter to a rather small set of, at the time, fewer than 100 elements. This insight, in some sense, completed the vocabulary of chemistry and set the starting point of a remarkable development of chemistry towards a quantifiable discipline, in which rational approaches gradually gained ground from the alchemist's approach of trial and error and in which understanding processes became a significant aspect of discovery.

Bioinformatics - From Genomes to Drugs. Volume II: Applications. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2

It seems that we are at a comparable point in biology these days. Granted, the vocabulary is much larger: genomes have up to billions of base pairs, complex organisms have tens of thousands of genes, and the number of different proteins probably goes into the millions if post-translational modifications are taken into account. Yet the finiteness of the genome is one of the central insights that come with the new biology. If the genome does not have a certain gene, that gene simply is not there; the organism has to, and can, get by without it and achieve the respective goals by other means. Thus, insight into which genes do not exist in a genome can be a starting point for entirely new lines of biological reasoning.

The other central insight that is afforded by whole genome sequencing is the idea of a genomic map of the organism. For the first time, we can globally lay out the structure and components of a genome. The rough locations of genes, of repeat regions, of gene duplications become available; and they can be studied within a single genome and even in comparison with other genomes. This opens the time dimension of evolutionary development of the species, which is another rich source of essential biological information. Expression data add another dimension of information that can distinguish between cell types and cell states. SNP data allow for distinguishing between individuals of a species.

This global view affords us two new options. First, we can try to network the detailed local knowledge that we have on a number of biological subsystems. Second, we can try to address biological questions without a preset hypothesis in mind. This approach is quite new to biology, where experiments traditionally are designed to validate or falsify hypotheses. Now, the ability to screen through the complete set of genomic data affords what has recently been called "discovery-based research": research that does not aim at validating or falsifying an a priori hypothesis but mines the genome-wide data in search of a suitable hypothesis. Discovery-based research requires not only a comprehensive set of data but also highly developed and very strong computational modeling and analysis methods that allow for forming relevant hypotheses based on the data. For reasons that we will discuss later in this Chapter, bioinformatics is not yet very far along the road of providing such analysis methods. As an example, note that protein structure prediction still needs to be equipped with a set of hypotheses - in the form of previously resolved protein structures that are used as templates - in order to make accurate predictions (see Chapters 5 and 6 of volume 1). There are other examples showing that providing hypotheses is essential for rendering bioinformatics analyses effective, see e.g. [2].

The gradual change from hypothesis-driven to discovery-based research also weakens the dominance of reductionism that we have observed in biology in the past. The reductionist approach thinks of a biological system as decomposed into subsystems that are studied independently. The finer the decomposition, the smaller the subsystem and the easier it is to form hypotheses on it and test them experimentally. As we are looking at the genomic data, we begin to see how important the interdependencies, links and communications between the various subsystems really are. The systemic view of the organism gains increasing importance; and this has strong repercussions on the view of bioinformatics. Bioinformatics has to become the basic tool for facilitating this integrated view on the organism. This means that bioinformatics has to offer integrated software and data, an issue that we will elaborate on further in this Chapter.

We have to develop the experimental and bioinformatics methods needed to render discovery-based research effective. This will be an incremental and continuous process that will take a long time after the revolutionary event of providing the human genome sequence. The consequences for biology will be dramatic, and they are already taking effect. Biology is becoming an information-handling discipline. Bioinformatics methods have already become the third major pillar of this scientific field, besides laboratory experimentation and theory. All biologists will need a basic understanding of this part of biology in order to do successful science. This has been recognized to different extents in different countries. Several countries have started sizeable projects in bioinformatics education. In some of them, the resulting explosive growth of the field is putting the general capabilities of the system of scientific education to the test.

Thus computer-based information handling will be a basic aspect of biology. At the same time, the biological view of an organism will increasingly become that of an information-handling system. After all, structure and dynamics are what characterize biological systems. Organism-wide handling of information - be it in gene-regulatory systems, in the immune system or in the nervous system - will be one of the main topics of research in the coming years. This is all the more the case as many diseases are based on malfunctions of these systems.

In summary, the high-level attention that has been given to the biological and informatics parts of sequencing the human genome is justified. At the beginning of the postgenomic era we are experiencing a paradigm shift in biology. The field is preparing for it, and part of this preparation is to inject the field with information-handling expertise. This book contributes to this effort.

What are the main coming challenges for bioinformatics?

There are several different avenues for approaching this question. The first is via the new kinds of experimental data that are showing up on the horizon and that have to be complemented with computational methods. The second is via the kinds of methods that allow for more accurate analysis of the existing data. The third is via finding integrated views on the biological data in order to extract more global insights into the workings of an organism.

New experimental data

In former days (up to the beginning of the nineties), the bulk of bioinformatics work was directed at analyzing and comparing - mostly aligning - biological sequences: DNA (mostly genes), RNA or protein (see Chapters 2 and 3 of volume 1). Databases like GenBank, the EMBL nucleotide sequence database, and SWISS-PROT were the central publicly available data repositories. Work on protein structure snowballed in the early to mid nineties, as the number of significantly different protein structures deposited in the Protein Data Bank grew into the low hundreds [3, 4] and thus began to allow for the derivation of usable potential functions for protein structure prediction (see Chapters 5 and 6 of volume 1). The first whole bacterial genomes appeared in the mid-1990s, and immediately - even with the first publication of a genome sequence - attempts at genome-wide annotation with bioinformatics methods were made [5]. In close succession, different genomes were compared with each other, see e.g. [6]. Still today, the annotation of whole genomes and their comparison remains a dominating challenge that drives the development of bioinformatics methods.

The next data to enter the scene were mRNA expression data (see Chapter 5 of volume 2). These data appeared in appreciable volumes in the mid nineties, first as ESTs [7] and then in various forms of microarrays [8]. It is fair to say that, today, from the application point of view, a large part of scientific and industrial bioinformatics is preoccupied with analyzing these data. Transcending simple clustering approaches, the methods developed are now entering their second generation. Yet expression data analysis is still hampered by three issues:

• The measured data are quite noisy. (Technological advances may help with that.) Normalization of expression data and comparability between different experiments are difficult issues. (This problem has to be addressed by both experimental and computational advances.)

• Appropriate methods for interpreting the data have yet to be developed. (Methods are being refined.)

• Since we are measuring at the mRNA level, the expression levels contain only indirect information about the concentration of the final protein product. Therefore, links between expression levels and gene function can be expected to be indirect. (This problem may be alleviated partially with time, as we deepen our understanding of the dependencies between mRNA expression and protein function. It will probably not go away completely, because the dependencies are expected to be very involved.)
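The normalization difficulty mentioned in the first bullet can be illustrated with a minimal sketch. Median scaling, shown below on hypothetical intensity values, is only one of several schemes in use:

```python
# Median normalization of two hypothetical expression arrays:
# rescale array B so that its median matches that of array A,
# making intensities comparable before computing log-ratios.
import math

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def median_normalize(reference, target):
    """Scale `target` so its median equals that of `reference`."""
    factor = median(reference) / median(target)
    return [x * factor for x in target]

array_a = [120.0, 300.0, 80.0, 1500.0, 95.0]
array_b = [240.0, 620.0, 150.0, 3100.0, 180.0]   # ~2x brighter overall

array_b_norm = median_normalize(array_a, array_b)
# After normalization, log2-ratios reflect differential expression
# rather than overall scanner gain; ratios[0] is 0.0 for gene 1.
ratios = [math.log2(b / a) for a, b in zip(array_a, array_b_norm)]
```

This corrects only for a global multiplicative bias between two arrays; intensity-dependent effects require more elaborate schemes.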

It is hard to predict the future of mRNA expression measurement. What we can be sure of is that proteomics will gain ground from genomics as increasing amounts of data on expression levels of mature proteins become available. By the same token, mRNA expression measurement will probably continue to have its applications, e.g. in diagnostics, especially since genomics experiments are, and can be expected to remain, cheap and easy compared with proteomics experiments.

We see two major emerging frontiers for new kinds of molecular data. The first is proteomics (see Chapter 4 of volume 2) and metabolomics. With a combination of 2D gel, mass spectrometry, protein microarray and yeast two-hybrid methods, a large amount of protein sequence, expression, and interaction data will be produced at a cell-wide level. On the one hand, bioinformatics has to address the challenge of interpreting these data. On the other hand, the protein interaction data in particular will provide an interesting basis for probing deeper into the details of regulatory networks. Such data are collected in dedicated protein interaction databases such as DIP [9, 10] and BIND [11].

The first metabolic databases, among them KEGG [12], ENZYME [13], EcoCyc [14], and BRENDA [15], were started in the mid-1990s. Today they harbor data on well over 100 000 different catalyzed reactions. At first these databases had one record per catalyzed metabolic reaction; then a graphic view was added showing the topology of pathways and metabolic cycles. Recently, the databases are increasingly being equipped with analysis tools that allow for pathway-oriented navigation and searching.

What is still largely missing are data on enzyme kinetics and metabolite flow. Currently, several groups are gearing up for projects that perform cell-wide measurements of metabolites. These data are intended to provide the foundation for cell-wide modeling of metabolic networks. The machinery for simulating metabolic networks is in more than just a nascent state of development: linear and nonlinear approaches have been introduced [16, 17]. Originally, these methods were only exercised on comparatively small networks, such as glycolysis or the pentose-phosphate pathway [18]. Today, metabolic networks are analyzed on an organism-wide scale, at least with respect to their topology. Linear and convex methods of flux analysis have been applied for this purpose. This includes the analysis of the minimal substrate requirements needed to support the production of the metabolic products essential for the organism, and of the criticality of certain genes for this purpose [19]. The software system METATOOL is a generally available basis for this kind of analysis [20]. Nonlinear analysis for the simulation of enzyme kinetics, as offered by the software system GEPASI [21, 22], is also gearing up for this kind of scale-up. The lack of kinetic data has curbed the development of this field so far. This obstacle is now on the verge of being removed, which poses an acute demand on the development of mathematical methods, especially since the field of modeling metabolic networks reaches deeply into the theory of nonlinear systems.
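The topology-level linear analysis mentioned above rests on the steady-state condition S · v = 0, where S is the stoichiometric matrix of the internal metabolites and v a vector of reaction fluxes. A minimal sketch on a hypothetical three-metabolite toy pathway (an illustration, not an example taken from METATOOL):

```python
# Steady-state check for a toy linear pathway A -> B -> C with
# uptake of A and excretion of C. Rows: internal metabolites A, B, C;
# columns: reactions v1 (uptake of A), v2 (A->B), v3 (B->C), v4 (C out).
S = [
    [ 1, -1,  0,  0],   # A: produced by v1, consumed by v2
    [ 0,  1, -1,  0],   # B: produced by v2, consumed by v3
    [ 0,  0,  1, -1],   # C: produced by v3, consumed by v4
]

def is_steady_state(S, v, tol=1e-9):
    """True if S . v = 0, i.e. no internal metabolite accumulates."""
    return all(abs(sum(row[j] * v[j] for j in range(len(v)))) < tol
               for row in S)

# Any flux vector with equal rates through the chain is a steady state.
print(is_steady_state(S, [2.0, 2.0, 2.0, 2.0]))  # True
print(is_steady_state(S, [2.0, 1.0, 1.0, 1.0]))  # False: A accumulates
```

Tools like METATOOL go further and enumerate the elementary routes through such a network, but the balance condition above is the common starting point of all the linear approaches.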

Databases on regulatory relations and pathways are in an earlier stage of development than those on metabolic relations. The first databases on regulatory relations appeared in the late nineties. Today databases such as Transpath [23] and CSNDB [24] contain a few thousand such relations.

The second kind of data that we are starting to be confronted with in large numbers is data on genetic variation. Stemming from sequencing projects, these data mostly take the form of single-nucleotide polymorphisms (SNPs, see Chapter 3 of volume 2). The widespread availability of data on genetic variation will prompt the methodological sides of bioinformatics and statistical genetics to merge. These two disciplines have not had much interaction in the past; bringing them together will be both challenging and very rewarding. It will provide a basis for pharmacogenetics - where we try to analyze and eventually predict the differential effect of a drug on different patients based on the genotypic differences among them - and pharmacogenomics - where we aim at selecting the right drug therapy among different alternatives for a single patient, again involving patient genotype data. A first large-scale project underway in this context is the deCODE project, which genotypes large parts of the Icelandic population with the aim of understanding the correlation between drug action and effectiveness on the one hand and genotype on the other, as well as obtaining more insight into the genetic basis of complex multi-factorial diseases such as diabetes and coronary syndromes [25]. A necessary ingredient for the latter task is ample genealogical data, which the people of Iceland can supply. Furthermore, a certain genetic homogeneity, such as the Icelandic population is hoped to exhibit, should help in uncovering such correlations. This project has stirred considerable controversy and has raised not only scientific but also political and ethical issues [26].

New analysis methods

The picture of what a bioinformatician does used to be dominated by the BLAST search. (Well, maybe it is more PSI-BLAST today.) In short, a bioinformatician was envisioned doing turn-key analysis of biological data using a very small set of programs, sometimes even a single program. In recent years bioinformatics has become much more than that and, at the same time, has not yet reached the level of expectation with which biologists confront it.

The great variety of bioinformatics tools and data is best surveyed in Chapter 1 of volume 2. There are literally dozens of analysis problems for data from hundreds of molecular biology databases. The whole picture can only take shape if we exercise these methods on these data in an integrated fashion. Bioinformatics has become a very complex task.

At the same time, both the methods and the data are at an early stage of development. Methods make analyses and predictions whose outcomes are not certain to be correct. Databases are more or less well curated; in general, they contain many undetected contradictions and errors. Improving this situation encompasses method refinement, curation of databases, and the integration of heterogeneous data and methods. The first is a mostly scientific topic; the latter two have a sizeable organizational component.

7.2.2.1 Method refinement

Today, analysis and prediction methods mostly have a statistical flavor. They are trained on classified data and make new classifications or predictions based on statistical models. In some sense, all Chapters of volume 1 present such methods. For instance, the methods for homology-based protein structure prediction in Chapters 5 and 6 of volume 1 learn, from a set of observed structures, rules that predict similar structures.

In fact, it is reasonable to say that this way of going about things is characteristic of today's method spectrum of bioinformatics and sets it off against other fields of computer-based modeling, which use other approaches. Due to the discrete quality of genomic data and the kinds of statistics involved, the analysis methods are mostly discrete, i.e. relying on models that are based on sets of integers or finite objects such as graphs. There are exceptions to this rule, e.g. in the areas of analyzing expression patterns, simulating metabolic pathways, and detailed modeling of energies. But the respective fields and methods can be viewed as peripheral to bioinformatics as we experience the field today. (This may change in the future.)

If we can clean up the structural database and extend the set of data on known structures, we can increase the accuracy of structure predictions. We can also do so by advancing the methods of statistical learning that underlie the prediction machinery. The progress in both areas will add up to improve the predictions.

We cannot be sure of the answers provided by statistical methods. But if they are developed appropriately, they provide a significance estimate, such as the P-value that was the main claim to fame of the BLAST program and its real innovation. This value tells us how much we can believe an analysis or prediction - in the case of BLAST, a local sequence match witnessed by an alignment. Recently, several other bioinformatics analysis methods have been equipped with theoretically founded or heuristically fitted significance estimates (see Chapters 2 and 6 of volume 1). Only a few bioinformatics tools come with this kind of significance estimate today, and we need more of them.
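For local alignment scores, BLAST's significance estimate follows Karlin-Altschul statistics: the expected number of chance hits with score at least S between sequences of lengths m and n is E = K·m·n·e^(-λS), and the P-value is P = 1 - e^(-E). A minimal sketch; the values of K and λ below are illustrative placeholders, since the true values depend on the scoring matrix and gap penalties:

```python
import math

def blast_evalue(score, m, n, K=0.13, lam=0.32):
    """Karlin-Altschul expected number of chance local alignments
    with score >= `score` between sequences of lengths m and n.
    K and lam depend on the scoring system; the defaults here are
    illustrative placeholders, not official BLAST parameters."""
    return K * m * n * math.exp(-lam * score)

def blast_pvalue(score, m, n, K=0.13, lam=0.32):
    """Probability of at least one such chance alignment."""
    return 1.0 - math.exp(-blast_evalue(score, m, n, K, lam))

e = blast_evalue(60, m=350, n=1_000_000)
p = blast_pvalue(60, m=350, n=1_000_000)
# Higher scores give exponentially smaller E-values, and for small E
# the P-value is approximately equal to E.
```

The point of the formula is exactly the one made above: it turns a raw alignment score into a statement about how much the match should be believed.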

Establishing the high significance of a prediction still does not explain the prediction. One disadvantage of learning from experience is that such explanations are hard to come by. In order to explain things, the model has to be based on some kind of theory of the structure or process to be predicted. Models that are the basis of learning from experience can have different degrees of diagnosability. For example, a general feed-forward neural network has no discernible internal structure that reflects biological fact. Thus it is hard to explain why the network makes the predictions that it does; it just happens to be trained that way. On the other hand, a profile hidden Markov model (see Appendix) reflects the internal structure of a biomolecular sequence. The parameters that are trained in the HMM show where secondary structure elements are most likely to be located and where insertions of loops are to be expected. So hidden Markov models have a higher degree of diagnosability than neural nets. The force fields that are used in many molecular dynamics simulations are hardly trained at all. Their structure is derived from (some simplification of) the structure of the basic physical laws, and the respective parameters are often deduced from quantum-mechanical simulations of small molecules; mostly, only a few parameters remain to be calibrated. Still, there is some heuristic element to the simplification of the structure of the force laws, e.g. the elimination or grouping of multi-body terms. Thus, even ab initio calculations have a heuristic element. Yet it is evident that, as we inject more solidly based theory into the analysis and reduce the purely heuristic or phenomenological aspects of the model, we can hope to better explain what the model is predicting.
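The diagnosability of a profile HMM can be illustrated with a toy calculation: the match-state emission probabilities estimated from an alignment (hypothetical sequences below; no pseudocounts or background correction) directly reveal the conserved positions, in a way the weights of a trained neural network would not:

```python
# Toy illustration of why profile-HMM parameters are interpretable:
# per-column emission probabilities estimated from a hypothetical
# alignment read off directly as conserved motif positions.
from collections import Counter

alignment = [   # hypothetical aligned protein fragments
    "GKSTA",
    "GKSSV",
    "GKSTI",
    "GRSTA",
]

def column_emissions(alignment):
    """Relative frequency of each residue at each alignment column,
    i.e. maximum-likelihood match-state emission estimates."""
    ncol = len(alignment[0])
    profile = []
    for j in range(ncol):
        counts = Counter(seq[j] for seq in alignment)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile

profile = column_emissions(alignment)
# Columns that emit a single residue with probability 1.0 are the
# conserved positions: the model itself tells us where they are.
conserved = [j for j, col in enumerate(profile) if max(col.values()) == 1.0]
```

In a full profile HMM the insert-state transition probabilities carry the complementary information, namely where loop insertions are tolerated.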

Bioinformatics has a long way to go along this road. As we have mentioned above, even the estimation of the significance of a prediction is a rare event in bioinformatics analyses today. Putting more theory into the model is going to be a long and laborious process that relies on the incremental progress that is going to be made in the understanding of biological systems. This process is going to require the advances of the whole field of biology. Put concisely, we are at a very early scientific stage for theorizing about biological systems. Propelling this process is going to be one of the most exciting points of interaction between experimental and computer-based aspects of the field.

7.2.2.2 Curating databases

This is a very difficult issue; many people even despair of it. First, we cannot start a new world from scratch: we have to live with the imperfect databases of today and manage their cleanup as best as possible. Simultaneously, we have to cope with their dramatic growth. Second, the problem
