There is much discussion in the literature about the relevance of different descriptors and diversity selection processes [28-35]. Although 3D descriptors can encode more information about molecules, 2D descriptors have been shown to perform better at separating biologically active molecules from inactive ones. Hierarchical clustering in combination with MACCS 2D descriptors performed best on several different datasets of enzyme assays, high-throughput screening results, and activities in a monoamine oxidase assay. Jarvis-Patrick clustering proved rather ineffective, since it generated either singleton clusters of actives or very large clusters containing the actives together with many inactives. Using hierarchical clustering, it was possible to identify active clusters containing 30-50% actives on average, whereas only 0.82% of the whole dataset was active. A reasonable cluster size appeared to be around five molecules per cluster, which means that about 20% of a dataset still has to be sampled in order to obtain valid results. The MACCS structural keys also performed best in predicting hydrophobicity, electrostatics, sterics, dispersion interactions, and hydrogen bonding of molecules. These factors are relevant for ligand-receptor interaction and molecular recognition, which explains why this descriptor system performs well in biological clustering.
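The hierarchical clustering described above can be sketched in a few lines of pure Python. The following is a minimal, naive single-linkage agglomerative scheme that merges the closest pair of clusters until the average cluster size reaches the roughly five molecules per cluster mentioned above; the distance function and items are placeholders, not any specific descriptor system.

```python
def agglomerative_clusters(dist, n, target_size=5):
    """Naive single-linkage agglomerative clustering (illustrative sketch).
    dist: function(i, j) -> distance between items i and j.
    n: number of items.
    Merges the closest pair of clusters until the average cluster
    size reaches target_size."""
    clusters = [[i] for i in range(n)]
    while len(clusters) > max(1, n // target_size):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters

# Hypothetical 1D "descriptor" values, two tight groups, for illustration
items = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(agglomerative_clusters(lambda i, j: abs(items[i] - items[j]), 6, 3))
```

A production implementation would of course operate on molecular fingerprints with a Tanimoto-based distance and a more efficient linkage update, but the merge logic is the same.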
Another statistical evaluation of random selection methods and of designed compound subsets for biological screening systems showed that the latter were indeed superior: about 3.5 times more compounds have to be sampled randomly to obtain results equivalent to those of a rational design approach. For initial screening, a Tanimoto similarity radius of 0.7 between similar compounds in the subset is expected to be sufficient to cover the relevant biological effects, whereas for lead refinement a Tanimoto similarity radius of 0.85 is recommended.
The question of at which level the diversity of combinatorial libraries should be assessed has also been widely discussed in the literature. Three main strategies have been proposed: in reactant-based design, the diversity of the reagents is evaluated, and these diverse reactants are assumed to give a diverse product library. Since the number of reagents is orders of magnitude lower than the number of products, this design process is very fast. However, it is not guaranteed that a maximally diverse set of reagents also yields a maximally diverse set of products. Product-based design therefore uses fully enumerated combinatorial libraries and picks the most diverse products from them. During this process, the combinatorial nature of the products is lost and the process becomes less efficient. A compromise between these two approaches is a reactant-biased/product-based algorithm: the combinatorial nature of the chemistry is maintained at the reactant level, but diversity is optimized at the product level by means of mathematical models such as genetic algorithms and simulated annealing. A general problem with these product-based approaches is that diversity has to be calculated for the whole virtual library, which requires very large computational resources. Which of these methods is ultimately more efficient remains unclear. On the one hand, there are reports showing that diversity design at the product level results in more diverse libraries than design at the reagent level. On the other hand, it could be shown that the efficiency of selection is equivalent at the product level and at the reagent level if the number of building blocks is sufficiently high. Only with very few building blocks (e.g. <7) is diversity at the product level significantly higher, since unique features might not be represented in the building blocks.
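Picking the most diverse products from an enumerated library is often done with a greedy MaxMin-style selection: repeatedly add the candidate whose minimum distance to the already-selected set is largest. A minimal sketch, using an arbitrary 1D distance in place of a real fingerprint metric (all item values are hypothetical):

```python
def maxmin_pick(items, dist, k):
    """Greedy MaxMin diversity selection (illustrative sketch).
    Seeds with the first item (an arbitrary choice), then repeatedly
    adds the item whose minimum distance to the selected set is largest.
    Returns the indices of the k selected items."""
    selected = [0]
    while len(selected) < k:
        rest = [i for i in range(len(items)) if i not in selected]
        nxt = max(rest,
                  key=lambda i: min(dist(items[i], items[j]) for j in selected))
        selected.append(nxt)
    return selected

# Hypothetical descriptor values: the outlier 10.0 is picked first
items = [0.0, 1.0, 2.0, 10.0]
print(maxmin_pick(items, lambda a, b: abs(a - b), 3))  # [0, 3, 2]
```

Note that each greedy step scans all remaining candidates, which is exactly why product-based selection over a fully enumerated virtual library becomes expensive; the reactant-biased/product-based methods mentioned above avoid this by letting a genetic algorithm or simulated annealing search over reagent combinations instead.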