Field of Science

The anti-question, or when bias can be a good thing

A recent publication indicates that more bias in the form of natural product scaffolds not yet synthesized could improve hit rates in screening

Most drug discovery projects begin with some kind of screening campaign in which millions of molecules are screened against a biological target. Even though the hit rate from High-Throughput Screening (HTS) can be quite low, HTS still provides one of the best starting points for discovering interesting new structures that display biological activity. In spite of this, there is frequent disappointment at the low hit rates from HTS, which can be as low as 0.05%.

But instead of focusing on the low hit rate from HTS, what if we express surprise that this hit rate is actually high? This thought takes me into a slight digression. In his remarkable book The Black Swan, the author Nassim Nicholas Taleb talks about an "anti-library", the set of all books you have not read. The anti-library is in some ways more important than your library because it really tells you what you are ignorant about.

Similarly, we can define an "anti-question". The anti-question is the opposite of the question we would usually ask. So instead of asking, "Why is this drug specific for this protein?", we could ask, "Why is this drug not hitting other proteins?". The value of the anti-question is that it forces us to analyze and evaluate things that we otherwise might not, and enables us to think outside the box. Just as the wise doctor constantly exhorts Detective Spooner in "I, Robot" to get to the all-important right question, so it could be important to get to the right anti-question.

In the context of HTS, the anti-question actually turns out to be logical. Instead of asking, "Why is the hit rate from HTS so low?", one should ask, "Given the number of small molecules in small-molecule space (~10^60) compared to the extremely low number typically screened in HTS campaigns (~10^6), why should we get any hits from HTS at all?". Even if we narrow the unimaginably large small-molecule universe down to more drug-like or lead-like entities, we still run into a numbers paradox, since even this number is orders of magnitude greater than what is usually screened.
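To put rough numbers on this anti-question, here is a back-of-the-envelope sketch. It uses only the order-of-magnitude figures quoted in this post (about 10^60 possible small molecules, about 10^6 screened, a 0.05% hit rate); the variable names are purely illustrative.

```python
from fractions import Fraction

# Order-of-magnitude figures quoted in the post, not measured values.
CHEMICAL_SPACE = 10**60          # approximate number of possible small molecules
SCREENED = 10**6                 # typical size of an HTS campaign
HIT_RATE = Fraction(5, 10_000)   # 0.05% written as an exact fraction

# Fraction of small-molecule space that a single campaign samples.
fraction_sampled = Fraction(SCREENED, CHEMICAL_SPACE)

# Number of hits expected from one campaign at the quoted hit rate.
expected_hits = SCREENED * HIT_RATE

print(f"fraction of space sampled: {float(fraction_sampled):.0e}")
print(f"expected hits per campaign: {int(expected_hits)}")
```

On these assumptions a campaign samples about one part in 10^54 of the space and still returns hundreds of hits, which is exactly why the anti-question is worth asking.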

In their most recent paper, Brian Shoichet and his team ask this important anti-question, and it leads them down an interesting road. Most campaigns that screen libraries focus on readily available commercial compounds and fragments that can be synthesized by organic chemists. This bias in turn reflects what has been more or less synthetically accessible through more than a hundred years of synthesis. Compared to this, the Kyoto Encyclopedia of Genes and Genomes (KEGG) contains metabolites whose structures are untainted by the minds of organic chemists. These are scaffolds among secondary metabolites and natural products that have simply been found.

There is another set of structures: the Generated Database (GDB), a theoretical set which contains all possible molecules of up to 11 heavy atoms drawn from the first-row elements (C, O, N, F). This number is not as large as may be imagined and amounts to about 26 million. In the study the authors essentially compare two sets against the KEGG compounds: the purchasable, commercial compounds found in their own annotated library called ZINC, and the GDB. They use a similarity measure called the Tanimoto coefficient, derived from 2D fingerprint comparison, to accomplish this. 2D fingerprints use various protocols to encode a molecule's substructures (atom types and the bond paths between them) as a bit string; the Tanimoto coefficient then scores two molecules by the number of bits they share divided by the number of bits set in either.
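For readers who have not met the Tanimoto coefficient, here is a minimal sketch of the calculation itself. It assumes each fingerprint is represented simply as the set of its "on" bit positions; the two example fingerprints are made up for illustration and do not correspond to real molecules or to the fingerprint type used in the paper.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    # Bits set in both fingerprints, divided by bits set in either.
    return len(fp_a & fp_b) / len(union)


# Hypothetical on-bit indices for two molecules (not real fingerprints).
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 8, 9, 12}

print(tanimoto(mol_a, mol_b))  # 3 shared bits out of 6 distinct bits -> 0.5
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits; in practice a cheminformatics toolkit would generate the fingerprints from the structures.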

The comparison indicates something interesting: the compounds in the purchasable set are much more similar to the KEGG compounds than are the compounds from the rest of the GDB. In other words, purchasable compounds contain scaffolds that are biased towards those in the KEGG. This is a good thing, since metabolites are usually primed by nature to show at least some biological activity. Another noteworthy finding was that the bias also increased with molecular size, as compounds became more drug-like or lead-like in terms of size.
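One simple way to turn "more similar to the metabolites" into a number is to take, for every compound in a library, its best Tanimoto score against any compound in the reference set, and then average those scores. The sketch below does this; it is an assumption made for illustration, not necessarily the exact protocol of the paper, and the fingerprints are toy sets of on-bit indices built so that the "purchasable" set overlaps the reference more than the "enumerated" one. The function and variable names are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared on-bits divided by on-bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_best_similarity(library, reference) -> float:
    """Average over the library of each compound's highest Tanimoto
    score against any compound in the reference set."""
    best = [max(tanimoto(fp, ref) for ref in reference) for fp in library]
    return sum(best) / len(best)


# Toy fingerprints (made up; they prove nothing about real libraries).
metabolites = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8})]
purchasable = [frozenset({1, 2, 3, 9}), frozenset({5, 6, 7, 8})]
enumerated = [frozenset({1, 10, 11, 12}), frozenset({13, 14, 15, 16})]

bias_purchasable = mean_best_similarity(purchasable, metabolites)
bias_enumerated = mean_best_similarity(enumerated, metabolites)

print(f"purchasable vs metabolites: {bias_purchasable:.2f}")
print(f"enumerated vs metabolites:  {bias_enumerated:.2f}")
```

With real data, a higher average for the purchasable set than for the enumerated set would be one way of expressing the bias described above.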

However, the more surprising and useful observation was that there are hundreds of scaffolds in the KEGG that are not present in the commercial library. The authors also do this comparison for other popular commercial libraries designed specifically for screening and find a similar result. The bottom line: while synthesized commercial libraries of molecules show a bias toward natural products and metabolites, there are also several natural product scaffolds that are not found in these libraries.

So what is the prescription? Introduce further bias! The compounds in the KEGG are more or less optimized for biological activity. If their scaffolds are not yet present in the commercial libraries, organic chemists should go ahead and focus on synthesizing these scaffolds and adding them to screening libraries. More such scaffolds could increase the hit rate in HTS by enriching libraries in biologically relevant scaffolds. Of course the usual caveats of false positives and promiscuous compounds should be kept in mind, and it's also not clear that proteins like kinases which are optimized to bind certain core scaffold structures would greatly benefit from these diverse scaffolds. But in terms of unmined drug space, introducing such further bias would be beneficial.

This study again goes to show the possibilities for finding new stars in the constellations and galaxies of the drug universe. Hopefully the universe will keep on expanding.

Hert, J., Irwin, J., Laggner, C., Keiser, M., & Shoichet, B. (2009). Quantifying biogenic bias in screening libraries. Nature Chemical Biology. DOI: 10.1038/nchembio.180


  1. How much do Diversity-Oriented Synthesis (DOS) libraries cover these new scaffolds?

  2. DOS is also ultimately subject to the techniques used by organic chemists in forming carbon-carbon bonds. I am sure it helps a little, but the challenges are not the same as those posed by nature, thus leading to limited scaffold diversity.

