

Bottom-up and top-down in drug discovery

There are two approaches to discovering new drugs. In one approach drugs fall in your lap from the sky. In the other you scoop them up from the ocean. Let’s call the first the top-down approach and the second the bottom-up approach.

The bottom-up approach assumes that you can discover drugs by thinking hard about them, by understanding what makes them tick at the molecular level, by deconstructing the dance of atoms orchestrating their interactions with the human body. The top-down approach assumes that you can discover drugs by looking at their effects on biological systems, by gathering enough data about them without understanding their inner lives, by generating numbers through trial and error, by listening to what those numbers are whispering in your ear.

To a large extent, the bottom-up approach assumes knowledge while the top-down approach assumes ignorance. Since human beings have been ignorant for most of their history, for most of the recorded history of drug discovery they have pursued the top-down approach. When you don't know what works, you try things out randomly. South Americans found out by accident that chewing the bark of the Cinchona tree relieved them of the afflictions of malaria. Through the Middle Ages and beyond, people who called themselves physicians prescribed a witches' brew of substances ranging from sulfur to mercury to arsenic to try to cure a corresponding witches' brew of maladies, from consumption to the common cold. More often than not these substances killed patients as readily as the diseases themselves.

The top-down approach may seem crude and primitive, and it was primitive, but it worked surprisingly well. For the longest time it was exemplified by the ancient medical systems of China and India – one of these systems delivered an antimalarial medicine that helped its discoverer bag a Nobel Prize for Medicine. Through fits and starts, scores of failures and a few solid successes, the ancients discovered many treatments that were often lost to the dust of ages. But the philosophy endured. It endured right up to the early 20th century, when the German physician Paul Ehrlich tested more than 600 chemical compounds - products of the burgeoning dye industry pioneered by the Germans - and found that compound number 606 worked against syphilis. Syphilis had so bedeviled people since medieval times that it was often a default diagnosis for death, and cures were desperately needed. Ehrlich's 606 was arsenic-based, unstable and had severe side effects, but the state of medicine was such back then that even this was regarded as a significant improvement over the mercury-based compounds that preceded it.

It was with Ehrlich's discovery that drug discovery started to transition to a more bottom-up discipline, systematically making and testing chemical compounds and understanding how they worked at the molecular level. But it still took decades before the approach bore fruit. For that we had to await a nexus of great and concomitant advances in theoretical and synthetic organic chemistry, spectroscopy and cell and molecular biology. These advances helped us figure out the structures of druglike organic molecules, they revealed the momentous fact that drugs work by binding to specific target proteins, and they allowed us to produce these proteins in useful quantities and uncover their structures. Finally, at the beginning of the 80s, we thought we had enough understanding of chemistry to design drugs by bottom-up approaches, "rationally", as if everything that had gone on before was simply the product of random flashes of unstructured thought. The advent of personal computers (Apple and Microsoft had launched in the mid-70s) and their immense potential left people convinced that it was only a matter of time before drugs were "designed with computers". What the revolution probably found inconvenient to discuss was that it was the top-down approach preceding it that had produced some very good medicines, from penicillin to Thorazine.

Thus began the era of structure-based drug design, which tries to design drugs atom by atom from scratch by knowing the protein glove in which these delicate molecular fingers fit. The big assumption is that the hand that fits the glove can deliver the knockout punch to a disease largely on its own. An explosion of scientific knowledge, startups, venture capital funding and interest from Wall Street fueled those heady times, with the upbeat conviction that once we understood the physics of drug binding well and had access to more computing power, we would be on our way to designing drugs more efficiently. Barry Werth's book "The Billion-Dollar Molecule" captured this zeitgeist well; the book is actually quite valuable since it's a rare as-it-happens study rather than a more typical retrospective one, and therefore displays the same breathless and naive enthusiasm as its subjects.

And yet, 30 years after the prophecy was enunciated in great detail and to great fanfare, where are we? First, the good news. The bottom-up approach did yield great dividends - most notably in the field of HIV protease inhibitor drugs against AIDS. I actually believe that this contribution from the pharmaceutical industry is one of the greatest public services that capitalism has performed for humanity. Important drugs for lowering blood pressure and controlling heartburn were also beneficiaries of bottom-up thinking.

The bad news is that the paradigm fell short of the wild expectations that we had from it. Significantly short in fact. And the reason is what it always has been in the annals of human technological failure: ignorance. Human beings simply don't know enough about perturbing a biological system with a small organic molecule. Biological systems are emergent and non-linear, and we simply don't understand how simple inputs result in complex outputs. Ignorance was compounded with hubris in this case. We thought that once we understood how a molecule binds to a particular protein and optimized this binding, we had a drug. But what we had was simply a molecule that bound better to that protein; we still worked on the assumption that that protein was somehow critical for a disease. Also, a molecule that binds well to a protein has to overcome enormous other hurdles of oral bioavailability and safety before it can be called a drug. So even if - and that's a big if - we understood the physics of drug-protein binding well, we still wouldn't be any closer to a drug, because designing a drug involves understanding its interactions with an entire biological system and not just with one or two proteins.

In reality, diseases like cancer manifest themselves through subtle effects on a host of physiological systems involving dozens if not hundreds of proteins. Cancer especially is a wily disease because it activates cells for uncontrolled growth through multiple pathways. Even if one or two proteins were the primary drivers of this process, simply designing a molecule to block their actions would be too simplistic and reductionist. Ideally we would need to block a targeted subset of proteins to produce the optimum effect. In practice, either our molecule would not bind even its favored protein tightly enough and would lack efficacy, or it would bind the wrong proteins and show toxicity. In fact, the reason no drug can escape at least a few side effects is precisely that it binds to many proteins other than the one we intended it to.

Faced with this wall of biological complexity, what do we do? Ironically, what we had done for hundreds of years, only this time armed with far more data and smarter data analysis tools. Simply put, you don't worry about understanding exactly how your molecule interacts with a particular protein; you worry instead only about its visible effects, about how much it changes your blood pressure or glucose levels, or how much it increases urine output or metabolic activity. These endpoints are agnostic to the drug's detailed mechanism of action. You can also compare these results across a panel of drugs to try to decipher similarities and differences.

This is top-down drug design and discovery, writ large in the era of Big Data and techniques from computer science like machine learning and deep learning. The field is fundamentally steeped in data analysis and takes advantage of new technology that can measure umpteen effects of drugs on biological systems, greatly improved computing power and hardware to analyze these effects, and refined statistical techniques that can separate signal from noise and find trends.

The top-down approach is today characterized mainly by phenotypic screening and machine learning. Phenotypic screening involves simply throwing a drug at a cell, organ or animal and observing its effects. In its primitive form it was used to discover many of today's important drugs; in the field of anxiety medicine for instance, new drugs were discovered by giving them to mice and simply observing how much fear the mice exhibited toward cats. Today's phenotypic screening can be more fine-grained, looking at drug effects on cell size, shape and elasticity. One study I saw looked at potential drugs for wound healing; the most important tool in that study was a high-resolution camera, and the top-down approach manifested itself through image analysis techniques that quantified subtle changes in wound shape, depth and appearance. In all these cases, the exact protein target the drug might be interacting with was a distant horizon and an unknown. The large scale, often visible, effects were what mattered. And finding patterns and subtle differences in these effects - in images, in gene expression data, in patient responses - is what the universal tool of machine learning is supposed to do best. No wonder that every company and lab from Boston to Berkeley is trying feverishly to recruit data and machine learning scientists and build burgeoning data science divisions. These companies have staked their fortunes on a future that is largely imaginary for now.

Currently there seems to be, if not a war, at least a simmering and uneasy peace between top-down and bottom-up approaches in drug discovery. And yet this seems to be mainly a fight where opponents set up false dichotomies and straw men rather than find complementary strengths and limitations. First and foremost, the ultimate proof of the pudding is in the eating, and machine learning's impact on the number of approved new drugs has yet to be demonstrated; the field is simply too new. The constellation of techniques has also proven much better at solving certain problems (mainly image recognition and natural language processing) than others. A lot of early-stage medicinal chemistry data consists of messy assay results and unexpected structure-activity relationships (SAR) riddled with "activity cliffs", in which a small change in structure leads to a large change in activity. Machine learning struggles with these discontinuous stimulus-response landscapes. Secondly, there are still technical issues in machine learning, such as dealing with sparse and noisy data, that have to be resolved. Thirdly, while the result of a top-down approach may be a simple image or change in cell type, the number of potential factors that can lead to that result can be hideously tangled and multifaceted. Finally, there is the perpetual paradigm of garbage-in-garbage-out (GIGO). Your machine learning algorithm is only as good as the data you feed it, and chemical and biological data are notoriously messy and ill-curated; chemical structures might be incorrect, assay conditions might differ in space and time, patient reporting and compliance might be sporadic and erroneous, human error riddles data collection, and there might be very little data to begin with. The machine learning mill can only turn data grist into gold if what it's provided with is grist in the first place.
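To make the activity-cliff problem concrete, here is a minimal sketch of how such cliffs can be flagged: compute a 2D fingerprint similarity for every pair of molecules and look for pairs that are nearly identical in structure but far apart in activity. The compounds, pIC50 values and cutoffs below are invented purely for illustration and are not taken from any particular dataset.

```python
# Minimal sketch: flag "activity cliffs" - pairs of structurally similar
# molecules with very different potencies - using RDKit fingerprints.
# All compounds, activities and cutoffs here are hypothetical illustrations.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

compounds = [
    ("O=C(Nc1ccccc1)c1ccccc1", 6.1),    # benzanilide, pIC50 (invented)
    ("Cc1ccccc1NC(=O)c1ccccc1", 8.3),   # ortho-methyl analog (invented)
    ("O=C(Nc1ccccc1F)c1ccccc1", 6.4),   # fluoro analog (invented)
]

mols = [Chem.MolFromSmiles(smi) for smi, _ in compounds]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

SIM_CUTOFF = 0.6   # "nearly the same structure" (illustrative threshold)
ACT_CUTOFF = 2.0   # 2 pIC50 units = a 100-fold change in potency

for i, j in combinations(range(len(compounds)), 2):
    sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
    dact = abs(compounds[i][1] - compounds[j][1])
    cliff = sim >= SIM_CUTOFF and dact >= ACT_CUTOFF
    print(f"pair ({i},{j}): Tanimoto {sim:.2f}, dpIC50 {dact:.1f}"
          + ("  <-- activity cliff" if cliff else ""))
```

A smooth statistical model fit to such data will tend to average away exactly the pairs this loop flags, which is why cliff-ridden SAR is a hard case for machine learning.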

In contrast to some of these problems with the top-down paradigm, bottom-up drug design has some distinct advantages. First of all, it has worked, and nothing speaks like success. Operationally, since you are usually looking at the interactions between a single molecule and a single protein, the system is much simpler and cleaner, and the techniques used to study it are less prone to ambiguous interpretation. Unlike machine learning, which can be a black box, here you can understand exactly what's going on. The amount of data might be smaller, but it may also be more targeted, manageable and reproducible. You don't usually have to deal with the intricacies of data fitting and noise reduction or the curation of data from multiple sources. At the end of the day, if, like HIV protease, your target does turn out to be the Achilles heel of a deadly disease like AIDS, your atom-by-atom design can be as powerful as Thor's hammer. There is little doubt that bottom-up approaches have worked in selected cases where the relevance of the target has been validated, and there is little doubt that this will continue to be the case.

Now it's also true that, just like the top-downers, the bottom-uppers have had their share of computational problems and failures, and both paradigms have been subjected to their fair share of hype. Starting from that "designing drugs using computers" headline in 1981, people have understood that there are fundamental problems in modeling intermolecular interactions: some of these problems are computational and in principle can be overcome with better hardware and software, but others, like the poor understanding of water molecules and electrostatic interactions, are fundamentally scientific in nature. The downplaying of these issues and the emphasizing of occasional anecdotal successes has led to massive hype in computer-aided drug design. But in the case of machine learning it's even worse in some sense, since hype from applications of the field in other human endeavors is spilling over into drug discovery too; it seems hard for some to avoid claiming that their favorite machine learning system will soon cure cancer because it's making inroads in trendy applications like self-driving cars and facial recognition. Unlike machine learning, though, the bottom-up approach has had a few decades of successes and failures to draw on, so there is a sort of lid on the hype, one that skeptics constantly point to.

Ultimately, the biggest advantage of machine learning is that it allows us to bypass detailed understanding of complex molecular interactions and biological feedback and work from the data alone. It's like a system of psychology that studies human behavior purely based on the stimuli and responses of human subjects, without understanding how the brain works at a neuronal level. The disadvantage is that the approach can remain a black box; it can lead to occasional predictive success, but at the expense of understanding. A good open question is how long we can keep predicting without understanding. Knowing how many unexpected events or "Black Swans" exist in drug discovery, how long can top-down approaches keep performing well?

The fact of the matter is that both top-down and bottom-up approaches to drug discovery have strengths and limitations and should therefore be part of an integrated approach to drug discovery. In fact they can hopefully work well together, like members of a relay team. I have heard of at least one successful major project at a leading drug firm in which top-down phenotypic screening yielded a valuable hit which was then handed over, midstream, to a bottom-up team of medicinal chemists, crystallographers and computational chemists who deconvoluted the target and optimized the hit all the way to an NDA (New Drug Application). At the same time, it was clear that the latter would not have been possible without the former. In my view, the old guard of the bottom-up school has been reluctant and cynical about admitting the young Turks of the top-down school into the guild, while the Turks have been similarly guilty of dismissing their predecessors as antiquated and irrelevant. This is a dangerous game of all-or-none in the very complex and challenging landscape of drug discovery and development, where only multiple and diverse approaches are going to allow us to discover the proverbial needle in the haystack. Only together will the two schools thrive, and there are promising signs that they might in fact be stronger together. But we'll never know until we try.

(Image: BenevolentAI)

Hit picking parties, eviscerating weaknesses and other vistas from the world of molecular docking

Here’s a good review of the pitfalls and promises of molecular docking by John Irwin and Brian Shoichet (UCSF) which is worth your time, especially if you are a non-specialist who wants a summary of what’s happening in the field. As the review notes, there are many first principles-based reasons why docking should not work – poor calculation of ligand conformations, poor treatment of protein and ligand electrostatics and desolvation, non-existent consideration of protein movement, wishful treatment of water molecules, sloppy representation of x-ray structures... the list just goes on. When docking 10⁷ or so ligands involving 10¹³ total configurations, any one of these “maddening details” can doom your study.

And yet, as the authors note both through general considerations and a few case studies, incremental but steady improvements in docking methodology have now made the technique respectable in most structure-based drug design campaigns (which, interestingly, have provided more drug candidates than HTS). Several factors have contributed to this respectability. The first is sheer throughput; no experimental technique can possibly screen 10 million ligands in a few days or weeks, so even with its flaws docking rises up at least as a potential complement to experiment. As the review notes, even a 10% success rate in finding new binders to a good target would be an improvement, and with well-defined binding sites the success rate can surpass high throughput screening (thus, the correct question to ask is not whether your method gives you false positives and negatives but whether this error rate is enough to overwhelm the experimental discovery rate for true positives; a toy calculation below makes this concrete). The second factor is the existence of massive databases like ZINC and ChEMBL containing millions of annotated ligands which could serve as starting points for ligand discovery.

Thirdly, while the holy grail of docking would be to correctly predict the absolute affinities of your ligands or at least to rank them, the more modest goal is to try to separate binders from non-binders and to discover novel chemotypes. Docking has been reasonably successful in meeting this goal, and the review presents several case studies that discovered interesting ligands which were dissimilar to known ones. In turn however, what really makes the discovery of novel chemotypes interesting is that it could lead to novel biology. It is this ability to potentially “break out of medicinal chemistry boxes” that makes docking attractive. For instance you could potentially find agonists by docking when you only found antagonists before, or – in what is one of the more interesting examples illustrated in the article – you could ‘deorphanize’ an enzyme by docking potential substrates to it and predicting its reaction profile. I still find this evidence anecdotal, but it's at least a good starting point for trying out things.
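Coming back to the hit-rate argument above, here is a back-of-the-envelope sketch of the trade-off; every number in it (library size, fraction of true binders, hit rates, compounds purchased) is an assumption chosen for illustration rather than a figure from the review.

```python
# Back-of-the-envelope sketch: why even an error-prone docking screen can beat
# picking compounds at random. All numbers below are illustrative assumptions.
library_size = 10_000_000   # ~10^7 ligands in the docking library
true_binders = 1_000        # assume only 0.01% of the library truly binds
docking_hit_rate = 0.10     # assume 1 in 10 top-ranked docking picks is real
picks = 50                  # compounds actually purchased and assayed

expected_docking_hits = picks * docking_hit_rate
expected_random_hits = picks * true_binders / library_size

print(f"Expected true hits from {picks} docking picks: {expected_docking_hits:.1f}")
print(f"Expected true hits from {picks} random picks:  {expected_random_hits:.4f}")
# ~5 hits versus ~0.005: the false positives are tolerable as long as the
# enrichment over random selection stays this large.
```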

The rest of the review is also useful, not least because - given that it's from the Shoichet lab - there's also an instructive checklist of caveats to keep in mind while experimentally screening ligands (PAINS, aggregators, etc.). There is also a discussion of using homology models for docking. Using homology models is tricky even for lead optimization, so I would be wary of applying them too widely for high-throughput docking. While the review does illustrate an interesting case involving a GPCR, I want to note that I did blog about this case when it was published and described how – new ligands notwithstanding – the calculation seemed to use enough computing power and models to light up a startup, along with copious expert input.

It’s this last point in the review which is really the crux of the matter. When the servers have cooled down and the electrons have stopped flowing, the most important equipment that one can bring to bear on a docking study is a good pair of eyes connected to an experienced brain. Even with small error rates "the scum can rise to the top" ("The scum is out there" could be a good tagline for an X-Files episode about HTS). There is little substitute for careful inspection of top docked hits and looking at things like strain, abnormal charged interactions and wrong tautomeric states; otherwise staring down a computer screen would present the same risks as staring down a gun barrel.

The Shoichet lab has occasional "hit-picking parties" where teams of medicinal and computational chemists examine docked structures, and I suspect these parties are more common in other places than you think (although probably not as common as they should be). It’s only when the high throughput-low accuracy domain of docking meets the low throughput-high accuracy domain of the human mind that docking will continue to be successful. Given what we have seen so far I think there are grounds for hope.

All active kinase states are similar but...

CDK2 bound to inhibitor staurosporine (PDB code: 1aq1)
…inactive kinase states are different in their own ways. This admittedly awkward rephrasing of Tolstoy's quote came to my mind as I read this new report on differences between active and inactive states of kinases as revealed by differential interactions with inhibitors. 

The paper provides a good indication of why kinases continue to be such enduring objects of interest for drug designers. Ever since imatinib (Gleevec) opened the floodgates to kinase drugs and dispelled the widespread belief that it might well be impossible to hit particular kinases selectively, researchers have realized that kinase inhibitors may work by targeting either the active or the inactive states of their target proteins. One intriguing observation emerging from the last few years is that inhibitors targeting inactive, monomeric states of kinases seem to provide better selectivity than those targeting phosphorylated, active states. In this study the authors (from Oxford, Padua and Newcastle) interrogate this phenomenon specifically for cyclin-dependent kinases (CDKs).

CDKs are prime regulators of the cell cycle and their discovery led to a Nobel Prize a few years ago. Over the last few years there have been several attempts to target these key proteins, not surprisingly especially in the oncology field; I worked on a rather successful kinase inhibitor project myself at one point. CDKs are rather typical kinases, having low activity when not bound to a cyclin but becoming active and ready to phosphorylate their substrates when cyclin-bound. What the authors did was to study the binding of a diverse set of inhibitors to a set of CDKs ranging from CDK2 to CDK9 by differential scanning fluorimetry (DSF). DSF reports on the binding affinity of inhibitors by way of the shift in the protein's melting temperature (Tm) upon ligand binding. Measuring these Tm shifts across different kinases can thus give you an idea of inhibitor selectivity.

The main observation from the study is that there is much more discrimination in the Tm values, and therefore the binding affinities of the inhibitors, when they are bound to the monomeric state than when they are bound to the active, cyclin-bound states. Interestingly, the study also finds that the same CDK in different states provides more discrimination in inhibitor binding than different CDKs in the same state. There is thus plenty of scope in targeting the same CDK based on its different states. In addition, binding to the monomeric states of specific CDKs will be a better bet for engineering selectivity than binding to the active, cyclin-bound states.
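As a toy illustration of how such DSF data can be read, the sketch below builds a small table of hypothetical Tm shifts (the values are invented, not taken from the paper) and compares how much the inhibitors discriminate between two states of one CDK versus between two CDKs locked in the same state.

```python
# Toy illustration of reading DSF selectivity data: Tm shifts (degrees C)
# for two inhibitors against two CDKs in two states. All values are invented.
import statistics

dTm = {
    "inhibitor_A": {("CDK2", "monomer"): 8.5, ("CDK2", "cyclin"): 4.1,
                    ("CDK9", "monomer"): 2.0, ("CDK9", "cyclin"): 3.8},
    "inhibitor_B": {("CDK2", "monomer"): 1.2, ("CDK2", "cyclin"): 3.9,
                    ("CDK9", "monomer"): 9.0, ("CDK9", "cyclin"): 4.2},
}

# Discrimination between two states of the same CDK (monomer vs cyclin-bound)
state_discrimination = statistics.mean(
    abs(v[("CDK2", "monomer")] - v[("CDK2", "cyclin")]) for v in dTm.values())
# Discrimination between two CDKs held in the same (active, cyclin-bound) state
kinase_discrimination = statistics.mean(
    abs(v[("CDK2", "cyclin")] - v[("CDK9", "cyclin")]) for v in dTm.values())

print(f"Avg |dTm| between states of one CDK:      {state_discrimination:.1f} C")
print(f"Avg |dTm| between CDKs in the same state: {kinase_discrimination:.1f} C")
# With these invented numbers, the two states of a single CDK separate the
# inhibitors far more than two different CDKs in the active state do.
```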

There are clearly significant structural differences between the inactive and active states of CDKs, mainly related to the movement of the so-called alphaC-helix. This study promisingly indicates that structure-based drug design experts can potentially target such structural features of the binding sites in the inactive states and design more selective drugs.

Protein-ligand crystal structures: WYSI(N)WYG

Crystal structures of proteins bound to small molecules have become ubiquitous in drug discovery. These structures are routinely used for docking, homology modeling and lead optimization. Yet as several authors have shown over the years, many of these structures can hide flaws that only become apparent on careful analysis of the underlying data. Worse still, some flaws never become apparent because nobody takes the trouble to look at the original data.

A recent paper from the group at OpenEye has a pretty useful analysis of flaws in crystal structures, carrying on a tradition most prominently exemplified by Gerard Kleywegt at Uppsala. The authors describe the common metrics used for picking crystal structures from the PDB and demonstrate their limitations. They propose new metrics to remedy the problems with these parameters. And they use these metrics to analyze about 700 crystal structures used in structure-based drug design and software validation, finding that only about 120 or so pass the rigorous criteria that distinguish good structures from bad ones.

The most important general message in the article is about the difference between accuracy and completeness of data, and the caveat that any structure on a computer screen is a model and not reality. Even very accurate-looking data may be incomplete, a flaw that is often neglected by modelers and medicinal chemists when picking crystal structures. For instance, the resolution of a protein structure is often used as a criterion for selecting one among many structures of the same protein from the PDB. Yet the resolution only tells you the minimum separation at which atoms can be distinguished from each other. It does not tell you whether the data is complete to begin with. A crystallographer can collect only 80% of the theoretically possible data and still report a resolution of 1.5 Å, in which case the structure is incomplete and possibly flawed for use in structure-based drug design. Another important metric is R-free, which is calculated from a subset of the data that is set aside and never used in refinement, and which therefore measures how well the model predicts data it has not seen; comparing it to the conventional R-factor (which measures the fit of the model to the data actually used in refinement) reveals overfitting. A large gap between R-free and the R-factor, or an R-free creeping toward 0.45 at resolutions of 3.5 Å or worse, is a red flag.
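As a concrete illustration of how a modeler might act on these numbers, here is a minimal sketch of a hypothetical triage helper for PDB entries; the function name and the cutoff values are illustrative rules of thumb, not the Iridium criteria from the OpenEye paper.

```python
# Minimal sketch of triaging PDB entries on completeness and refinement
# statistics before trusting them for modeling. Cutoffs below are
# illustrative rules of thumb, not the criteria from the paper.
def looks_trustworthy(entry):
    """entry: dict with 'resolution' (Angstrom), 'completeness' (fraction),
    'r_work' and 'r_free' as reported in the PDB header."""
    if entry["completeness"] < 0.90:              # too much missing data
        return False
    if entry["resolution"] > 2.5:                 # too coarse for ligand placement
        return False
    if entry["r_free"] - entry["r_work"] > 0.07:  # suspiciously large gap: overfitting
        return False
    if entry["r_free"] > 0.40:                    # model explains the data poorly
        return False
    return True

# Hypothetical entry: impressive nominal resolution but incomplete data
example = {"resolution": 1.5, "completeness": 0.80, "r_work": 0.19, "r_free": 0.24}
print(looks_trustworthy(example))  # False - completeness, not resolution, is the problem
```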

The OpenEye authors instead talk about a variety of measures that provide much better information about the fidelity of the data than resolution. The true measure of the data is of course the actual electron density. Any structure that is seen on the computer screen results from the fitting of a model to this electron density. While proteins are often fit fairly well to the density, the placement of ligands is often more ambiguous, partly because protein crystallographers are not always interested in the small molecules. The paper documents several examples of electron density that was either not fit or incorrectly fit by ligand atoms. In some cases sparse or even non-existent density was fit by guesswork, as in the illustration above. All these badly fit atoms show up in the final structure, but only an expert would know this. The only way to overcome these problems is to take a look at the original electron density, which is unfortunately a task that most medicinal chemists and modelers are ill-equipped for.

The worst problem with these crystal structures is that because they look so accurate and nice on a computer screen, they may fall into Donald Rumsfeld's category of "unknown unknowns". But even with "known unknowns" the picture is a disturbing one. When the authors apply all their filters and metrics to a set of 728 protein-ligand structures frequently used in benchmarking docking programs and in actual structure-based drug design projects, they find that only 17% or so (121) of the structures make it through. They collect these structures together and call the resulting database Iridium which can be downloaded from their website. But the failure of the 580 or so common structures used by leading software vendors and pharmaceutical companies to pass important filters leaves you wondering how many resources we may have wasted by using them and how much uncertainty is out there. 

Something to think about, especially when considering the 50,000 or so protein structures in the PDB. What you see may not be what you get.

Modeling magic methyls


Shamans have their magic mushrooms, we medicinal chemists have our magic methyls. The 'magic methyl effect' refers to the sometimes large and unexpected change in drug potency resulting from the addition of a single methyl group to a molecule (for laymen, a methyl group is a single carbon with three hydrogens, and you don't expect spectacular effects from the addition of such a small modification to a drug). This is part of a broader paradigm in chemistry in which small changes in molecular structure can bring about large changes in properties, especially biological activity.

In this study, researchers from Bill Jorgensen's group at Yale ask what exactly a methyl group can do for a biologically active molecule. They look at more than 2000 cases of purported activity changes induced by methyl groups, drawn from two leading medicinal chemistry journals, and their work highlights some unexpected effects of methyls. The techniques used are free-energy perturbation, Monte Carlo and molecular dynamics simulations, applied to compare methylated and non-methylated versions of published inhibitors in an effort to gain insight into the factors dictating potency.

Firstly, they find that for all their reputation, methyls mostly confer a modest increase in potency. The greatest increase is about 3 kcal/mol in free energy, which, considering the exponential relationship between free energy and binding constant, is actually quite substantial. But this happens in a negligible minority of cases: a 10-fold boost in potency from a methyl is seen in only 8% of cases, and a 100-fold boost in only 0.4%.
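For readers who want to see why 3 kcal/mol counts as substantial, the arithmetic follows from the standard textbook relation between binding free energy and binding constant, ΔΔG = RT·ln(fold change), at room temperature; the short sketch below is just that conversion, not a calculation from the paper.

```python
# Convert a gain in binding free energy (kcal/mol) into the corresponding
# fold-improvement in binding constant via ddG = RT * ln(fold) at 298 K.
import math

R = 0.001987  # gas constant in kcal/(mol*K)
T = 298.0     # room temperature in kelvin

def fold_change(ddg_kcal_per_mol):
    return math.exp(ddg_kcal_per_mol / (R * T))

for ddg in (1.4, 2.0, 3.0):
    print(f"{ddg:.1f} kcal/mol -> ~{fold_change(ddg):.0f}-fold tighter binding")
# Roughly: 1.4 kcal/mol is a 10-fold boost, 2 kcal/mol about 30-fold,
# and 3 kcal/mol well over 100-fold.
```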

So what does a methyl do? For starters, a methyl is simply a nice, small, lipophilic group, so you would expect it to give you some advantage simply by fitting snugly into an otherwise unoccupied binding pocket. But the real advantage of a methyl is thought to come from kicking out 'unhappy' water molecules; often a small protein pocket is occupied by a highly constrained water molecule that is desperate to join its free brethren in the bulk. A methyl group is usually only too happy to oblige and kick the water out. Now, as the crystallographer Jack Dunitz demonstrated more than a decade back, the maximum free energy gain you could expect from displacing a water molecule is about 2 kcal/mol. Considered in this light, a hydrogen-to-methyl change should gain at most about that much, yet the rare 3 kcal/mol cases exceed this limit. Clearly the common wisdom about methyls displacing waters is not telling us the entire story.

As the authors demonstrate, the common wisdom may indeed point to uncommon cases. They look at five cases where methyls give the greatest potency boost, and in no case do they find evidence for displacement of a water molecule. So where's the potency gain coming from? It turns out that it may be coming from a common but often underappreciated factor: conformational reorganization. Before a ligand binds to a protein, it exists in many conformations in solution, often hundreds. How tightly the protein can bind the ligand depends in part on how much energy it takes to twist and turn these conformations into the single bound conformation. You would expect that the more similar a molecule's unbound conformation in solution is to its protein-bound conformation, the easier it would be for the protein to latch on to it.

And indeed, that's what they find. Most of the cases they look at concern potency gains from putting methyls at the ortho position of a biaryl ring. Organic chemists are quite familiar with the steric, planarity-disrupting effects of ortho substituents on biaryl rings; in fact disrupting crystal packing this way is a tried and tested strategy for improving solubility. It turns out that the bound structures of these molecules adopt a twisted, non-planar conformation. In the absence of methyls, the rings prefer to stay nearly coplanar (or at least less non-planar) in the unbound state. Putting methyl groups on twists the rings in the unbound conformation into a form that's similar to the bound one; basically there's more overlap between the solution and bound conformations for the methylated versions than for the non-methylated ones. Consequently, the protein has to expend less energy to turn an already similar conformation into its bound counterpart. This becomes clear simply by comparing the single dihedral angle bridging the two rings in the bound and unbound conformations of one particular molecule (as shown in the figure above).
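The sketch below illustrates the same idea on a generic pair of molecules (plain biphenyl versus 2-methylbiphenyl, not the compounds from the study): generate a handful of conformers with RDKit and measure the biaryl twist, which the ortho-methyl should push further from planarity.

```python
# Illustrative sketch (generic biphenyls, not the molecules from the paper):
# how an ortho-methyl biases the unbound biaryl dihedral away from planarity.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms

def biaryl_twists(smiles, n_confs=20):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
    AllChem.MMFFOptimizeMoleculeConfs(mol)
    # Four atoms spanning the single bond that connects the two aromatic rings
    torsion = mol.GetSubstructMatch(Chem.MolFromSmarts("c:c-c:c"))
    angles = [abs(rdMolTransforms.GetDihedralDeg(conf, *torsion))
              for conf in mol.GetConformers()]
    # Report the deviation from coplanarity (0 = flat, 90 = perpendicular)
    return [min(a, 180 - a) for a in angles]

for name, smi in [("biphenyl", "c1ccc(-c2ccccc2)cc1"),
                  ("2-methylbiphenyl", "Cc1ccccc1-c1ccccc1")]:
    twists = biaryl_twists(smi)
    print(f"{name}: twist from coplanarity spans "
          f"{min(twists):.0f}-{max(twists):.0f} degrees")
```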

The study imparts an important general lesson: when designing ligands or drugs to bind a protein, it is just as important to take the solution conformations into account as the bound one. It's not only how the protein interacts with the drug that matters; what the drug is doing before it meets the protein is equally important.

A molecular modeler's view of molecular beauty

The last few months have seen discussion on the definition of "beauty" as applied to molecular structures. As a molecular modeler, I ended up ruminating on what I think are the features of molecules that I would call beautiful.

This kind of exercise may seem pointless since it's highly subjective. And yet I think that exploring these definitions is valuable because at least a few of the conflicts that occasionally erupt between groups of interdisciplinary scientists seem to reside in divergent definitions of molecular beauty. A typical scenario is when a modeler designs a molecule that seems to fit perfectly in a protein binding pocket, only to have it dismissed by the medicinal chemist because it's too hard to make. Or because it looks like a detergent. In this case, structural beauty has not necessarily translated to practical beauty - always a consideration when one is talking about molecules which ultimately have to be synthesized. In fact, projects can often progress when different scientists agree - often unconsciously - on a notion of beauty or utility and they can falter when these ideas lead different scientists down very different paths.

So what kind of molecules does a modeler find pleasing? Let's first look at structure-based design, which in its broadest sense entails acquiring crystal structures of proteins and finding and tailoring molecules to fit the binding sites of those proteins. Beauty in structure-based design is not hard to appreciate since it inherently involves protein structures, paragons of symmetry and parsimonious elegance. In this sense the modeler has much more in common with crystallographers than with medicinal chemists. He or she is always struck by how precisely oriented the amino acid residues in a protein's binding site are. Consequently, it's not surprising that for a modeler, a particularly beautiful molecule may be one that makes interactions with most or all of those residues, especially in the form of hydrogen bonds with geometries so perfect that they might have fallen from the sky. This preference for geometric complementarity is extremely general. A particularly pleasing manipulation for a modeler may be one that leads to a perfect fit for a given protein cavity, with the contours of the molecule's surface following those of the protein.

Yet there are limits to how far this beauty-from-complementarity can take you. Designing a molecule with interactions that are too tightly knit and precise may leave very little room for error. And considering the inherent inaccuracy of many modeling algorithms, it may be best to always be forgiving and give the design some space to move around in the binding site, so that slight reorganizations of both protein and ligand can still result in reasonably tight binding.

While we are on the subject of shape complementarity, let's not forget a factor which has only recently begun to be taken into account - the water molecules filling the active site. These water molecules, which are often highly constrained, can have a perverse beauty of their own; they are sometimes trapped in both an entropic (no freedom of movement) and an enthalpic (unable to satisfy their optimal four hydrogen bonds) cage, and yet they are the only things keeping the protein from what technically must be called a vacuum. These water molecules deserve to be freed, and there is a particularly deep kind of satisfaction in adding a functional group to a ligand that displaces them from the protein's tight embrace and enables them to join the ranks of their hydrogen-bonded brethren. A modeler could regard a modification precisely tailored to improve affinity by freeing up a water molecule as a particularly beautiful case of molecular engineering.

Yet as we have sometimes seen, the beauty of filling protein pockets with perfectly matched shapes may lead to a conflict with other practitioners of the art. For starters, a relentless march toward adding functionality can inflate at least two important properties - size and lipophilicity - leading to what has been called "molecular obesity". Dents in beauty can also come from too many chiral centers, aromatic rings or quaternary carbons. Real beauty in structure-based design arises from a precise balance of complementarity and simplicity in the design. And of course there is the always important consideration of synthetic accessibility, which is why every modeler should ideally take a basic course in organic synthesis.

Beyond structure-based design, there are other considerations that endear molecules to modelers. Protein structures are not always available, and in their absence, molecular similarity alone is sometimes used to find novel active molecules. The question is simple: given a known molecule of high potency, how do you discover molecules similar to it in structure? The answer is complicated. The key realization is that this similarity must be defined in three dimensions, since all molecular interactions happen in 3D. Conformational similarity is one of the most elegant considerations for a modeler, and one that synthetic chemists often don't contemplate. Very few things provide as much satisfaction to a modeler as the observation that two molecules with very different 2D structures seem to adopt the same biologically active 3D conformation in solution and in the binding pocket. And a particularly gratifying and rare find - for both modelers and synthetic chemists - is a molecule which nature has sterically constrained to a single dominant conformation by the judicious placement of certain functional elements. Discodermolide, a microtubule-binding agent which I worked on in graduate school, belongs to this category.
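For what this looks like in practice, here is a minimal sketch using RDKit: embed 3D conformations of two molecules with different 2D scaffolds (ibuprofen and naproxen, picked purely as familiar examples, not compounds discussed here), align them with Open3DAlign, and score the overlap of their shapes.

```python
# Minimal sketch: judge similarity in 3D rather than 2D by aligning two
# molecules with different scaffolds and scoring their shape overlap.
# Ibuprofen and naproxen are used purely as familiar, arbitrary examples.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

def embed_3d(smiles):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=7)   # generate one 3D conformation
    AllChem.MMFFOptimizeMolecule(mol)
    return mol

ref = embed_3d("CC(C)Cc1ccc(cc1)C(C)C(=O)O")       # ibuprofen
probe = embed_3d("COc1ccc2cc(ccc2c1)C(C)C(=O)O")   # naproxen

o3a = rdMolAlign.GetO3A(probe, ref)  # Open3DAlign: MMFF-typed 3D alignment
o3a.Align()
dist = rdShapeHelpers.ShapeTanimotoDist(probe, ref)
print(f"Shape Tanimoto distance after alignment: {dist:.2f} (0 = identical shapes)")
```

The point of the exercise is simply that two molecules which look quite different on paper can end up close in shape space once you compare them as 3D objects.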

Those, then, are the criteria that come to my mind for defining molecular elegance - perfect (but not too perfect) shape complementarity to a precisely structured binding site, a well-defined 3D conformation often shared with a completely different molecule, and simplicity of structure in spite of these other demanding features. Conflicts arise when modelers' and chemists' notions of molecular pulchritude diverge. And it's a happy day when these sensibilities overlap - which happens more often than you might think.