The Curious Wavefunction: April 2014

The structure of DNA, 61 years later: How they did it.

By Wavefunction on Wednesday, April 30, 2014

"A Structure for Deoxyribose Nucleic Acid", Nature, April 25, 1953 (Image: Oregon State University)

This month marks the sixty-first anniversary of the publication of the landmark paper on the structure of DNA by Watson and Crick, which appeared in the April 25, 1953 issue of the journal Nature. Even sixty-one years later the discovery is endlessly intriguing, not just because it's so important but because in 1953, both Watson and Crick were rather unlikely characters to have made it. In 2012 I wrote a post for the Nobel Week Dialogue event in Stockholm with a few thoughts on what exactly it was that allowed the duo to enshrine themselves in the history books; it was not sheer brilliance, it was not exhaustive knowledge of a discipline, but it was an open mind and a relentless drive to put disparate pieces of the puzzle together. Here's that post.

Somehow it all boils down to 1953, the year of the double helix. And it’s still worth contemplating how it all happened.

Science is often perceived as either a series of dazzling insights or as a marathon. Much of the public recognition of science acknowledges this division; Nobel Prizes for instance are often awarded either for a long, plodding project that is sustained by sheer grit (solving a protein crystal structure), a novel idea that seems to be an inspired work of sheer genius (formulating the Dirac equation) or an accumulated body of work (organic synthesis).

But in one sense, both these viewpoints of science are flawed since both of them tend to obscure the often haphazard, unpredictable, chancy and very human process of research. In reality, the marathon runner, the inspired genius and every scientist in between the two tread a tortuous path to the eureka moment, a path that’s highlighted by false alleys, plain old luck, unexpected obstacles and most importantly, the human obstacles of petty rivalry, jealousy, confusion and misunderstanding. A scientific story that fully captures these variables is, in my opinion, emblematic of the true nature of research and discovery. That is why the discovery of the double helix by Watson and Crick is one of my favorite stories in all of science.

The reason why that discovery is so appealing is because it really does not fit into the traditional threads of scientific progress highlighted above. During those few heady days in Cambridge in the dawn of those gloomy post-war years, Watson and Crick worked hard. But their work was very different from, say, the sustained effort akin to climbing a mountain that exemplified Max Perutz’s lifelong odyssey to solve the structure of hemoglobin. It was also different from the great flashes of intuition that characterized an Einstein or a Bohr, although intuition was applied to the problem – and discarded – liberally. Neither of the two protagonists was an expert in the one discipline that they themselves acknowledged mattered most for the discovery – chemistry. And although they had a rough idea of how to do it, neither really knew what it would take to solve the problem. They were far from being experts in the field.

And therein lies the key to their success. Because they lacked expertise and didn’t really know what would solve the problem, they tried all approaches at their disposal. Their path to DNA was haphazard, often lacking direction, always uncertain. Crick, a man who already considered himself an overgrown graduate student in his thirties, was a crystallographer. Watson, a precocious and irreverent youngster who entered the University of Chicago when he was fifteen, was in equal parts geneticist and bird-watcher. Unlike many of their colleagues, both were firmly convinced that DNA and not protein was the genetic material. But neither of them had the background for understanding the chemistry that is essential to DNA structure; the hydrogen bonding that holds the bases together, the acid-base chemistry that ionizes the phosphates and dictates their geometric arrangement, the principles of tautomerism that allow the bases to exist in one of two possible forms; a form that’s crucial for holding the structure together. But they were willing students and they groped, asked, stumbled and finally triumphantly navigated their way out of this conceptual jungle. They did learn all the chemistry that mattered, and because of Crick they already understood crystallography.

And most importantly, they built models. Molecular models are now a mainstay of biochemical research. Modelers like myself can manipulate seductively attractive three-dimensional pictures of proteins and small molecules on computer screens. But modeling was in its early days in the fifties. Ironically, the tradition had been pioneered by the duo’s perceived rival, the chemist Linus Pauling. Pauling, who would be widely considered the greatest chemist of the twentieth century, had successfully applied his model-building approach to the structure of proteins. Lying in bed with a bad cold during a visiting sojourn at Oxford University, he had folded paper and marked atoms with a pencil to conform to the geometric parameters of amino acids derived from simple crystal structures. The end product of this modeling combined with detailed crystallographic measurements was one of twentieth century biochemistry’s greatest triumphs; the discovery of the alpha-helical and beta-sheet structures, foundational structural elements in virtually every protein in nature. How exactly the same model-building later led Pauling to an embarrassing gaffe in his own structure of DNA that violated basic chemical principles is the stuff of folklore, narrated with nonchalant satisfaction by Watson in his classic book “The Double Helix”.

Model building is more art than science. By necessity it consists of patching together imperfect data from multiple avenues and techniques using part rational thinking and part inspired guesswork and then building a picture of reality – only a picture – that’s hopefully consistent with most of the data and not in flagrant violation of important pieces. Even today modeling is often regarded skeptically by the data-gatherers, presumably because it does not have the ring of truth that hard, numerical data has. But data by itself is never enough, especially because the methods to acquire it are themselves incomplete and subject to error. It is precisely by combining information from various sources that one expects to somehow cancel these errors or render them unimportant, so that the signal from one source complements its absence in another and vice versa. The building of a satisfactory model thus often necessarily entails understanding data from multiple fields, each part of which is imperfect.

Watson and Crick realized this, but many of their contemporaries tackling the same problem did not. As Watson recounts it in a TED talk, Rosalind Franklin and Maurice Wilkins were excellent crystallographers but were hesitant to build models using imperfect data. Franklin especially came tantalizingly close to cracking DNA. On the other hand Erwin Chargaff and Jerry Donohue, both outstanding chemists, were less appreciative of crystallography and again not prone to building models. Watson and Crick were both willing to remedy their ignorance of chemistry and to bridge the river of data between the two disciplines of chemistry and crystallography. Through Donohue they learnt about the keto-enol tautomerism of the bases that gave rise to the preferred chemical form. From Chargaff came crucial information regarding constancy of the ratios of one kind of base (purines) to another (pyrimidines); this information would be decisive in nailing down the complementary nature of the two strands of the helix. And through Rosalind Franklin they got access – in ways that even today spark controversy and resentment – to the best crystallographic data on DNA that then existed anywhere.

What was left to do was to combine these pieces from chemistry and crystallography and put together the grand puzzle. For this model building was essential; since Watson and Crick were willing to do whatever it took to solve the structure, to their list of things-to-do they added model building. Unlike Franklin and Wilkins, they had no qualms about building models even if it meant they got the answer partially right. The duo proceeded from a handful of key facts, each of which other people possessed, but none of which had been seen by the others as part of an integrated picture. Franklin especially had gleaned very important general features of the helix from her meticulous diffraction experiments and yet failed to build models, remaining skeptical about the very existence of helices until the end. It was the classic case of the blind men and the elephant.

The facts which led Watson and Crick down the road to the promised land included a scattered bundle of truths about DNA from crystallography and chemistry; the distance between two bases (3.4 Å), the distance per turn of the helix (34 Å) which in turn indicated a distribution of ten bases per turn, the diameter of the helix (20 Å), Chargaff’s rules indicating equal ratios of the two kinds of bases, Alexander Todd’s work on the points of linkage between the base, sugar and phosphate, Donohue’s important advice regarding the preferred keto form of the bases and Franklin’s evidence that the strands in DNA must run in opposite directions. There was another important tool they had, thanks to Crick’s earlier mathematical work on diffraction. Helical-diffraction theory told them the kind of diffraction pattern that one would expect if the structure were in fact helical. This reverse process – predicting the expected diffraction parameters from a model – is today a mainstay of the iterative process of structure refinement used by x-ray crystallographers to solve structures as complex as the ribosome.
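The step from the two measured distances to "ten bases per turn" is simple arithmetic, and it can be sketched in a few lines of Python. This is only an illustration using the three distances quoted above; the function names and the derived twist angle per base are additions for the sake of the example, not something taken from the 1953 paper.

```python
# Distances quoted in the text, in angstroms.
RISE_PER_BASE_A = 3.4   # spacing between adjacent bases along the helix axis
PITCH_A = 34.0          # length of one full helical turn
DIAMETER_A = 20.0       # width of the helix


def bases_per_turn(pitch: float, rise: float) -> float:
    """Number of bases stacked within one full turn of the helix."""
    return pitch / rise


def twist_per_base_deg(pitch: float, rise: float) -> float:
    """Rotation about the helix axis from one base to the next, in degrees."""
    return 360.0 / bases_per_turn(pitch, rise)


if __name__ == "__main__":
    n = bases_per_turn(PITCH_A, RISE_PER_BASE_A)
    twist = twist_per_base_deg(PITCH_A, RISE_PER_BASE_A)
    print(f"bases per turn: {n:.1f}")          # bases per turn: 10.0
    print(f"twist per base: {twist:.1f} deg")  # twist per base: 36.0 deg
    print(f"helix radius:   {DIAMETER_A / 2:.1f} A")
```

A model builder needs exactly these kinds of derived numbers: once the repeat and the rise are fixed, the rotation between successive bases follows, and the metal parts either fit those constraints or they do not.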

Using pieces from the metal shop in Cambridge, Watson gradually accumulated a list of parts for the components of DNA and put them together even as Crick offered helpful advice. Once the pieces were in place, the duo were in the position of an airline pilot who has every signpost, flag and light on the runway paving his way for a perfect landing. The end-product was unambiguous, incisive, elegant, and most importantly, it held the key to understanding the mechanism of heredity through complementary base-pairing. Franklin and Wilkins came down from London; the model was so convincing that even Franklin graciously agreed that it had to be correct. Everyone who saw the model would undoubtedly have echoed Watson and Crick’s sentiment that “a structure this beautiful just had to exist”.

In some sense the discovery of the DNA structure was easy; as Max Perutz once said, the technical challenges that it presented were greatly mitigated because of the symmetry of the structure compared to the controlled but tortuous asymmetry inherent in proteins. Yet it was Watson and Crick and not others who made this discovery and their achievement provides insight into the elements of a unique scientific style. Intelligence they did not lack, but intelligence alone would not have helped, and in any case there was no dearth of it; Perutz, Franklin, Chargaff and Pauling were all brilliant scientists who in principle could have cracked open the secret of life which its discoverers proudly touted that day in the Eagle Pub.

But what these people lacked, what Watson and Crick possessed in spades, was a drive to explore, interrogate, admit ignorance, search all possible sources and finally tie the threads together. This set of traits also made them outsiders in the field, non-chemists who were trying to understand a chemical puzzle; in one sense they appeared out of nowhere. But because they were outsiders they were relatively unprejudiced. Their personalities cast them as misfits and upstarts trying to disrupt the established order. Then there was the famous irreverence between them; Crick once said that politeness kills science. All these personal qualities certainly helped, but none was as important as a sprightly open-mindedness that was still tempered by unsparing rigor, the ability to ask for and use evidence from all quarters while constraining it within reasonable bounds all the time; this approach led to model building almost as a natural consequence. And the open-mindedness also masked a fearlessness that was undaunted by the imperfect nature of the data and the sometimes insurmountable challenges that seemed to loom.

So that’s how they did it; by questioning, probing, conjecturing and model building even in the presence of incomplete data, and by fearlessly using every tool and idea at their disposal. As we approach problems of increasing biological complexity in the twenty-first century, this is a lesson we should keep in mind. Sometimes when you don’t know what approach will solve a problem, you try all approaches, all the time constraining them within known scientific principles. Richard Feynman once defined scientific progress as imagination in a straitjacket, and he could have been talking about the double helix.

First published on Nobel Week Dialogue and the Scientific American Blog Network.

Drug costs and prices: Here we go again

By Wavefunction on Friday, April 25, 2014

Gilead's hepatitis C drug sofosbuvir (Sovaldi)

Misleading statements and conclusions regarding drug costs and prices are again being thrown around. It started with a post right here on Scientific American Blogs with the title "The Quest: $84,000 Miracle Cure Costs Less Than $150 to Make". As the title indicates, the post is about a new hepatitis C drug called Sovaldi developed by Gilead Sciences. The $150 was the manufacturing cost, the $84,000 was the price. The medicine is considered a real breakthrough and both Gilead and its shareholders have been rewarded with handsome profits during the last quarter. Sovaldi is also regarded as a first-in-class treatment for what was always considered a highly refractory disease.

To its credit the piece was measured and did make the point that the $84,000 was less than the hospitalization and liver transplant costs incurred by hep C patients until now. But unfortunately I think the title of the post was inevitably misleading because it made it sound like every extra penny in addition to the $150 was a profit margin. It also did not put the cost of developing the drug in the context of the very significant barriers to new drug development in the form of formidable scientific challenges, patent cliffs and FDA hurdles. The fact is that drug pricing is a very complex beast and while the system is definitely in need of reform and rational thinking, comparing actual drug production costs with price is basically a futile endeavor that is far more likely to mislead than to enlighten.

Was I mistaken in thinking the title was misleading? Not really. A day later, a Facebook post by Scientific American highlighting the same post ran with the headline "Profit Margin for Hep C Drug: Approximately one gazillion percent". This headline is egregiously, woefully wrong: Gilead's profit margin was 44% during the last quarter, and the standard profit margins for pharmaceutical companies are about 20%. The title is grossly simplistic and says nothing about the cost of discovering (not manufacturing) the drug. Scientific American has 1.7 million followers on Facebook so this title is going to mislead quite a few, as is evident from the comments section of that post.
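The confusion behind that headline is between two different ratios, and a small Python sketch makes the gap concrete. It uses only the $84,000 price and $150 manufacturing cost quoted above; the function names are illustrative, and the per-unit figures say nothing about R&D, trials, failures or taxes, which is the whole point.

```python
PRICE = 84_000.0             # list price of a course of treatment (from the text)
MANUFACTURING_COST = 150.0   # quoted cost to make it (from the text)
REPORTED_NET_MARGIN = 0.44   # Gilead's quarterly profit margin (from the text)


def markup_on_cost(price: float, unit_cost: float) -> float:
    """Markup relative to unit manufacturing cost: (price - cost) / cost."""
    return (price - unit_cost) / unit_cost


def unit_gross_margin(price: float, unit_cost: float) -> float:
    """Share of the price left after manufacturing cost: (price - cost) / price."""
    return (price - unit_cost) / price


if __name__ == "__main__":
    print(f"markup on manufacturing cost: {markup_on_cost(PRICE, MANUFACTURING_COST):.0%}")
    print(f"per-unit gross margin:        {unit_gross_margin(PRICE, MANUFACTURING_COST):.1%}")
    print(f"company-wide profit margin:   {REPORTED_NET_MARGIN:.0%}")
```

The first number is enormous because the denominator is only the cost of making the pills. The last one is far smaller because a company's profit margin is computed after every other expense, including the research that produced the drug in the first place.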

The bigger question to ask is, why are the profit margins so apparently high for Sovaldi? And that's a question the post did not address. The fact is that this class of hepatitis C drugs constitutes a real breakthrough in the treatment of the disease, and one that has been sought for decades. Until now the standard of treatment for the disease consisted of a combination of PEGylated interferon and ribavirin, a stop-gap measure that results in debilitating flu-like symptoms acutely affecting quality of life. The new class of medications directly targets a viral protein called a polymerase that the virus uses to make new copies of itself. It hits the source and is, by any definition, a vastly superior treatment compared to what we had before. For years the pharmaceutical industry has been rightly lambasted for making "me-too" drugs, medicines that are marginally better than what existed on the market and whose high sales and profits are driven mainly by aggressive marketing rather than by real benefits. In this case we actually have a novel drug that provides real benefits, a significant achievement on the part of drug research that needs to be appreciated. Let's criticize high costs, but let's not ignore the improvements in quality of life for patients that simply did not exist before. And let's not fail to congratulate the teams of researchers who actually discover these novel medicines floating in a sea of me-too drugs; it's as big a testament to human ingenuity and perseverance as anything else in scientific history.

Does the price of Sovaldi sound high? Undoubtedly. But that's when I invoke my own "Law of Large Numbers" which roughly says, "The kind of reaction you have when you see a large number seldom has much to do with what that number really means". As with many other numbers we have to put the price in context. As the original post notes, it's still lower than what the price of hospitalization and liver transplants would have been. If you have insurance then your insurance company - encouraged largely by this comparative calculation - should take care of it. Is it still not accessible to a vast number of people in poor countries? Of course; I would like that to happen as much as anything else, especially since hep C is still a disease that puts most of its burden on the poor. But it's at least accessible to a few people whose quality of life will now be much better and it's still better than nothing. That's an important step in the right direction. As happened with AIDS drugs, at some point the medication will in fact become cheap enough, especially when generics take its place a few years down the line. But none of this would have happened if pharmaceutical companies stopped making profits and had to shut down because they couldn't recover R&D costs.

The most important context we need to understand again is one that is often neglected: a comparison of the price with the high cost of drug discovery and development rather than manufacturing, which is always going to be low. The bare fact of the matter is that it takes about $5 billion to develop a new drug and this has little to do with profit margins. As a hypothetical example, even if the profit margin on Sovaldi was 0% it would still cost about $50,000 and you would still hear a lot of criticism. The cost of a drug is not $5 billion because there's much profit to be made - as I indicated above pharmaceutical profit margins are about 20% - but because drug discovery is so hard. One can quibble about the exact number but that would only serve to obscure the real challenges. I have written four posts on the complexities of drug discovery science and I intend to write more. The reality is that Sovaldi would not have been invented in the first place had companies like Gilead not put enough money from profits into the R&D that went into its development. And sure, some pharmaceutical CEOs make obscene bonuses and we need to have important conversations about that, but even that part does not significantly contribute to the high costs.
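One simple way to read the zero-profit hypothetical is as a break-even calculation: spread the development cost evenly over the patients treated. The Python sketch below does that under loud assumptions. It ignores manufacturing, marketing and every other cost, and the patient count it prints is merely what the $5 billion and $50,000 figures imply when divided; it is not a number stated in this post.

```python
DEVELOPMENT_COST = 5_000_000_000.0   # ~$5 billion per new drug (from the text)
ZERO_PROFIT_PRICE = 50_000.0         # hypothetical zero-margin price (from the text)


def break_even_price(development_cost: float, patients_treated: float) -> float:
    """Price per patient that just recovers development cost (all other costs ignored)."""
    return development_cost / patients_treated


def implied_patients(development_cost: float, price: float) -> float:
    """Number of patients needed to recover development cost at a given price."""
    return development_cost / price


if __name__ == "__main__":
    # Derived for illustration only: the patient count implied by the two figures.
    patients = implied_patients(DEVELOPMENT_COST, ZERO_PROFIT_PRICE)
    print(f"implied patients at break-even: {patients:,.0f}")

    # Hypothetical scenarios: fewer patients push the break-even price up sharply.
    for n in (50_000, 100_000, 500_000):
        print(f"{n:>7,} patients -> ${break_even_price(DEVELOPMENT_COST, n):,.0f} each")
```

Whatever the exact inputs, the shape of the result is the same: the manufacturing cost barely figures in it, and the price is dominated by how much was spent on discovery and how many patients share that bill.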

One thing I find amusing is that the same critics who talk about high drug costs also seem to have some kind of an exaggerated picture of drug company scientists as infallible oracles, discovering drugs with their left hand and then sitting back in their sleek sofas, cackling and waiting for the shower of green. But the truth could not be more different. As a drug discovery scientist I have very few moments when I actually think I am any closer to discovering even a new lead, let alone a new drug (and sadly no, I am also not rolling around in money). As I have quoted another scientist in one of my posts, the reason why drugs cost so much is not because we are greedy, it's because we are stupid. A lot of the time we feel the way Yamamoto must have felt in 1941, knowing that he was about to fight a war of attrition that he could not possibly win.

Drug discovery is one of the most wasteful research activities on the planet and it's all because most of the time we have no idea of how to go about discovering a new drug or what's going to happen when we put it inside the human body. The complexities of human biology thwart us at every stage and luck plays an inordinately large role in our success. Even basic issues in drug discovery - understanding how drugs get past cell membranes for instance - are generally unsolved problems, and the profligate inefficiency of the process would truly shame us if we knew how to do it better. The path from a new idea in pharmaceutical research to an actual drug is akin to a path trodden by a blind man along the edge of a cliff at night; any survival, let alone success, seems like a miracle in retrospect. No drug scientist will admit it, but every drug scientist crosses his or her fingers when a new drug makes it to the market because we are just not smart enough to figure out exactly what it will do in every single patient. That is why most drugs fail in advanced clinical trials, when hundreds of millions of dollars have already been spent on them.

The scientific challenges in drug discovery are a major reason why drugs are so expensive. Of course manufacturing costs are going to be low; once you have wasted so much money on R&D and hit the right solution you only have to make the actual substance (and I don't say "only" lightly here). Would one also decry the low manufacturing costs of integrated circuit boards and ignore the massive, extensive, often unsuccessful developments in R&D that resulted in the system being able to beat Moore's Law? It's the same for drug discovery, except that in this case nobody has any idea how we would even start to approach Moore's Law, let alone beat it. So criticize drug costs and CEO bonuses and profits all you like - and continue to have a much needed debate about healthcare costs in this country - but keep in mind that those numbers are more likely to wildly obfuscate than educate if you take them at face value.

Y Combinator and biotech: The wave of the future?

By Wavefunction on Friday, April 25, 2014

Marcus Wohlsen's book "Biopunk" details the efforts of dedicated garage biotech enthusiasts. Startup incubators like Y Combinator could help bring their efforts to fruition (Image: Steve Zazeski)

Y Combinator is the well-known startup incubator that picks promising computer technology startup ideas from a competition every year and seeds them with a few tens of thousands of dollars and dedicated space in Silicon Valley in return for an equity stake. It has been cited as one of the most successful startup accelerators around and has funded several successful companies like Scribd and Dropbox. I know something about how competitive and interesting their process is since my wife made it to the final round a few years ago.

Now they're doing something new: planning to fund biotech startups. At first glance this might seem like a bad idea since biotech ventures are both riskier as well as much more expensive compared to IT startups; as well-known chemistry blogger Chemjobber rightly pointed out to me on Twitter, the net ROI for biotech has been negative. But I for one find this to be a very promising idea, and I think of Y Combinator's move as very much of a futuristic, put-money-in-the-bank kind of thing. As the president of the accelerator puts it:

"Six years ago, the starting costs for biotech firms would not work with our model: it took millions of dollars to get anywhere. But the landscape has changed. We’ve noticed, over the past year, more and more promising biotech firms asking about Y Combinator. I think there is a real trend for biotech start-ups looking more like software start-ups. In ten years it won’t even be unusual for a biotech firm to begin in the same way a software firm does, and we want to be on the leading edge of that trend."

And I think he's right. There have been two major developments in the last ten years that are very promising for the future of "garage biotech": The first development is the staggering decline in the cost of DNA sequencing which has now even surpassed Moore's Law, and the second development is the rise of DIY biotech, nicely imagined by one of my mentors Freeman Dyson a few years ago and described by Marcus Wohlsen in an entire book which documents the rise of amateur biotech enthusiasts around the world. Thanks to forward-thinking endeavors like Y Combinator, these enthusiasts might finally find an outlet for their creativity.

The fact is that both equipment and techniques for biotechnology have become massively cheap in the last few years, and it is no longer a pipe dream to imagine a biotech Steve Jobs or Bill Gates coming up with a revolutionary new biotechnology-based diagnostic method or therapeutic in their garage. We are entering an age where young people can actually play with genetic engineering, just like they played with software and hardware in the 1970s. The startup funds that Y Combinator plans to inject in its biotech ventures ($120,000) should be more than adequate for buying used equipment and kits off of eBay. For now they are also planning to focus on biotech research benefiting from software, which seems to be the right place to look for cheap(er) ideas. If I had to bet on specific ideas I would probably bet on diagnostic techniques since therapeutics are usually far more complicated and uncertain. I think they should also look at general tools which promise to make complex endeavors like drug discovery faster and easier; as biology has amply demonstrated, revolutions in science and technology are as much about tools as about ideas.

The legal framework as well as the ultimate market incarnation of a biotech diagnostic or therapeutic is of course much more complicated than for a new computer product, but that part kicks in after the idea has gotten off the ground. In fact endeavors such as the one Y Combinator is planning on funding may well accelerate not just the domestication of biotechnology but people's thinking about important legal and social matters related to its widespread use. To enable true garage biotech we need not just a scientific revolution (which is very much visible) but a social, legal and intellectual property revolution (which is not). Each of them can feed off the other.

But for me the most exciting aspect of the announcement is simply the interest it will hopefully fuel in young people with novel biotech ideas. The computer revolution would never have occurred without thousands of young people starting hobbyist groups and tinkering around with electronics and software in their kitchens and garages, away from their parents' prying eyes. Science and technology are very much Darwinian endeavors, so the more ideas that are available for inspection, rejection and growth, the better it will be for the future of a particular technological paradigm. As Linus Pauling said, "To have a good idea you must first have lots of ideas, then throw the bad ones away". Encouraging people to have lots of ideas is precisely what Y Combinator is doing, so I wish them all good luck.

"A Fred Sanger would not survive today's world of science."

By Wavefunction on Wednesday, April 23, 2014

Somehow I missed last year's obituary for double Nobel laureate and bench scientist extraordinaire Fred Sanger by Sydney Brenner in Science. The characteristically provocative Brenner has this to say about a (thankfully) fictional twenty-first century Sanger:

A Fred Sanger would not survive today's world of science. With continuous reporting and appraisals, some committee would note that he published little of import between insulin in 1952 and his first paper on RNA sequencing in 1967 with another long gap until DNA sequencing in 1977. He would be labeled as unproductive, and his modest personal support would be denied. We no longer have a culture that allows individuals to embark on long-term—and what would be considered today extremely risky—projects.

Depressing.

"Designing drugs without chemicals"

By Wavefunction on Friday, April 04, 2014

As usual Derek beat me to highlighting this rather alarming picture from an October 5, 1981 issue of Fortune magazine that I posted on Twitter yesterday. The image is from an article about computer-aided design and it looks both like a major communications failure (chemical-free drugs, anyone?) as well as a massive dollop of hype about computer-aided drug design. In fact the article has been cited by drug designers themselves as an example of the famous hype curve, with 1981 representing the peak of inflated expectations.

It's intriguing to consider both how we are still considering pretty much the same questions about computational drug design that we were in 1981 as well as how much progress we have made on various fronts since then. I posted an extended comment about both these aspects of the issue on Derek's blog so I am just going to copy it below. Bottom line: Many of the fundamental problems are still the same and are unsolved on the general level. However, there has been enough understanding and progress to expect solutions to a wide variety of specific problems in the near future. My own attitude is one of cautious optimism, which in drug discovery is usually the best you can have...

For anyone who wants the exact reference it's the Fortune magazine issue from Oct 5, 1981. The reference is widely considered to be both the time when CADD came to the attention of the masses, as well as a classic lesson in hype. The article itself is really odd since most of it is about computer-aided design in the industrial, construction and aeronautical fields; these are fields where the tools have actually worked exceedingly well. The part about drug design was almost a throwaway with almost no explanation in the text.

Another way to look at the issue is to consider a presentation by Peter Goodford in 1989 (cited in a highly readable perspective by John van Drie, J Comput Aided Mol Des (2007) 21:591–601) in which he laid out the major problems in molecular modeling - things like including water, building homology models, calculating conformational changes, predicting solubility, predicting x-ray conformations etc. What's interesting is that - aside from homology modeling and x-ray conformations - we are struggling with the exact same problems today as we were in the 80s.

That doesn't mean we haven't made any progress though. Far from it in fact. Even though many of these problems are still unsolved on a general level, the number of successful specific examples is on the rise so at some point we should be able to derive a few general principles. In addition we have made a huge amount of progress in understanding the issues, dissecting the various operational factors and in building up a solid database of results. Fields like homology modeling have actually seen very significant advances, although that's as much because of the rise of the PDB, which was enabled through crystallography, as because of accurate sequence comparison and threading algorithms. We are also now aware of the level of validation that our results need to have for everyone to take them seriously. Journals are implementing new standards for reproducibility and knowledge of the right statistical validation techniques is becoming more widespread; as Feynman warned us, hopefully this will stop us from fooling ourselves.

As you mention, however, the disproportionate growth of hardware and processing power relative to our understanding of the basic physics of drug-protein interaction has led to an illusion of understanding and control. For instance, it's quite true that no amount of simulation time and smart algorithms will help us if the underlying force fields are inaccurate and ill-tested. Thus you can beat every motion out of a protein until the cows come home and you still might not get accurate binding energies. That being said, we also have to realize that every method's success needs to be judged in terms of a particular context and scale. For instance, an MD simulation on a GPCR might get some of the conformational details of specific residues wrong but may still help us rationalize large-scale motions that can be compared with experimental parameters. Some of the more unproductive criticism in the field has come from people who have the wrong expectations of a particular method to begin with.

Personally I am quite optimistic about the progress we have made. Computational drug design has actually followed the classic Gartner hype cycle, and it's only in the 2000s that we have reached that cherished plateau of realistic expectations. The hope is that at the very least this plateau will have a small but consistent positive slope.

Reproducibility in molecular modeling research

By Wavefunction on Tuesday, April 01, 2014

When I was a graduate student, a pet peeve that my advisor and I shared was the extreme unlikelihood of finding 3D coordinates of molecules in papers. A typical paper on conformational analysis would claim to derive a solution conformation for a druglike molecule and present a picture of the conformation, but the lack of 3D coordinates or an SDF or PDB file would make detailed inspection of the structure impossible. Sadly this was in line with a more general lack of structural information in molecular modeling or structural chemistry articles.

Last year Pat Walters (Vertex) wrote an editorial in the Journal of Chemical Information and Modeling which I have wanted to highlight for a while. Walters laments the general lack of reproducibility in the molecular modeling literature and notes that while reproducibility is taken for granted in experimental science, there has been an unusual dearth of it in computational science. He also makes it clear that authors are running out of excuses for not making things like original source code and 3D coordinates available; the pretext of a lack of standard hardware and software platforms no longer holds.

It is possible that the wide array of computer hardware platforms available 20 years ago would have made supporting a particular code base more difficult. However, over the last 10 years, the world seems to have settled on a small number of hardware platforms, dramatically reducing any sort of support burden. In fact, modern virtual machine technologies have made it almost trivial to install and run software developed in a different computing environment...

When chemical structures are provided, they are typically presented as structure drawings that cannot be readily translated into a machine-readable format. When groups do go to the trouble of redrawing dozens of structures, it is almost inevitable that structure errors will be introduced. Years ago, one could have argued that too many file formats existed and that it was difficult to agree on a common format for distribution. However, over the last 10 years, the community seems to have agreed on SMILES and MDL Mol or SD files as a standard means of distribution. Both of these formats can be easily processed and translated using freely available software such as OpenBabel, CDK, and RDKit. At the current time, there do not seem to be any technical impediments to the distribution of structures and data in electronic form.

Indeed, even experimental scientists are now quite familiar with the SD and SMILES file formats, and standard chemical illustration software like ChemDraw can readily handle them, so nobody should really claim to be hobbled by the lack of a standard communication language.
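To give a sense of how low the barrier really is, here is a minimal sketch of writing 3D coordinates as an SD record using nothing but the Python standard library. It assumes the fixed-width V2000 mol block layout as I understand it; the helper names (Atom, mol_block, sd_record), the "sketch" program label and the SOURCE data field are all made up for illustration, and the water geometry is an illustrative textbook-style value, not something taken from a paper. In real work a toolkit like RDKit or OpenBabel should do this job.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    symbol: str
    x: float
    y: float
    z: float


def mol_block(name, atoms, bonds):
    """Build a V2000 mol block; bonds are (i, j, order) with 1-based atom indices."""
    lines = [
        name,                                         # header line 1: molecule name
        "  " + "sketch".rjust(8) + " " * 10 + "3D",   # header line 2: program label, 3D flag
        "",                                           # header line 3: free comment
        # counts line: atom count, bond count, unused fields, version tag
        f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000",
    ]
    for a in atoms:
        # x, y, z in 10.4 fixed-width fields, then the element symbol
        lines.append(
            f"{a.x:10.4f}{a.y:10.4f}{a.z:10.4f} {a.symbol:<3s} 0" + "  0" * 11
        )
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines)


def sd_record(name, atoms, bonds, properties):
    """Append SD data items and the '$$$$' record terminator to a mol block."""
    parts = [mol_block(name, atoms, bonds)]
    for key, value in properties.items():
        parts += [f">  <{key}>", str(value), ""]
    parts.append("$$$$")
    return "\n".join(parts) + "\n"


# Illustrative water geometry in angstroms (not from any published structure).
water = [
    Atom("O", 0.0, 0.0, 0.1173),
    Atom("H", 0.0, 0.7572, -0.4692),
    Atom("H", 0.0, -0.7572, -0.4692),
]
record = sd_record(
    "water", water, [(1, 2, 1), (1, 3, 1)], {"SOURCE": "illustrative geometry"}
)
print(record)
```

A file of such records, attached as supporting information, is all a reader would need to open the claimed conformation in a viewer and inspect it.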

Walters proposes some commonsense guidelines for aiding reproducibility in computational research, foremost of which is to include source code. These guidelines mirror some of the suggestions I had made in a previous post. Unfortunately explicit source code cannot always be made available for proprietary reasons, but as Walters notes, executable versions of this code can still be offered. The other guidelines are also simple and should be required for any computational submissions: mandatory provision of SD files; inclusion of scripts and parameter files for verification; a clear description of the method, preferably with an example; and an emphasis by reviewers on reproducibility aspects of a study.
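The "scripts and parameter files for verification" part can be made concrete with a small manifest that travels with a submission. What follows is only a sketch of one possible approach, not anything prescribed in the editorial: it records the run parameters, the software environment and a SHA-256 checksum of each input file, so that a reader can confirm they are reprocessing exactly the same data. The function names, the file name ligand.sdf and the parameter values ("example-ff", 300 K) are hypothetical.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files, parameters):
    """Collect environment details, run parameters and input checksums."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": parameters,
        "files": {Path(f).name: sha256_of(f) for f in files},
    }


# Demo on a throwaway file; real use would point at the actual inputs.
with tempfile.TemporaryDirectory() as tmp:
    ligand = Path(tmp) / "ligand.sdf"
    ligand.write_bytes(b"placeholder structure record\n")
    manifest = build_manifest(
        [ligand],
        {"force_field": "example-ff", "temperature_K": 300},  # hypothetical values
    )
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(manifest_json)
```

Anyone rerunning the study can regenerate the checksums and compare them with the published manifest before arguing about the results.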

As has been pointed out by many molecular modelers in the recent past - Ant Nicholls from OpenEye for instance - it's hard to judge even supposedly successful molecular modeling results because the relevant statistical validation has not been done and because there is scant method comparison. At the heart of validation however is simply reproduction, because your opinion is only as good as my processing of your data. The simple guidelines from this editorial should go some way in establishing the right benchmarks.

About Me

Ashutosh (Ash) Jogalekar is a scientist and science writer based in the San Francisco Bay Area. He has been blogging at the “Curious Wavefunction” blog for more than fifteen years, and in this capacity has written for several organizations including Nature, Scientific American and the Lindau Meeting of Nobel Laureates. His main interests include the history and philosophy of science and technology, especially physics and mathematics. Professionally he is trained as a chemist and works at the intersection of chemistry, technology and drug discovery.
