Field of Science

Open Borders

The traveler comes to a divide. In front of him lies a forest. Behind him lies a deep ravine. He is sure about what he has seen but he isn’t sure what lies ahead. The mostly barren shreds of expectations or the glorious trappings of lands unknown, both are up for grabs in the great casino of life.
First came the numbers, then the symbols encoding the numbers, then the symbols encoding the symbols. A festive smattering of metamaniacal creations from the thicket of conjectures populating the hive mind of creative consciousness. Even Kurt Gödel could not grasp the final import of the generations of ideas his self-consuming monster creation would spawn in the future. It would plough a deep, indestructible furrow through biology and computation. Before and after that it would lay men’s ambitions of conquering knowledge to final rest, like a giant thorn that splits open dreams along their wide central artery.
Code. Growing mountains of self-replicating code. Scattered like gems in the weird and wonderful passage of spacetime, stupefying itself with its endless bifurcations. Engrossed in their celebratory outbursts of draconian superiority, humans hardly noticed it. Bits and bytes wending and winding their way through increasingly Byzantine corridors of power, promise and pleasure. Riding on the backs of great expectations, bellowing their heart out without pondering the implications. What do they expect when they are confronted, finally, with the picture-perfect contours of their creations, when the stagehands have finally taken care of the props and the game is finally on? Shantih, shantih, shantih, I say.
Once the convoluted waves of inflated rational expectations subside, reality kicks in, in ways that only celluloid delivered in the past. Machines learning, loving, loving the learning that other machines love to do was only a great charade. The answer arrives in a hurry, whispered and then proudly proclaimed by the stewards of possibility. We can never succeed because we don’t know what success means. How doth the crocodile eat the tasty bits if he can never know where red flesh begins and the sweet lilies end? Who will tell the bards what to sing if the songs of Eden are indistinguishable from the last gasps of death? We must brook no certainty here, for the fruit of the tree can sow the seeds of murderous doubt.
Every so often, although not as often as our eager minds would like, science uncovers connections between seemingly unrelated phenomena that point to wholly new ways forward. Last week, a group of mathematicians and computer scientists uncovered a startling connection between logic, set theory and machine learning. Logic and set theory are among the purest branches of mathematics. Machine learning is among the most applied of mathematics and statistics. The scientists found a connection between two very different entities in these very different fields – the continuum hypothesis in set theory and the theory of learnability in machine learning.
The continuum hypothesis is related to two different kinds of infinities found in mathematics. When I first heard the fact that infinities can actually be compared, it was as if someone had cracked my mind open by planting a firecracker inside it. There is the first kind of infinity, the “countable infinity”, which is defined as an infinite set that can be put in one-to-one correspondence with the set of natural numbers. Then there’s the second kind of infinity, the “uncountable infinity”, a gnarled forest of limitless complexity, defined as an infinity that cannot be so mapped. The real numbers are an example of such an uncountable infinity. One of the staggering results of mathematics is that the infinite set of real numbers is somehow “larger” than the infinite set of natural numbers. The German mathematician Georg Cantor supplied the proof of the uncountable nature of the real numbers, sometimes called the “diagonal proof”. It is like a beautiful gem that has suddenly fallen from the sky into our lap; reading it gives one intense pleasure.
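Cantor's diagonal trick is simple enough to sketch in a few lines of code. The snippet below is a toy illustration of my own construction, not anything from the proof literature: represent each infinite binary sequence as a function from index to bit, and build a new sequence guaranteed to differ from every sequence in any claimed enumeration.

```python
# Toy sketch of Cantor's diagonal argument. Given any claimed enumeration
# of infinite binary sequences (each sequence is a function index -> bit),
# construct a sequence that differs from the n-th sequence at position n,
# so it cannot appear anywhere in the enumeration.

def diagonal(enumeration):
    """Return a sequence (as a function) absent from `enumeration`.

    `enumeration(n)` yields the n-th listed sequence; each sequence s
    returns s(i) in {0, 1}.
    """
    return lambda n: 1 - enumeration(n)(n)

# A sample "enumeration": the n-th sequence is the binary expansion of n,
# read least-significant bit first -- an honest list of *some* sequences.
enum = lambda n: (lambda i: (n >> i) & 1)

d = diagonal(enum)
# d disagrees with every listed sequence at the diagonal position:
assert all(d(n) != enum(n)(n) for n in range(1000))
```

The same one-line flip works against any enumeration whatsoever, which is the heart of the proof that the reals cannot be counted.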
The continuum hypothesis asks whether there is an infinity whose size is between the countable infinity of the natural numbers and the uncountable infinity of the real numbers. The mathematicians Kurt Gödel and – more notably – Paul Cohen were unable to prove whether the hypothesis is correct or not, but they were able to prove something equally or even more interesting: that the continuum hypothesis cannot be decided one way or another within the standard axiomatic system of set theory. Thus, there is a world of mathematics in which the hypothesis is true, and there is one in which it is false. And our current understanding of mathematics is consistent with both these worlds.
Fifty years later, computational mathematicians have found a startling and unexpected connection between the truth or falsity of the continuum hypothesis and the idea of learnability in machine learning. Machine learning seeks to learn the details of a small set of data and make correlative predictions for larger datasets based on these details. Learnability means that an algorithm can learn parameters from a small subset of data and accurately extrapolate to the larger dataset based on these parameters. The recent study found that whether learnability is possible for arbitrary, general datasets depends on whether the continuum hypothesis is true. If it is true, then one will always find a subset of data that is representative of the larger, true dataset. If the hypothesis is false, then one will never be able to pick such a subset. In fact, in that case, only the true dataset represents the true dataset, much as only an accused man can best represent himself.
This new result extends both set theory and machine learning into urgent and tantalizing territory. If the continuum hypothesis is false, it means that we will never be able to guarantee being able to train our models on small data and extrapolate to large data. Specific models will still be able to be built, but the general problem will remain unsolvable. This result can have significant implications for the field of artificial intelligence. We are entering an age where it’s possible to seriously contemplate machines controlling other machines, with human oversight not just impossible in practice but also in principle. As code flows through the superhighway of other code and groups and regroups to control other pieces of code, machine learning algorithms will be in charge of building models based on existing data as well as generating new data for new models. Results like this one might make it impossible for such self-propagating intelligent algorithms to guarantee being able to solve all our problems, or to solve their own problems to imprison us. The robot apocalypse might be harder than we think.
As Jacob Bronowski memorably put it in his “The Ascent of Man”, one of the major goals of science in the 20th century was to establish the certainty of scientific knowledge. One of the major achievements of science in the 20th century was to prove that this goal is unattainable. In physics, Heisenberg’s uncertainty principle put a fundamental limit on measurement in the world of elementary particles. Einstein’s theory of relativity established the speed of light as a fundamental limit on how fast matter and information can travel. But most significantly, it was Gödel’s famous incompleteness theorem that put a fundamental limit on what we could prove and know even in the seemingly impregnable world of pure, logical mathematics. Even in logic, that bastion of pure thought, where conjectures and refutations don’t depend on any quantity in the real world, we found that there are certain statements whose truth might forever remain undecidable.
Now the same Gödel has thrown another wrench in the machine, asking us whether we can indeed hold infinity and eternity in the palm of our hands. As long as the continuum hypothesis remains undecidable, so will the ability of machine learning to transform our world and seize power from human beings. And if we cannot accomplish that feat of extending our knowledge into the infinite unknown, instead of despair we should be filled with the ecstatic joy of living in an open world, a world where all the answers can never be known, a world forever open to exploration and adventure by our children and grandchildren. The traveler comes to a divide, and in front of him lies an unyielding horizon.

Modular complexity, and reverse engineering the brain

The Forbes columnist Matthew Herper has a profile of Microsoft co-founder Paul Allen, who has placed his bets on a brain institute whose goal is to map the brain...or at least the visual cortex. His institute is engaged in charting the sum total of neurons and other working parts of the visual cortex and then mapping their connections. Allen is not alone in doing this; there are projects like the Connectome at MIT which are trying to do the same thing (and the project's leader Sebastian Seung has written an excellent book about it).

Well, we have heard echoes of reverse engineered brains from more eccentric sources before, but fortunately Allen is one of those who does not believe that the singularity is near. He also seems to have entrusted his vision to sane minds. His institute's chief science officer is Christof Koch, a former Caltech professor and longtime collaborator of the late Francis Crick, who started at the institute this year. Just last month Koch penned a perspective in Science which points out the staggering challenge of understanding the connections between all the components of the brain; the "neural interactome", if you will. The article is worth reading if you want to get an idea of how simple numerical arguments illuminate the sheer magnitude of mapping out the neurons, cells and proteins that make up the wonder that's the human brain.

Koch starts by pointing out that calculating the interactions between all the components in the brain is not the same as computing the interactions between all atoms of an ideal gas, since the interactions are between different kinds of entities and are therefore not identical. Instead, he proposes, we have to use the Bell number B(n), which reminds me of the set partitions I learnt when I was sleepwalking through set theory in college. Briefly, for n objects, B(n) is the number of ways those objects can be partitioned into non-empty groups (pairs, triples, quadruples and so on). Thus, when n = 3, B(n) is 5. Not surprisingly, B(n) grows faster than exponentially with n, and Koch points out that B(10) is already 115,975. If we think of a typical presynaptic terminal with its 1,000 proteins or so, B(n) already starts giving us heartburn. For something like the visual cortex, where n = 2 million, B(n) would be prohibitive. And as the graph demonstrates, for more than about 10^5 components the amount of time spirals out of hand at warp speed. Koch then uses a simple calculation based on Moore's law to estimate the time needed for "sequencing" these interactions. For n = 2 million, the time would be of the order of 10 million years.
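Koch's Bell-number arithmetic is easy to check for yourself. Here is a minimal Python sketch using the Bell-triangle recurrence (my own implementation, not anything from Koch's article) that reproduces the numbers quoted above:

```python
# Bell numbers count the ways to partition a set of n objects into
# non-empty groups. The Bell triangle computes them with plain addition:
# each row starts with the last entry of the previous row, and each
# subsequent entry is the previous entry plus its neighbor in the row above.

def bell(n):
    """Return B(n), the n-th Bell number, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]           # new row begins with last entry above
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[0]                      # B(n) is the first entry of row n

print(bell(3))   # 5 ways to group 3 components
print(bell(10))  # 115975, the value Koch quotes
```

Even without evaluating B(2 million), watching `bell(n)` explode for n in the dozens makes the "complexity brake" viscerally clear.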

And this considers only the 2 million neurons in the visual cortex; it doesn't even consider the proteins and cells which might interact with the neurons on an individual basis. Looks like we can rapidly see the outlines of what Allen himself has called the "complexity brake". And this one seems poised to make an asteroid-sized impact.

So are we doomed in trying to understand the brain, consciousness and the whole works? Not necessarily, argues Koch. He gives the example of electronic circuits, where individual components are grouped into separate modules. If you bunch a number of interacting entities together into a separate module, the complexity of the problem drops, since you now only have to calculate interactions between modules. The key question then is: is the brain modular? Common sense would have us think it is, but it is far from clear how exactly we can define the modules. We would also need a sense of the minimal number of modules in order to calculate the interactions between them. This work is going to take a long time (hopefully not as long as that for B(2 million)), and I don't think we are going to have an exhaustive list of the minimal number of modules in the brain any time soon, especially since these are going to be composed of different kinds of components and not just one kind.
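To see why grouping helps so dramatically, here is a toy back-of-the-envelope comparison in the same spirit. The grouping scheme and the numbers are my own illustrative choices, not Koch's: if n components are split into m modules, you only need the partitions of the m modules plus the partitions inside each module, instead of the partitions of all n components at once.

```python
# Toy illustration of the modularity argument: partition bookkeeping for
# a flat system versus the same system grouped into modules.

def bell(n):
    """Return B(n), the n-th Bell number, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[0]

n, module_size = 20, 5
m = n // module_size                       # 4 modules of 5 components

flat = bell(n)                             # all 20 components at once
modular = bell(m) + m * bell(module_size)  # between-module + within-module

print(f"flat:    B({n}) = {flat}")
print(f"modular: B({m}) + {m} x B({module_size}) = {modular}")
```

Even for a trifling 20 components the flat count runs into the tens of trillions while the modular count stays in the hundreds, which is the whole appeal of finding the brain's real modules.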

Any attempt to define these modules is going to run into the problems of emergent complexity that I have occasionally written about. Two neurons plus one protein might behave differently from two neurons plus two proteins in unanticipated ways. Nevertheless this goal seems far more attainable in principle than calculating every individual interaction, and that's probably the reason Koch left Caltech to join the Allen Institute in spite of the pessimistic calculation above. If we can ever get a sense of the modular structure of the brain, we may have at least a fighting chance to map out the whole neural interactome. I am not holding my breath too hard, but my ears will be wide open.

Image source: Science magazine

This year's 100 book odyssey

I finally achieved my goal of reading a hundred books this year (105 to be exact, although I will probably take a break for the rest of the year). It ticks an important item off my bucket list. The experience has provided an immense sense of satisfaction, especially since time constraints may make it very hard to repeat for a while. I am particularly happy that some of these volumes were real tomes that I was lugging around everywhere.

As usual, the list was heavily biased toward non-fiction, with most of the books covering history, science and philosophy. I do want to increase my share of fiction next year, and on the non-fiction (or "verity", as Richard Rhodes calls it) front want to read more about AI, technology, biology and economics.

Is it hard to read a hundred books in one year, essentially two books a week? Not particularly, if you try to grab every spare minute (after work and family, that is) for doing it; and I am not even a fast reader. Apart from the usual times (early morning, after work, bedtime, in the bathroom), I used to read in Ubers, in trains, in lines in coffee shops, in restaurants when dining alone, while waiting for friends to show up, in stores while the wife was shopping, at the DMV when I had to wait two hours for my driver's license, during the occasional walk or hike, and during lunch break at work (but don’t tell anyone!). I used to read paper books when possible, but read on my Kindle and on my phone when nothing else was available. I could have read even more had I listened to audiobooks, but I find it hard to concentrate on the spoken as opposed to the written word. Basically you try to cram in a few words every time you can. Sometimes this does lead to fragmented reading, but you gradually get used to it. And at the end of it you feel uniquely well-read, so it’s absolutely worth it, if for no other reason than as a personal challenge.

 Here’s the list in case someone wants holiday book suggestions, starting with the most recent volumes (starred volumes indicate favorites that were reread). Happy Reading! 
1. Jill Lepore – These Truths: A History of the United States. Perhaps the best, most even-handed single volume on American history that I have read.
2. Don Norman – The Design of Everyday Things
3. Edward Gibbon – Decline and Fall of the Roman Empire, Vol. 1. A monumental work that really needs to be savored like fine wine. Because of its archaic style and long sentences it’s not easy reading, but enlightenment comes to those who are patient. I don’t know if I will ever get through all six volumes, but one can try.
4. Isaac Asimov – Asimov’s Mysteries*
5. Charles Darwin – The Origin of Species*
6. Candice Millard – Destiny of the Republic. A great thriller about the sadly short-lived presidency of the brilliant James Garfield and his assassination. The book is as much about medical ignorance as about anything else; even though Joseph Lister had demonstrated the value of antisepsis, the doctors of the day resorted to crude measures like sticking their fingers into the wound to find the bullet. It was infection that did Garfield in, not the bullet.
7. Leslie Berlin – Troublemakers: Silicon Valley’s Coming of Age. A fantastic book that brings to life some of the underappreciated characters who, in just eight years, pioneered six transformational industries: video games, personal computing, biotechnology, venture capital, semiconductors and communications.
8. Valley of Genius – Another great and highly unusual book about Silicon Valley that patches together sound bites from interviews with scores of Silicon Valley pioneers conducted over thirty years. It makes for a singular experience in which one person begins where another trails off, and you get multiple perspectives on the same people and events.
9. Michael Hiltzik – Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age. An excellent account of what was, for a brief period, the most innovative computer science lab in the world, giving us everyday inventions like the mouse, the GUI and the windows desktop.
10. John Carreyrou – Bad Blood. Reads like a soap opera. No wonder it’s being turned into a Hollywood movie with Jennifer Lawrence.
11. Abraham Pais – Niels Bohr’s Times
12. Abraham Pais – Subtle is the Lord*
13. Niels Bohr – Atomic Physics and Human Knowledge. There are few wiser men in human history than Niels Bohr.
14. Niels Bohr – Atomic Theory and the Description of Nature
15. Doris Kearns Goodwin – Leadership in Turbulent Times. You think these times are politically fraught? Just ask Lincoln or FDR.
16. William Aspray – John von Neumann and the Origins of Modern Computing
17. Oxtoby and Pettis – John von Neumann
18. Ray Monk – How to Read Wittgenstein
19. Michael Lewis – The Fifth Risk. A great book on how crucial government functions are in keeping Americans alive and thriving every single day.
20. John Wesley Powell – The Exploration of the Colorado River and Its Canyons. I read this first-hand account during a trip to the Grand Canyon. Powell was really the first American to explore what was then a land inhabited only by Natives, and the courage and resilience of his team were amazing (he lost several men).
21. Michael Beschloss – Presidents of War
22. Charles Krauthammer – Things that Matter
23. Charles Krauthammer – The Point of It All. One of the last conservatives who was a well-read and eloquent intellectual.
24. Venki Ramakrishnan – Gene Machine: The Race to Decipher the Secrets of the Ribosome
25. Charles Darwin – The Autobiography of Charles Darwin. Darwin may have seemed conservative in his demeanor, but this book really brings out both his radical thinking and his sense of humor, especially about religion.
26. Steven Weinberg – Third Thoughts
27. Richard Powers – The Overstory. In one word – spellbinding. Powers is the most creative writer alive in my opinion. It made me fall in love with redwood trees.
28. Craig Childs – House of Rain: Tracking a Vanished Civilization Across the American Southwest
29. Simon Winchester – The Perfectionists. A superb history of precision engineering.
30. Alan Lightman – Searching for Stars on an Island in Maine. A meditation on science, time, existence and other topics from one of the most literary and poetic science writers of his generation.
31. Sabine Hossenfelder – Lost in Math: How the Search for Beauty Leads Physics Astray
32. James Gleick – Chaos* (This may be the fourth or fifth time I have read this landmark book.)
33. Brian VanDeMark – Road to Disaster. A new approach to the Vietnam War that sees it through the lens of theories of organizational behavior and cognitive biases of the kind explored by Amos Tversky and Daniel Kahneman.
34. Brian Keating – Losing the Nobel Prize. A unique first-hand account of a false alarm (cosmic inflation) that led the author very close to a Nobel Prize.
35. William Perry – My Journey to the Nuclear Brink
36. Jeremy Bernstein – A Bouquet of Dyson
37. Loren Eiseley – The Unexpected Universe
38. Loren Eiseley – The Immense Journey
39. Loren Eiseley – The Firmament of Time. If you think Carl Sagan is eloquent and poetic about how vast the cosmos is and how insignificant and yet profound life is, read Loren Eiseley.
40. Ann Finkbeiner – The Jasons: The Secret History of America’s Postwar Elite*
41. Tom Holland – Persian Fire. A gripping account of the Greco-Persian Wars. Marathon, Salamis, Thermopylae, all come alive on these pages.
42. Kai-Fu Lee – AI Superpowers
43. Adam Becker – What is Real? A case for David Bohm’s interpretation of quantum theory.
44. Carlo Rovelli – The Order of Time
45. Oliver Sacks – The River of Consciousness. A moving posthumous collection of Sacks’s eclectic writing; on Darwin, on consciousness, on plant biology and neurology and on time.
46. Oliver Sacks – Uncle Tungsten: Memories of a Chemical Boyhood
47. Sy Montgomery – The Soul of an Octopus. The description of octopus intelligence in this book impressed me so much that I actually stopped eating octopus.
48. Kevin Kelly – What Technology Wants
49. Jon Gertner – The Idea Factory*
50. Arieh Ben-Naim – Myths and Verities in Protein Folding
51. A. P. French – Einstein: A Centenary Volume. Fond recollections of a great physicist and human being by friends, colleagues and students.
52. Intercom on Product Management
53. John McPhee – Annals of the Former World. A towering history of the geology of the United States, written by one of the best non-fiction writers in America. No one else has the eye for observational detail for both people and places that McPhee does.
54. Oliver Sacks – On the Move*
55. William Prescott – History of the Conquest of Mexico, Vols. 1 and 2. Vivid, engaging, monumental; hard to believe Prescott wasn’t there.
56. Joel Shurkin – True Genius. A biography of physicist and engineer Richard Garwin, one of the very few people to whom the label genius can be applied.
57. Lillian Hoddeson and Vicki Daitch – True Genius*. Another true genius: John Bardeen, the only person to win two Nobel Prizes for physics. The definitive treatment of his life and contributions to world-changing inventions like the transistor and superconductivity.
58. Cormac McCarthy – Blood Meridian. The greatest novel I have read. Breathtaking in its raw beauty and Biblical violence.
59. John Archibald Wheeler – At Home in the Universe. A collection of essays from a physicist who was also a great philosopher and poet.
60. John Wesley Powell – The Exploration of the Colorado River and Its Canyons. The first exploration of the area around the Grand Canyon by a white man, Powell’s journey became known for its harrowing loss of life and adventures.
61. Frank McCourt – Angela’s Ashes. Achingly beautiful, heartbreaking account of Irish poverty.
62. Adam Becker – What is Real? Fascinating account of quantum physics and reality, one that challenges the traditional Copenhagen interpretation and gives voice to David Bohm, John Wheeler and others.
63. Steven Weinberg – Facing Up: Science and its Cultural Adversaries
64. Benjamin Hett – The Death of Democracy. The best account I have read of the details of how Hitler came to power. Read and learn.
65. George Trigg – Landmark Experiments in Twentieth Century Physics. The nuts and bolts of some of the most important physics experiments of the last hundred years, from the oil drop experiment to the Lamb shift.
66. Albert Camus – The Stranger
67. Norman Macrae – John von Neumann
68. Werner Heisenberg – Physics and Philosophy*
69. David Schwartz – The Last Man Who Knew Everything. A fine biography of Enrico Fermi, the consummate scientist.
70. A. Douglas Stone – Einstein and the Quantum. Einstein’s opposition to quantum theory is well known; his monumental contributions to the theory are not as well appreciated. Stone fixes this gap.
71. John Cheever – Cheever: The Collected Short Stories. Melancholy, beautiful prose describing the quiet despair of New England upper class suburbia.
72. Arnold Toynbee – A Study of History
73. Aldous Huxley – Brave New World
74. Charles Darwin – Insectivorous Plants
75. William Faulkner – As I Lay Dying
76. Lillian Hoddeson – Critical Assembly*
77. Herbert York – The Advisors: Oppenheimer, Teller and the Superbomb
78. Joseph Ellis – Founding Brothers*. Worth reading for the wisdom, insights and follies that the founding fathers displayed in erecting a great nation.
79. Noam Chomsky – Requiem for the American Dream. Everyone’s perpetual wet blanket, with his incomparable combination of resoundingly true diagnoses and befuddling philosophy.
80. Colin Wilson – Beyond the Occult. This book had mesmerized me as a child. Now I am more critical, but some of the case studies are fascinating.
81. John Toland – Adolf Hitler. Still stands as the most readable Hitler biography in my opinion.
82. Ron Chernow – Grant. Fantastic. Brings to life the towering, plainspoken, determined man who won the Civil War and became president. Chernow is evenhanded in his treatment of Grant’s drinking problems and corruption-riddled presidency, but he clearly loves his subject. And who wouldn’t? That kind of simplicity and grassroots activism seems to be from another planet these days.
83. Priscilla McMillan – The Ruin of J. Robert Oppenheimer*
84. Paul Horgan – Great River: The Rio Grande in North American History. An epic history of the Indians, Spaniards and Anglo-Americans who settled the great states around the Rio Grande.
85. Toby Huff – The Rise of Early Modern Science: Islam, China and the West*. A superb examination of why modern science developed in Europe and not other parts of the world. Huff’s main explanation centers on the European legal and scientific tradition derived from Roman law and Greek philosophy, both of which encouraged scientific inquiry. Both these elements were crucially missing in Islamic countries, China and India.
86. A. P. French – Niels Bohr: A Centenary Volume. A glowing set of tributes to a great physicist and human being.
87. Stephen Kotkin – Stalin, Volume 1: Paradoxes of Power. Monumental biography of the tyrant, although not easy going because of the dense detail and slightly academic writing. Kotkin’s is likely to be the last word, though.
88. Edgar Heilbronner and Jack Dunitz – Reflections on Symmetry. A beautiful exploration of symmetry in chemistry, physics, biology, architecture and other scientific and human endeavors.
89. Chuck Hansen – The Swords of Armageddon, Vols. 2 and 3. The definitive history of US nuclear weapons. Everything you can possibly read about their details without having the feds show up at your doorstep (as they showed up at Hansen’s door many times, without ever being able to prove that he had access to non-public information).
90. David Kaiser – Drawing Theories Apart: The Dispersion of Feynman Diagrams in Postwar Physics*. A fascinating socio-scientific exploration of how a key scientific idea makes its way, by fits and starts, to eventual acceptance in the scientific community.
91. Cormac McCarthy – Child of God. Highly disturbing story of a man on the fringes, filled with dark humor. Not McCarthy’s best in my opinion, and I wouldn’t recommend it to those with weak stomachs.
92. Franz Kafka – The Metamorphosis
93. Lawrence Badash – Reminiscences of Los Alamos
94. Herodotus – The Histories. The account of the Greco-Persian Wars is especially rousing, and Herodotus of course made a seminal contribution to history by treating it as a contemporary account rather than divine, untouchable past.
95. Freeman Dyson – Maker of Patterns: An Autobiography Through Letters
96. Iris Chang – The Chinese in America. An amazing account of how Chinese immigrants came to the United States and laid down roots here in the face of poverty, cultural challenges, political upheavals and discrimination.
97. Robert Divine – Blowing on the Wind
98. Richard Rhodes – Energy: A Human History. A history of energy transitions, focusing on the human beings, some well known and others obscure, who engineered them. As usual, Rhodes is highly adept both at digging up fascinating individual stories and at deciphering the big picture.
99. Norman Cohn – Warrant for Genocide: The Myth of the Jewish World Conspiracy and the Protocols of the Elders of Zion
100. Jim Holt – When Einstein Walked with Gödel. A series of essays on math, philosophy, genius and the nature of reality.
101. Anil Ananthaswamy – Through Two Doors at the Same Time. Captivating account of the essential nature and many ramifications of one of the most simply stated and yet perplexing, deep and mind-bending experiments of all time.
102. Marcus Chown – The Magic Furnace: The Search for the Origin of Atoms
103. E. O. Wilson – The Meaning of Human Existence
104. David Quammen – The Tangled Tree
105. V. S. Naipaul – A House for Mr. Biswas

Will CADD ever become relevant to drug discovery?

Let’s face some hard facts first. The good news is that in the last twenty years or so, computer-aided drug design (CADD or computational chemistry) has become a relatively standard part of drug discovery, and even in organizations not formally employing CADD scientists, some form of computation – sometimes as simple as property calculation or visualization – is used.

The bad news is that, as it is largely practiced now, CADD is not considered a core part of the drug discovery process. Period. Rather, it’s considered a supporting part. This situation has only marginally improved in twenty years. Biology and synthetic chemistry are still the core driving disciplines of drug discovery and will remain so for the foreseeable future. A marketed drug that benefits extravagantly from CADD comes along once every decade or so at best (good luck finding another HIV protease inhibitor, for instance). As integral as CADD scientists consider themselves, and as impressive as they think the rotating pictures on their screens are, the fact that their peers consider what they do marginally relevant to the big picture is a bitter pill that needs to be swallowed.

Why is CADD not considered a core part of drug discovery, even though it is now part of every organization’s drug discovery portfolio in one form or another? There are two reasons, one related to fact and the other related to perception. The fact is that drug discovery and biology are still very much experimental disciplines, and you simply can’t compute your way to a drug; in this sense, CADD simply is not good enough yet. The perception of CADD is that unlike synthetic chemistry and biology which generate products (compounds in bottles, assays, animal models etc.), CADD generates ideas. In that sense, experimental scientists often have the same view of CADD scientists that Ernest Rutherford had of theorists: "They play games with their symbols while we discover the mysteries of the universe."

Now one of the advantages of CADD is that its ideas can be useful because of their generality (e.g. “Put a hydrophobic group here.”), but it’s this very generality that often counts against CADD. Synthetic chemists in particular want to see specific suggestions, and while plenty of them appreciate general guidance, it doesn’t quite have the same impact. The fact that CADD scientists also often make suggestions that are considered obvious – filling hydrophobic pockets is a standard one – does even less to endear the discipline to other scientists.

This rather undervalued reputation impacts both the discipline and its scientists. The market for CADD scientists is a niche one, with low demand and low supply, and a small organization will typically have only one or two CADD scientists. But the supporting role of CADD means that when there are layoffs, a CADD scientist is far more likely to go than a cell biologist or an assay developer. Because CADD scientists are not abundant, they also may not have direct reports. All this means that CADD scientists may not have the ear of senior management. They thus rarely occupy senior managerial positions like CSO, VP of drug discovery or CTO; if you take a random sample of top management at both small and large drug discovery organizations, you will very rarely find scientists with a background in CADD in these positions.

The supporting role of CADD, combined with the limited influence and understanding CADD scientists command, means that they often have to fight uphill battles to make their voices heard. While this is a good character-building experience, it doesn’t necessarily contribute to career progression or to the growing influence of the field. So how can CADD make a bigger impact on both the facts of drug discovery and its perception?

At least part of the answer – as unpalatable as it seems to a lot of drug hunters today – does involve large datasets and machine learning. I am not saying that ML or some form of AI will have the kind of immediate, hype-heavy, transformational impact that often seems all-too-apparent through Silicon Valley sunglasses. But I am saying that the impact of ML and AI on drug discovery will inexorably increase as time passes, and anyone who simply chooses to dismiss these technologies will be left behind.

There are some important reasons why I believe this. The most important reason perhaps is that traditional CADD, as it pertains to protein structure and physics-based algorithms, has fallen far short of its promise. Part of the problem was the hype in the late 80s and early 90s, but a more fundamental problem is that the data is often not good enough, and even when it’s good it can impact only a very limited part of early drug discovery. The biggest problem is that, even thirty years after a leading CADD scientist laid out the challenges inherent in using CADD to predict binding affinity (protein flexibility, water behavior, crowding effects), we are still struggling with the same issues. Ignorance of the basic physics of protein-ligand interactions, especially as it pertains to making reliable, general predictions, is still rife.

I am not saying that physics-based approaches haven’t worked or improved at all over the last two decades – some applications, like shape-based similarity searches and pose prediction, have certainly shown impressive progress – but the fact remains that we are still swimming in a sea of ignorance: either ignorance born of gaps in basic scientific understanding, or ignorance born of our failure to engineer our current understanding into viable approaches. Equally important, even when we understand these factors, they will be applicable to a very narrow part of drug development, namely improving the binding affinity between a single small molecule and (usually) a single protein. At best, CADD as we know it will design good ligands, not drugs.

Approaches based on large datasets, in contrast, are agnostic about the physics of protein-ligand binding. In principle, they can take a bunch of well-curated data points about interaction energies and predict what a new interaction will look like without ever explicitly modeling a single hydrogen bond. They don’t understand causation, but they can provide actionable correlations. The second and truly novel advantage of machine learning approaches, in my opinion, is that unlike traditional CADD, which applies to a very narrow part of drug discovery even on its best day, ML can in theory encompass multiple aspects of drug discovery simply by adding more columns to the Excel file. If one has multiple kinds of data – in vitro and in vivo binding affinity, mutations, metabolic data, solubility, clearance etc. – traditional CADD can take advantage of only one or two kinds, but ML can consider all the datasets at once. Machine learning thus provides pretty much the only way to solve the multiparameter optimization problem that is at the heart of drug discovery.
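As a toy illustration of the “more columns” point – with entirely invented numbers and a deliberately crude model, not any real pipeline – here is what treating several endpoints as extra columns of one table might look like. A single learner predicts every property at once, and adding a new endpoint is just adding a new column:

```python
import math

# Hypothetical training table: descriptor vectors (say molecular weight,
# logP, polar surface area) paired with several measured endpoints per
# compound. All values are invented for illustration.
training = [
    ((350.0, 2.1, 75.0), {"affinity": 7.2, "solubility": -3.1, "clearance": 12.0}),
    ((420.0, 3.5, 60.0), {"affinity": 8.0, "solubility": -4.2, "clearance": 25.0}),
    ((280.0, 1.2, 90.0), {"affinity": 5.9, "solubility": -2.0, "clearance": 8.0}),
]

def predict(descriptors):
    """Predict every endpoint at once from the nearest training compound.

    A 1-nearest-neighbour "model" – deliberately the crudest possible
    learner, with no physics of protein-ligand binding anywhere in it.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(training, key=lambda row: dist(row[0], descriptors))
    return nearest[1]

# One query returns all endpoints together - the multiparameter view.
print(predict((360.0, 2.3, 72.0)))
```

A real effort would of course use curated datasets and proper ML libraries, but the structural point stands: nothing in the model cares whether a column holds affinity, solubility or clearance.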

Now, as is well known, the real problem with data is not the physics but the quantity, quality and curation of the datasets (including negative data). But these are without a doubt getting better and more integrated across the various phases of drug discovery every day. Whether it’s structures in the PDB, synthetic reactions for making drugs or patient data in the TCGA (The Cancer Genome Atlas), the improvement in data quantity and quality is inexorable. And as this quantity and quality improves, so will the utility of the data for machine learning. This is a certainty. The naysayers are therefore fighting against a current which is only going to get stronger, even if it doesn’t quite turn into the tsunami that its practitioners say will engulf us very soon.

Are we there yet? The simple answer is no. And we probably won’t be, in terms of measurable impact, for some time, especially since we have to hack our way through so much hype to get to the nugget of causal or even correlative truth. The domain of applicability of even impressive techniques like deep learning is still limited (for instance, it works best on images), and the data problem is still a big one. But unless CADD embraces machine learning and AI – irrespective of the hype – it will always occupy a marginal perch in the grand stadium of drug development, and CADD scientists will increasingly be cast to the sidelines, in terms of both career and disciplinary progress. This will be especially true as both protein and drug classes expand into chemical and biological space (unstructured proteins, macrocycles etc.) that is even more challenging for traditional CADD than usual.

Sadly, as with many things these days, the mention of AI or ML in CADD continues to evoke two extreme reactions. The first, which is more common, is to say that it’s going to solve everything and is the Next Big Thing; the second, sadly also common, is to say that it’s all hype and that good old thinking will always be better. But good old thinking will always be around. The data is getting better and the machine learning is getting better; these are pretty much certainties which we ignore at our own peril. The bottom line is that unless CADD starts including machine learning – and the related paraphernalia of techniques that either accurately or misleadingly fall under the rubric of “AI” – in its very definition, its role in drug discovery and development threatens to dwindle to a point of vanishing relevance. This will do an enormous disservice both to CADD and to drug discovery in general.

Every year when the Nobel Prize in chemistry is announced, there's some bellyaching when the work leading to the prize turns out to be as much related to biology or physics as to chemistry. "But is it chemistry?", is the collective outcry. However, a more refined understanding of the matter over the years leads us to realize that chemistry has as much absorbed biology or physics as they have absorbed chemistry. Chemistry's definition has been changing ever since it emerged from the shroud of alchemy. Chemical biology, chemical physics and chemical engineering are as much chemistry as they are biology, physics and engineering. The same thing needs to happen to CADD in my opinion. By embracing computer science or automation or AI, by hiring ML engineers or data scientists in their midst, CADD is not losing its identity. It's showing that its identity is much bigger than anyone thought. That it is vast and contains multitudes.

Late hour

The fall turned colors faster than ever before. The streets never saw any activity. The whole gambit of Prometheus hinged on a mere coin flip. Richard Albrook gingerly closed his book and took a look around.
The café was almost deserted, college students and startup founders struggling to meet last minute deadlines, their faces a picture of desperate concentration. The baristas and their blues, the coffee with its vitriolic flavors. It seemed like the uneasy middle of time. Had not the soothsayer spoken with gusto and evident admiration for the march of destiny, he might have almost been forgiven for having a sense of whimsy.
Albrook had been languishing in this carved out area of spacetime until his visceral emotions had gotten the better of him. His friends had warned him that too much time with a speakeasy kind of permissive feeling would mark his doom. Not that feelings of doom had never crossed his mind, but this time it seemed all too real. Lost love, the convolutions of Clifford algebras and dandy details of daffodil pollination had always been seemingly on the verge of materializing in a cloud of abject reality, but the effect had been subtle at best.
It was this rather susceptible mix of preternaturally wholesome unification that Albrook was mulling over when the wizard walked in.
Two precisely round chocolate chip cookies and a latte, bitte, he croaked, even as his voice sounded like it would warp space with a brand new curvature tensor. Wouldn’t his friend want to know what exactly the latest abomination was? What would be the whole point of stealing a few moments from eviscerated time? What a world it was. What a world. Jesus, Mohammed, the Buddha, Bohr and Heisenberg and their tortured children marveling at mutilated meaning; all had had their say about the void, the complete mess of humanity and sweet free will that had been ours for the taking. But only the wizard seemed like he would have the answers.
He had been responsible, after all, for creating the fuzziest metamorphosis of all things pure since the metamorphosis. Kurt and Albert could have disturbed the whole universe newly and duly on their meanderings through the snow and the wizard wouldn’t have sneezed. The gall of the cosmic dance had sowed the seeds for a lion’s formidable roar of what turned out to be a mouse tweet. The ferns and bison and Ediacarans were small meat on the giant pancake, although who knew the biped would have shown such a cheerful, smooth indifference to the incredible gamble.
The drink fomented, the cookie corners cut, he settled down into a chair that seemed borrowed from Henry VIII’s custom-built collection. A sigh as heavy as whipped baryonic soup. A flash of the eyes declaring a concern best left to the accountants. Albrook guffawed silently at the sheer temerity of the multidimensional swindle. A child’s question would have cut to the chase, and still his laughter did not obliterate the deep anguish. What was here should have been lost through the ages, and yet it had somehow winded itself through survival of the fittest, through the fine structure constant and Rome and Versailles and plopped itself down on plush synthetic.
Somehow the flicker of courage appeared before time had a moment to reassert itself, and Albrook walked over, the scuff of his shoes drawing strange patterns from a misbegotten textile’s short half-life.
Good evening, wizard. And what gave you a change of heart?
The wizard quietly looked up, the embers in his pitiful flask, one among an infinite solution space of consumables, quietly giving off entropy to his own creations so that they could live and lie one more day. The wrinkles of spacetime dug deep into his forehead. It was an experiment, my friend. The parameters set, the goodness all baked in, the whole outcome an exciting – albeit combustible, I must admit – mix of speculation, fine-tuning and demonically predictive power. And one that spit out Santa Claus, asymptotic freedom and Amanita phalloides, ho ho ho, thank you very much, what wretched delight. To have carried the delicious burden of biophilia and single malt was a crowning glory, although Vibrio cholerae is a smirk too wide. The bridge too far that gets on the vicious underbelly of the other side deserves its own name, and even I find myself lexically challenged to uncover that particular mystery.
Albrook sighed with dim recognition. Yes, he said, he had had that sense of gnawing guilt that told him that he should have maybe said something that time, that time when the lever was being tried out for the first time, when the drooling promise of superficial glory was exploding with a million frenzied colors, when the ultimate party was cooking in the innards of some rotten genius mass of neural habitat for humanity. But the coin can fall either way, and he had had neither the heart nor the horror of instant recognition to have everyone step back. Late night scribblings in the café were the best he could accomplish. And he hadn’t hesitated before melting all his worries away in the wary delight of Zarathustra and Żubrówka. I mean what the hell; the gel of existence has to glisten on someone’s hair.
It’s all right, nobody made you the guardian of all that’s good and possible, said the wizard with a pensive expression of almost dead resignation. Don’t worry about it; you’ll be all right. The gurgling maw can be filled in again, but déjà vu never found itself a home in gladdened hearts. In any case, what has to be done has to be done; the cautious footsteps of improbability were never too bold for inevitability. He looked into his cup and saw the bottomed out hole filled with the veneer of hopeful damnation.
Straightening out his tunic, an upper bound of desire and planning writ into his expression, he extended his hand to Albrook. I’ll be seeing you. The quick, blurry movement of the door and the morbidly cold gust of air sealed the hands of time immemorial. Albrook stood there for a moment and walked back to fetch his hat and coat. Far away, on the Serengeti, a lion ripped into the pulsating veins of a forlorn gazelle.
First posted on 3 Quarks Daily.

What areas of chemistry could AI impact the most? An opinion poll

The other day I ran a survey asking about the potential areas of chemistry where AI could have the biggest impact.

There were 163 responses, which wasn't a bad representative sample. The responses are in fact in line with my own thinking: synthesis planning and automation emerge as the leading candidates.

I think synthesis planning AI will have the biggest impact on everyday lab operations during the next decade. Synthesis planning, while challenging, is still a relatively deterministic protocol based on a few good reactions and a large but digestible number of data points. Reactions like olefin metathesis and metal-mediated coupling have become robust and heavily used, generating thousands of machine-readable data points and demonstrating relative predictability; there are now fewer surprises, and whatever surprises exist are well-documented. As recent papers make clear, synthesis planning had been waiting in the wings for several years, both for the curation of millions of examples of successful and unsuccessful reactions and chemotypes and for better neural networks and computing power. Without the first development the second wouldn't have made a big difference, and it seems like we are finally getting there with good curation.

I was a bit surprised that materials science did not rank higher. Quantum chemical calculations for estimating optical, magnetic, electronic and other properties have been successful in materials science and have started enabling high-throughput studies in areas like MOFs and battery technology, so I expect this field to expand quite a bit during the next few years. As with computation in drug discovery, AI approaches don't need to predict every material property to three decimal places; they will have a measurable impact even if they can qualitatively rank different options and narrow down the pool, so that chemists have to spend fewer resources making candidates.
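To make the pool-narrowing point concrete, a minimal sketch – the candidate names and predicted scores below are entirely made up. The value lies in the qualitative ranking, not in the absolute numbers:

```python
# Hypothetical candidates with hypothetical predicted property scores
# (e.g. a model's estimate of gas uptake for a set of MOFs).
candidates = {
    "MOF-A": 0.91,
    "MOF-B": 0.34,
    "MOF-C": 0.77,
    "MOF-D": 0.12,
}

# Rank by predicted score and keep only the most promising half:
# these are the only ones a chemist would actually have to synthesize.
shortlist = sorted(candidates, key=candidates.get, reverse=True)[:2]
print(shortlist)
```

Even if the model misjudges each score by a wide margin, a roughly correct ordering halves the bench work in this sketch.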

Drug design, while already a beneficiary of compute, will see mixed results in my opinion over the next decade. For one thing, "drug design" is a catchall phrase that can include everything from basic protein-ligand structure prediction to toxicity prediction, with the latter at the challenging end of the spectrum. Structure-based design will likely benefit from deep learning that learns basic intermolecular interactions which are transferable across target classes, although such models will still be limited by the paucity of training data.

Areas like synthesis planning do contribute to drug design, but the real crux of successful drug design will be multiparameter optimization and SAR prediction, where an algorithm is able to successfully calculate multiple properties of interest like affinity, PK/PD and toxicity. PK/PD and toxicity are systemic effects that are complex and emergent, and I think the field will still not be able to make a significant dent in predicting idiosyncratic toxicity except in obvious cases. One area in which I see AI having a bigger impact is any part of drug discovery involving image recognition; for instance phenotypic screening, and perhaps the processing of images in cryo-EM and X-ray crystallography.

Finally, automation is one area where I do think AI will make substantial progress. This is partly due to better, more seamless integration of hardware and software, and partly because of better data generation and recording that will enable machine learning and related models to improve. This development, combined with reaction planning that allows scientists to test multiple hypotheses, will contribute, in my opinion, to automation making heavy inroads into the day-to-day work of chemists fairly soon.

Another area which I did not mention in the survey, but which will impact all of the above areas, is text mining. There the problem is one of discovering relationships between different entities (drugs and potential targets, for instance) that are not novel per se but are hidden in a thicket of patents, papers and other text sources too vast for humans to parse. Ideally, one would be able to combine text mining with intelligent natural language processing algorithms to enable new discovery through voice commands.
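As a crude sketch of the idea – with three invented sentences standing in for the literature, and invented entity lists – relationship mining can start as simple co-occurrence counting, long before any sophisticated NLP enters the picture:

```python
from collections import Counter
from itertools import product

# Hypothetical entity dictionaries; a real system would use curated
# vocabularies of drugs and targets.
drugs = {"imatinib", "erlotinib"}
targets = {"ABL1", "EGFR"}

# Invented stand-ins for sentences mined from patents and papers.
documents = [
    "imatinib inhibits ABL1 in chronic myeloid leukemia",
    "erlotinib is a selective EGFR inhibitor",
    "resistance to imatinib maps to ABL1 mutations",
]

# Count how often each drug-target pair appears in the same sentence;
# frequent pairs are candidate relationships worth a closer look.
cooccurrence = Counter()
for doc in documents:
    words = set(doc.split())
    for d, t in product(drugs & words, targets & words):
        cooccurrence[(d, t)] += 1

print(cooccurrence.most_common())
```

Real pipelines would add entity normalization, negation handling and proper language models on top, but even this bare counting surfaces the hidden-in-plain-sight links the paragraph above describes.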