Field of Science

The three horsemen of the machine learning apocalypse

My colleague Patrick Riley from Google has a good piece in Nature in which he describes three very common errors in applying machine learning to real-world problems. The errors are general enough to apply to all uses of machine learning irrespective of field, so they certainly apply to a lot of the machine learning work that has been going on in drug discovery and chemistry.

The first kind of error is an incomplete split between training and test sets. People who do ML in drug discovery have encountered this problem often; the test set can be very similar to the training set, or - as Patrick mentions here - the training and test sets aren't really picked at random. There should be a clear separation between the two sets, and the impressive algorithms are the ones which extrapolate non-trivially from the former to the latter. Only careful examination of the training and test sets can ensure that the differences are real.
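
One way to enforce that separation is to split by group rather than by individual data point, so that every member of a group (say, every compound sharing a scaffold) lands on the same side of the split. Here is a minimal sketch in plain Python. The compound records, the scaffold labels and the group_split helper are all made up for illustration; they are not from Patrick's article.

```python
import random
from collections import defaultdict

def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that no group appears in both train and test."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    keys = sorted(groups)              # sorted first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # Whole groups go to the test set until it reaches the target size;
        # every remaining group goes to the training set.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test

# Hypothetical data: (compound_id, scaffold_label) pairs, 5 scaffolds in total.
compounds = [(f"cpd{i}", f"scaffold{i % 5}") for i in range(50)]
train, test = group_split(compounds, group_of=lambda c: c[1])

train_scaffolds = {scaffold for _, scaffold in train}
test_scaffolds = {scaffold for _, scaffold in test}
print("train:", len(train), "test:", len(test))
print("shared scaffolds:", train_scaffolds & test_scaffolds)
```

A purely random split of the same data would almost certainly put every scaffold in both sets, which is exactly the kind of leak that makes a model look better than it is. How you define a group (scaffold, assay batch, publication year) is a judgment call that depends on what kind of extrapolation you care about.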

Another more serious problem with training data is of course the many human biases that have been exposed over the last few years, biases arising in fields ranging from hiring to facial recognition. The problem is that it's almost impossible to find training data that doesn't have some sort of human bias (in that context, general image data usually works pretty well because of the sheer number of random images human beings capture), and it's very likely that this hidden bias is what your model will then capture. Even chemistry is not immune from such biases; for instance, if your training data contains compounds synthesized using metal-catalyzed coupling reactions and is therefore enriched in biaryls, you will be training an algorithm that is excellent at identifying biaryls, drug scaffolds that are known to have issues with stability and clearance in the body.

The second problem is that of hidden variables, and this is especially the case with unsupervised learning where you let loose your algorithm on a bunch of data and expect it to learn relevant features. The problem is that there are a very large number of features in the data that your algorithm could potentially learn and find correlations with, and a good number of these might be noise or random features that would give you a good correlation while being physically irrelevant. A couple of years ago there was a good example of an algorithm used to classify tumors learning nothing about the tumors per se but instead learning features of rulers; it turns out that oncologists often keep rulers next to malignant tumors to measure their dimensions, and these were visible in the pictures. 

Closer to the world of chemistry, there was a critique last year of an algorithm that was supposed to pick an optimal combination of reaction conditions for a synthetic Buchwald-Hartwig reaction. This is a rather direct application of machine learning in chemistry, and one of the most promising ones in my view, partly because reaction optimization is still very much a trial-and-error art and partly because it is far more deterministic than, say, finding a new drug target based on sparse genomic correlations. The critique pointed out that you could get the same results if you randomized the data or fit the model on noise. That doesn't mean the original model was wrong; it means that it wasn't unique and wasn't likely causative. Asking what exactly your model is fitting to is always a good idea.
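
The randomization control is simple enough to sketch. The idea is to shuffle the measured outcomes so that they no longer correspond to their inputs, refit, and see whether the score drops. The example below uses synthetic data and a one-variable least-squares fit as a stand-in for a real model; none of it comes from the reaction-optimization paper or its critique, and the function names are my own.

```python
import random

def r_squared(xs, ys):
    """R^2 of a one-variable least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def scrambled_scores(xs, ys, n_rounds=200, seed=0):
    """Refit after shuffling ys, n_rounds times, and collect the R^2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]            # copy so the real outcomes stay untouched
        rng.shuffle(shuffled)
        scores.append(r_squared(xs, shuffled))
    return scores

# Synthetic data with a genuine relationship: y = 2x + noise (an assumption
# made purely for illustration).
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

real = r_squared(xs, ys)
null = scrambled_scores(xs, ys)
fraction_as_good = sum(score >= real for score in null) / len(null)

print(f"R^2 on real outcomes:       {real:.3f}")
print(f"best R^2 on shuffled data:  {max(null):.3f}")
print(f"fraction of shuffles >= real: {fraction_as_good:.3f}")
```

Here the real fit should comfortably beat every shuffled one, because the relationship was built into the data. If a model scores about as well on shuffled outcomes as on the real ones, it is fitting something other than the chemistry.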

As Patrick's article points out, there are other examples, like an algorithm latching on to edge effects of plates in a biological assay or in image analysis in phenotypic screening, two other applications very relevant to drug discovery. The remedy here again is to run many different models while asking many different questions, a process that needs patience and foresight. Another strategy, which I increasingly like, would be to replace unsupervised learning with constrained learning, with the constraints coming from the laws of science.
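
For the plate example, one of those questions can be asked before any model is trained: does well position alone explain the signal? The sketch below compares edge wells with interior wells on a simulated plate. The 96-well layout, the signal values and the size of the artificial edge bias are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

ROWS, COLS = 8, 12   # a 96-well layout, assumed for illustration

def is_edge(row, col):
    """True for wells on the outer ring of the plate."""
    return row in (0, ROWS - 1) or col in (0, COLS - 1)

def edge_effect(plate):
    """Edge-minus-interior mean signal, in units of the plate's overall SD."""
    edge, inner = [], []
    for r in range(ROWS):
        for c in range(COLS):
            (edge if is_edge(r, c) else inner).append(plate[r][c])
    return (mean(edge) - mean(inner)) / stdev(edge + inner)

def simulate_plate(edge_bias, seed):
    """Random signal around 100 with an optional additive bias on edge wells."""
    rng = random.Random(seed)
    return [[rng.gauss(100, 5) + (edge_bias if is_edge(r, c) else 0.0)
             for c in range(COLS)]
            for r in range(ROWS)]

clean = edge_effect(simulate_plate(edge_bias=0.0, seed=0))
biased = edge_effect(simulate_plate(edge_bias=15.0, seed=0))
print(f"edge effect, clean plate:  {clean:+.2f} SD")
print(f"edge effect, biased plate: {biased:+.2f} SD")
```

A large value on real plates would suggest that any model given well position, or anything correlated with it, has an easy shortcut available. It says nothing about whether the model actually took it, which is why the other controls still matter.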

The last problem is a bit more subtle and involves using the wrong objective or "loss" function. A lot of this boils down to asking the right question. Patrick cites the example of using ML to diagnose diabetic retinopathy using images of the back of the eye. It turns out that when the question was framed around diagnosing a single disease rather than around whether the patient needs to see a doctor, the models were thrown into disarray.

So what's the solution? What it always has been. As the article says,

"First, machine-learning experts need to hold themselves and their colleagues to higher standards. When a new piece of lab equipment arrives, we expect our lab mates to understand its functioning, how to calibrate it, how to detect errors and to know the limits of its capabilities. So, too, with machine learning. There is no magic involved, and the tools must be understood by those using them."

Would you expect the developer of a new NMR technique or a new quantum chemistry calculation algorithm to not know what lies under the hood? Would you expect these developers to not run tests using many different parameters and under different controls and conditions? For that matter, would you expect a soldier to go into battle without understanding the traps the enemy has laid? Then why expect developers of machine learning to operate otherwise?

Some of it is indeed education, but much of it involves the same standards and values that have been part of the scientific and engineering disciplines since antiquity. Unfortunately, too often machine learning, especially because of its black-box nature, is regarded as magic. But there is no magic (Arthur Clarke quotes notwithstanding). It's all careful, meticulous investigation; it's about going into the field knowing that there almost certainly will be a few mines scattered here and there. Be careful if you don't want your model to get blown up.