Field of Science

NSA, data uncertainty, and the problem of separating bad ideas from good ones

“Citizenfour” is a movie about NSA whistleblower Edward Snowden made by journalist Laura Poitras, one of the two journalists Snowden contacted. It’s a gripping, professional, thought-provoking movie that everyone should consider watching. But it is especially worth watching because at a deeper level, I think, it goes to the heart not just of government surveillance but of the whole problem of picking useful nuggets of data out of an onslaught of potential dross. In fact even the very process of classifying data as “nuggets” or “dross” is fraught with problems.

I was reminded of this problem as my mind went back to a piece on Edge.org by the noted historian of technology George Dyson, in which he takes government surveillance to task not just on legal or moral grounds but on basic technical ones. Dyson’s concern is simple: when you are trying to identify that nugget of a dangerous idea in the morass of ideas out there, you are as likely to snare creative, good ideas in your net as bad ones. This can lead to a situation rife with false positives, where you routinely flag – and, if everything goes right with your program, try to suppress – the good ideas. The problem arises partly because such a system does not need to, and in fact cannot, classify every idea as “dangerous” or “safe” with one hundred percent accuracy; all it needs is a rough idea. As Dyson writes:

“The ultimate goal of signals intelligence and analysis is to learn not only what is being said, and what is being done, but what is being thought. With the proliferation of search engines that directly track the links between individual human minds and the words, images, and ideas that both characterize and increasingly constitute their thoughts, this goal appears within reach at last. “But, how can the machine know what I think?” you ask. It does not need to know what you think—no more than one person ever really knows what another person thinks. A reasonable guess at what you are thinking is good enough.”

And when you are trying to get a rough idea, especially of something as complex as another person’s thought process, there’s obviously a much higher chance of making a mistake and failing the discrimination test.
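The arithmetic of base rates makes this concrete. Here is a short back-of-the-envelope calculation in Python – with entirely hypothetical numbers, chosen only to illustrate the point – showing how badly a classifier hunting for something rare drowns in false positives:

    # Toy base-rate calculation: even a very accurate classifier drowns in
    # false positives when the thing it hunts for is rare. All numbers here
    # are hypothetical, chosen only to illustrate the arithmetic.

    population      = 100_000_000   # ideas/messages scanned
    prevalence      = 1e-6          # fraction that are genuinely "dangerous"
    sensitivity     = 0.99          # chance a dangerous idea gets flagged
    false_pos_rate  = 0.01          # chance a harmless idea gets flagged anyway

    true_dangerous  = population * prevalence
    true_positives  = true_dangerous * sensitivity
    false_positives = (population - true_dangerous) * false_pos_rate

    precision = true_positives / (true_positives + false_positives)
    print(f"Flags raised:       {true_positives + false_positives:,.0f}")
    print(f"Actually dangerous: {true_positives:,.0f}")
    print(f"Precision:          {precision:.4%}")  # ~0.01% -- nearly every flag is a good idea caught in the net

With these numbers, a classifier that is 99 percent accurate in both directions raises about a million flags, of which roughly a hundred are real: the net is almost entirely full of innocents.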
The problem of separating the wheat from the chaff confronts every data analyst. For example, drug hunters trying to identify ‘promiscuous’ molecules – molecules that indiscriminately bind to multiple proteins in the body and can cause toxic side effects – have to sift through lists of millions of molecules to find the right ones. They do this using heuristics that encode what kinds of molecular structures have been promiscuous in the past. But the past cannot foretell the future, partly because the very process of defining these molecules is sloppy and inevitably captures plenty of ‘good’, non-promiscuous, perfectly druglike compounds. This problem with false positives applies to any high-throughput process founded on empirical rules of thumb; there are always bound to be exceptions. The same problem arises when you sift through millions of snippets of DNA and try to assign causation, or even correlation, between specific genes and diseases.
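To see how such a filter mislabels good compounds, here is a deliberately crude Python sketch of a property-based promiscuity rule. The thresholds and molecules are hypothetical; real filters (PAINS substructure rules, for instance) are far more elaborate, but they fail in exactly the same way:

    # Minimal sketch of a rule-of-thumb "promiscuity" filter of the kind
    # described above. Property thresholds and molecules are hypothetical.

    def looks_promiscuous(mol):
        """Flag a molecule if it matches crude historical rules of thumb."""
        return (mol["logp"] > 5            # very greasy molecules tend to bind many proteins
                or mol["n_aromatic_rings"] >= 4
                or mol["is_redox_active"])

    candidates = [
        {"name": "cmpd-001", "logp": 6.2, "n_aromatic_rings": 5, "is_redox_active": False},
        {"name": "cmpd-002", "logp": 2.1, "n_aromatic_rings": 1, "is_redox_active": False},
        # A false positive: greasy by the rules, but (hypothetically) clean in assays.
        {"name": "cmpd-003", "logp": 5.4, "n_aromatic_rings": 2, "is_redox_active": False},
    ]

    for mol in candidates:
        verdict = "flag" if looks_promiscuous(mol) else "keep"
        print(mol["name"], "->", verdict)

The third compound gets flagged purely on its logP even though, by stipulation, it behaves perfectly well in assays; no tweak to the thresholds can eliminate such cases without letting genuinely promiscuous molecules through.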
What’s really intriguing about Dyson’s objection, though, is that it points to a very fundamental limitation on this kind of discrimination, one that cannot be overcome even by engaging the services of every supercomputer in the world.

Alan Turing jump-started the field of modern computer science when he proved that no algorithm, however powerful, can decide in general whether an arbitrary string of symbols represents a provable statement – the so-called ‘Decision Problem’ posed by David Hilbert. Turing gave the world the computational counterpart of Kurt Gödel’s incompleteness theorems and Heisenberg’s uncertainty principle: there is code whose behavior can only be judged by actually running it, not by any preexisting test. Similarly, Dyson contends that the only way to truly distinguish good ideas from bad ones is to let them play out in reality. Nobody is actually advocating that every potentially bad idea be allowed to play out, but the argument does underscore the fundamental problem with trying to pre-filter good ideas from bad ones. As he puts it:
“The Decision Problem, articulated by Göttingen’s David Hilbert, concerned the abstract mathematical question of whether there could ever be any systematic mechanical procedure to determine, in a finite number of steps, whether any given string of symbols represented a provable statement or not.

The answer was no. In modern computational terms (which just happened to be how, in an unexpected stroke of genius, Turing framed his argument) no matter how much digital horsepower you have at your disposal, there is no systematic way to determine, in advance, what every given string of code is going to do except to let the codes run, and find out. For any system complicated enough to include even simple arithmetic, no firewall that admits anything new can ever keep everything dangerous out…

There is one problem—and it is the Decision Problem once again. It will never be entirely possible to systematically distinguish truly dangerous ideas from good ones that appear suspicious, without trying them out. Any formal system that is granted (or assumes) the absolute power to protect itself against dangerous ideas will of necessity also be defensive against original and creative thoughts. And, for both human beings individually and for human society collectively, that will be our loss. This is the fatal flaw in the ideal of a security state.”
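To make the analogy concrete, here is the diagonal argument behind Turing’s result written as a Python sketch. The “oracle” is hypothetical by construction; the whole point of the argument is that it cannot be implemented:

    # The diagonal argument behind Turing's result, as a Python sketch.
    # Suppose someone hands us a perfect pre-filter:

    def halts(program, data):
        """Claimed oracle: True iff program(data) eventually halts."""
        ...  # no such implementation can exist -- that is the point

    # With that oracle in hand, we could build the following saboteur:

    def paradox(program):
        if halts(program, program):
            while True:       # oracle says "halts"? Then loop forever.
                pass
        return "halted"       # oracle says "loops"? Then halt at once.

    # Feeding paradox to itself breaks the oracle either way:
    # paradox(paradox) halts exactly when halts() says it does not.
    # So no general, in-advance test exists; you have to run the code.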

In one sense this problem is not new, since governments and private corporations alike have tried for centuries to identify and suppress what they deem dangerous ideas; it’s a tradition that goes back to medieval book burning. But unlike a book, which you can at least read and evaluate, the evaluation of ideas based on snippets, indirect connections, Google links and metadata is tenuous at best. That is the fundamental barrier facing agencies that try to infer thoughts and actions from Google searches and Facebook profiles, and no amount of sophisticated computing power and data is likely to let them solve the general problem.

Ultimately, whether it’s government agencies, drug hunters, genomics experts or corporations, the temptation to fall prey to what the writer Evgeny Morozov calls “technological solutionism” – the belief that key human problems will succumb to the latest technological advances – can be overpowering. But when you are dealing with people’s lives you need to be warier of technological solutionism than when you are dealing with the latest garbage disposal appliance or a new app for finding nearby ice cream places. There is not just a legal and ethical imperative but a purely scientific one to treat data with respect, and to disabuse yourself of the notion that you could completely understand it if only you threw more manpower, computing power and resources at it. A similar problem awaits us in the application of computation to problems in biotechnology and medicine.

At the end of his piece Dyson recounts a conversation with Herbert York, a powerful defense establishment figure who designed nuclear weapons, advised presidents and oversaw billions of dollars in defense and scientific funding. York cautions us to be wary not just of Eisenhower’s famed military-industrial complex but of the scientific-technological elite that has aligned itself with the defense establishment for the last fifty years. With the advent of massive amounts of data, this alignment is honing itself into an entity with more power over our lives than ever before. At the same time, we have never been in greater need of the scientific and technological tools that allow us to make sense of the sea of data that engulfs us. And that, as York says, is precisely why we need to beware of it.
“York understood the workings of what Eisenhower termed the military-industrial complex better than anyone I ever met. “The Eisenhower farewell address is quite famous,” he explained to me over lunch. “Everyone remembers half of it, the half that says beware of the military-industrial complex. But they only remember a quarter of it. What he actually said was that we need a military-industrial complex, but precisely because we need it, beware of it. Now I’ve given you half of it. The other half: we need a scientific-technological elite. But precisely because we need a scientific-technological elite, beware of it. That’s the whole thing, all four parts: military-industrial complex; scientific-technological elite; we need it, but beware; we need it but beware. It’s a matrix of four.”

It’s a lesson that should particularly be taken to heart in industries like biotechnology and pharmaceuticals, where standard, black-box computational protocols are becoming everyday utilities of the trade. Whether the button you push launches a nuclear war or sorts drugs from non-drugs in a list of millions of candidates, temptation and disaster both await at the other end.

Biologists, chemists, math and computing

Here are some words of wisdom from C. Titus Brown, a biology professor at Michigan State University, on the critical importance of quantitative skills – math, statistics and computing – in biology. The larger question is what it means to ‘do biology’ in an age when the field has become so interdisciplinary.

Brown's post itself is based on another post by Sean Eddy at Janelia Farm on how biologists must learn scripting these days. Here's Brown:
So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective here, and critical thinking is presumably pretty important there; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level here. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary. 
The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.
And here's Eddy:
The most important thing I want you to take away from this talk tonight is that writing scripts in Perl or Python is both essential and easy, like learning to pipette. Writing a script is not software programming. To write scripts, you do not need to take courses in computer science or computer engineering. Any biologist can write a Perl script. A Perl or Python script is not much different from writing a protocol for yourself. The way you get started is that someone gives you one that works, and you learn how to tweak it to do more of what you need. After a while, you’ll find you’re writing your own scripts from scratch. If you aren’t already competent in Perl and you’re dealing with sequence data, you really need to take a couple of hours and write your first Perl script.  
The thing is, large-scale biological data are almost as complicated as the organism itself. Asking a question of a large, complex dataset is just like doing an experiment.  You can’t see the whole dataset at once; you have to ask an insightful question of it, and get back one narrow view at a time.  Asking insightful questions of data takes just as much time, and just as much intuition as asking insightful questions of the living system.  You need to think about what you’re asking, and you need to think about what you’re going to do for positive and negative controls. Your script is the experimental protocol. The amount of thought you will find yourself putting into many different experiments and controls vastly outweighs the time it will take you to learn to write Perl.
And they are both right. The transformation of biology over the last thirty years or so into a data-intensive (especially sequencing-data-intensive) discipline means that it’s no longer optional to apply the tools of information technology, math and computing to biological problems. You either use the tools yourself or you collaborate with someone who knows how to use them. Collaboration has of course also become much easier in the last thirty years, but it’s easiest to do it yourself if your goal is simply to write small but useful scripts for manipulating data, and R and Python or Perl are usually both necessary and sufficient for such purposes. Both Eddy and Brown also raise important questions about including such tools in the standard education of undergraduate and graduate students in biology, something that is not yet common practice.
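For a concrete sense of the kind of script Eddy describes, here is a minimal, hypothetical example in Python: read a FASTA file of sequences and report the GC content of each. The input filename is a placeholder, and the parsing is deliberately bare-bones:

    # A typical "biologist's script" of the kind Eddy describes: parse a
    # FASTA file and report length and GC content for each sequence.

    def read_fasta(path):
        """Yield (header, sequence) pairs from a FASTA file."""
        header, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line[1:], []
                elif line:
                    seq.append(line.upper())
            if header is not None:
                yield header, "".join(seq)

    for name, seq in read_fasta("reads.fasta"):   # hypothetical input file
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        print(f"{name}\t{len(seq)} bp\tGC {gc:.1%}")

Twenty-odd lines, no computer science degree required – exactly the “protocol for yourself” that Eddy is talking about.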
What I find interesting and rather ironic is that for most of its existence biology was a highly descriptive, experimental science that seemed to have nothing to do with mathematical analysis. This began to change with the advent of genetic analysis in the early twentieth century, but the change did not really reach molecular biology until the growth of sequencing tools.

Another irony is that chemistry is usually considered more quantitative and mathematical than biology, especially in subfields like physical chemistry. And yet today the average organic or inorganic chemist rarely needs to learn scripting or statistical analysis, although computational chemists have of course long recognized the value of cheminformatics and data analysis. By comparison, the average biologist has become far more dependent on such tools.

However, I think this scenario in chemistry is changing. Organic synthesis may not be information-intensive right now, but automated organic synthesis – which I personally think will be revolutionary at some point in the future – will undoubtedly generate large amounts of data that will have to be analyzed and made sense of statistically. One specific area where I think this will almost certainly happen is ‘accelerated serendipity’, which involves screening catalysts and correlating molecular properties with catalytic activity; some chemists are also looking at network theory to calculate optimal synthetic paths. To use these tools effectively, the average chemist will at some point have to borrow from the computational chemist’s toolkit, and perhaps chemistry will then have become as data-intensive as biology. It’s never too early to pick up that Python book or look at that R course.
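In that spirit, here is a minimal Python sketch of the kind of correlation analysis an ‘accelerated serendipity’ screen implies. The descriptor values and yields below are invented for illustration; a real campaign would use computed descriptors and measured catalytic activities:

    # Sketch of the analysis implied by "accelerated serendipity": screen a
    # batch of catalysts, then ask whether a cheap molecular descriptor
    # tracks catalytic activity. All data here are invented.

    import statistics

    # (descriptor value, observed yield %) for a hypothetical catalyst screen
    screen = [(0.42, 12.0), (0.58, 31.0), (0.61, 35.5), (0.75, 54.0),
              (0.80, 49.5), (0.91, 68.0), (1.05, 77.5), (1.10, 81.0)]

    xs = [d for d, _ in screen]
    ys = [y for _, y in screen]

    # Pearson correlation between descriptor and yield, computed by hand
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = cov / (statistics.stdev(xs) * statistics.stdev(ys) * (len(xs) - 1))
    print(f"Pearson r = {r:.3f}")  # a strong correlation suggests the descriptor is worth pursuing

Nothing here is beyond a first Python course, which is rather the point: the barrier for the average chemist is habit, not difficulty.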