“Citizenfour” is a documentary about NSA whistleblower Edward Snowden made by journalist and filmmaker Laura Poitras, one of the two journalists Snowden first contacted. It’s a gripping, professionally made, thought-provoking film that everyone should consider watching. But it’s especially worth watching because, at a deeper level, I think it goes to the heart not just of government surveillance but of the whole problem of picking useful nuggets of data out of an onslaught of potential dross. In fact, even the process of classifying data as “nuggets” or “dross” is fraught with problems.
The film reminded me of a piece on Edge.org by the noted historian of technology George Dyson in which he takes government surveillance to task, not just on legal or moral grounds but on basic technical ones. Dyson’s concern is simple: when you are trying to pick out the nugget of a dangerous idea from the morass of ideas out there, you are as likely to snare creative, good ideas in your net as bad ones. This can lead to a situation rife with false positives, in which you routinely flag – and, if everything goes right with your program, try to suppress – the good ideas. The problem arises partly because you don’t need to, and in fact cannot, classify every idea as “dangerous” or “safe” with one hundred percent accuracy; a rough guess is all the system aims for. As Dyson writes:
“The ultimate goal of signals intelligence and analysis is to
learn not only what is being said, and what is being done, but what is being
thought. With the proliferation of search engines that directly track the links
between individual human minds and the words, images, and ideas that both
characterize and increasingly constitute their thoughts, this goal appears
within reach at last. “But, how can the machine know what I think?” you ask. It
does not need to know what you think—no more than one person ever really knows
what another person thinks. A reasonable guess at what you are thinking is good
enough.”
And when you are settling for a rough guess, especially about someone’s complex thought processes, there’s obviously a much higher chance of making a mistake and getting the discrimination wrong.
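It’s worth pausing on just how bad the arithmetic gets when the thing you are hunting for is rare. Here is a back-of-the-envelope calculation; the numbers are invented for illustration and don’t come from any real program:

```python
# Toy base-rate calculation; every number here is hypothetical.
population = 1_000_000        # people being screened
truly_dangerous = 100         # rare genuine positives hidden in that population
sensitivity = 0.99            # chance a genuinely dangerous person gets flagged
false_positive_rate = 0.01    # chance an innocent person gets flagged anyway

true_positives = truly_dangerous * sensitivity                          # ~99
false_positives = (population - truly_dangerous) * false_positive_rate  # ~9,999

precision = true_positives / (true_positives + false_positives)
print(f"Total flagged: {true_positives + false_positives:,.0f}")
print(f"Fraction of flagged people who are actually dangerous: {precision:.1%}")
# Roughly 1 percent: about 99 of every 100 flags point at someone innocent.
```

The particular numbers don’t matter; the shape of the problem does. When the target is rare, even a filter that is 99 percent accurate produces a list made up almost entirely of false positives.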
The problem of separating the wheat from the chaff confronts every data analyst. Drug hunters, for example, who are trying to identify ‘promiscuous’ molecules – molecules that indiscriminately bind to multiple proteins in the body and can cause toxic side effects – have to sift through lists of millions of molecules to find the right ones. They do this using heuristic rules that tell them what kinds of molecular structures have been promiscuous in the past. But the past cannot foretell the future, partly because the very process of defining promiscuity is sloppy, and the rules inevitably capture a lot of ‘good’, non-promiscuous, perfectly druglike compounds. This problem with false positives applies to any kind of high-throughput process founded on empirical rules of thumb; there are always bound to be exceptions. The same problem arises when you sift through millions of snippets of DNA and try to assign causation, or even correlation, between specific genes and diseases.
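As a cartoon of how such a filter works, and how it misfires, here is a sketch. The structural ‘alerts’, the compounds and their annotations are all invented for illustration; this is not a real cheminformatics workflow:

```python
# Toy structural-alert filter; the alerts and compounds are invented examples,
# not a real promiscuity model.
FREQUENT_HITTER_ALERTS = {"quinone", "catechol", "rhodanine"}

compounds = [
    {"name": "cmpd-001", "features": {"quinone", "aryl ketone"}, "truly_promiscuous": True},
    {"name": "cmpd-002", "features": {"amide", "piperidine"},    "truly_promiscuous": False},
    # A well-behaved, druglike compound that happens to carry a flagged feature:
    {"name": "cmpd-003", "features": {"catechol", "amide"},      "truly_promiscuous": False},
]

def flag_promiscuous(compound):
    """Flag a compound if it contains any structural alert on the list."""
    return bool(compound["features"] & FREQUENT_HITTER_ALERTS)

for c in compounds:
    flagged = flag_promiscuous(c)
    if flagged and not c["truly_promiscuous"]:
        verdict = "false positive"
    elif flagged == c["truly_promiscuous"]:
        verdict = "correct"
    else:
        verdict = "missed"
    print(f'{c["name"]}: flagged={flagged} ({verdict})')
```

The filter does exactly what it was told to do and still flags cmpd-003, because the rule encodes a structural feature that misbehaved in past assays rather than the actual behavior of this particular molecule against real proteins.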
What’s really intriguing about Dyson’s objection, though, is that it appeals to a fundamental limitation on accomplishing this discrimination, one that cannot be overcome even by enlisting every supercomputer in the world.
Alan Turing jump-started the field of modern computer science when he proved that no algorithm, however powerful, can determine whether an arbitrary string of symbols represents a provable statement (the so-called ‘Decision Problem’ articulated by David Hilbert). Turing gave the world the computational counterpart of Kurt Gödel’s Incompleteness Theorem and Heisenberg’s Uncertainty Principle: there is code whose behavior can only be judged by actually running it, not by any preexisting test. Similarly, Dyson contends that the only way to truly distinguish good ideas from bad ones is to let them play out in reality. Nobody is actually advocating that every potentially bad idea be allowed to play out, but the argument does underscore the fundamental problem with trying to pre-filter good ideas from bad ones. As he puts it:
“The Decision Problem, articulated by Göttingen’s David Hilbert,
concerned the abstract mathematical question of whether there could ever be any
systematic mechanical procedure to determine, in a finite number of steps,
whether any given string of symbols represented a provable statement or not.
The answer was no. In modern computational terms (which just
happened to be how, in an unexpected stroke of genius, Turing framed his
argument) no matter how much digital horsepower you have at your disposal,
there is no systematic way to determine, in advance, what every given string of
code is going to do except to let the codes run, and find out. For any system
complicated enough to include even simple arithmetic, no firewall that admits
anything new can ever keep everything dangerous out…
There is one problem—and it is the Decision Problem once again. It
will never be entirely possible to systematically distinguish truly dangerous
ideas from good ones that appear suspicious, without trying them out. Any
formal system that is granted (or assumes) the absolute power to protect itself
against dangerous ideas will of necessity also be defensive against original
and creative thoughts. And, for both human beings individually and for human
society collectively, that will be our loss. This is the fatal flaw in the
ideal of a security state.”
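Turing’s argument can be retold as a short thought experiment in code. Suppose someone hands you a perfect pre-filter: a function that reads any program’s text and decides, without running it, whether the program will do something dangerous. (Everything below is hypothetical and informal; the whole point of Turing’s result is that such a perfect filter cannot exist in general.) You can always write a program that asks the filter about itself and then does the opposite:

```python
# Informal sketch of Turing's diagonal argument, adapted to the "dangerous idea"
# framing. The "perfect" pre-filter is hypothetical; this is a thought experiment.
import inspect

def naive_filter(source_text):
    # A stand-in pre-filter: flags any program whose text contains the word "harmful".
    return "harmful" in source_text

def contrarian(pre_filter, own_source):
    """Does the opposite of whatever the pre-filter predicts about it."""
    if pre_filter(own_source):
        return "flagged as dangerous, so I behave innocuously"
    return "judged safe, so I go ahead and misbehave"

# Hand the filter the contrarian's own source code:
print(contrarian(naive_filter, inspect.getsource(contrarian)))
```

Whatever verdict the filter returns about this program, the program behaves the other way, so no filter that merely reads code (or ideas) without running them can be right about every case. That, in miniature, is the flaw Dyson sees in the ideal of a perfectly pre-screened security state.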
In one sense this problem is not new, since governments and private corporations alike have been trying to single out and suppress what they deem dangerous ideas for centuries; it’s a tradition that goes back at least to medieval book burning. But unlike a book, which you can at least read and evaluate in full, evaluating ideas from snippets, indirect connections, Google links and metadata is tenuous at best and highly unlikely to succeed. That is the fundamental barrier facing agencies that try to infer thoughts and actions from Google searches and Facebook profiles, and no amount of sophisticated computing power and data is likely to let them solve the general problem.
Ultimately, whether it’s government agencies, drug hunters, genomics experts or corporations, the temptation to fall prey to what the writer Evgeny Morozov calls “technological solutionism” – the belief that key human problems will succumb to the latest technological advances – can be overpowering. But when you are dealing with people’s lives you need to be a little warier of technological solutionism than when you are dealing with the latest garbage disposal appliance or a new app that finds nearby ice cream places. There is not just a legal and ethical imperative but a purely scientific one to treat data with respect, and to disabuse yourself of the notion that you could completely understand it if only you threw more manpower, computing power and resources at it. A similar problem awaits us as computation is applied ever more widely to biotechnology and medicine.
At the end of his piece Dyson recounts a conversation he had with Herbert York, a powerful defense establishment figure who designed nuclear weapons, advised presidents and oversaw billions of dollars in defense and scientific funding. York cautions us to be wary not just of Eisenhower’s famed military-industrial complex but of the scientific-technological elite that has aligned itself with the defense establishment for the last fifty years. With the advent of massive amounts of data this alignment is honing itself into an entity with more power over our lives than ever before. At the same time we have never been in greater need of the scientific and technological tools that allow us to make sense of the sea of data that engulfs us. And that, as York says, is precisely why we need to beware of it.
“York understood the workings of what Eisenhower termed the
military-industrial complex better than anyone I ever met. “The Eisenhower
farewell address is quite famous,” he explained to me over lunch. “Everyone
remembers half of it, the half that says beware of the military-industrial
complex. But they only remember a quarter of it. What he actually said was that
we need a military-industrial complex, but precisely because we need it, beware
of it. Now I’ve given you half of it. The other half: we need a
scientific-technological elite. But precisely because we need a
scientific-technological elite, beware of it. That’s the whole thing, all four
parts: military-industrial complex; scientific-technological elite; we need it,
but beware; we need it but beware. It’s a matrix of four.”
It’s a lesson that should particularly be taken to heart in industries like biotechnology and pharmaceuticals, where standardized, black-box computational protocols are becoming everyday tools of the trade. Whether the button being pushed launches a nuclear strike or sorts drugs from non-drugs in a list of millions of candidates, temptation and disaster both await us at the other end.
The process of turning data into information--actionable, decision-supporting information--has had a more thorough inspection at the CIA than most people think. The work of Richards Heuer comes to mind; his "Psychology of Intelligence Analysis" is a valuable read. In it, he catalogs the cognitive biases and quirks that come into play in decision-making. It's a more sophisticated and self-aware review than I've seen from the overwhelming majority of scientist/managers I've worked with.
Thanks for the interesting Heuer reference; will check it out.