Biologists, chemists, math and computing

Here are some words of wisdom from C. Titus Brown, a biology professor at Michigan State University, on the critical importance of quantitative math, stats and computing skills in biology. The larger question is what it means to 'do biology' in an age when biology has become so interdisciplinary. 

Brown's post itself is based on another post by Sean Eddy at Janelia Farm on how biologists must learn scripting these days. Here's Brown:
So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective here, and critical thinking is presumably pretty important there; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level here. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary. 
The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.
And here's Eddy:
The most important thing I want you to take away from this talk tonight is that writing scripts in Perl or Python is both essential and easy, like learning to pipette. Writing a script is not software programming. To write scripts, you do not need to take courses in computer science or computer engineering. Any biologist can write a Perl script. A Perl or Python script is not much different from writing a protocol for yourself. The way you get started is that someone gives you one that works, and you learn how to tweak it to do more of what you need. After a while, you’ll find you’re writing your own scripts from scratch. If you aren’t already competent in Perl and you’re dealing with sequence data, you really need to take a couple of hours and write your first Perl script.  
The thing is, large-scale biological data are almost as complicated as the organism itself. Asking a question of a large, complex dataset is just like doing an experiment.  You can’t see the whole dataset at once; you have to ask an insightful question of it, and get back one narrow view at a time.  Asking insightful questions of data takes just as much time, and just as much intuition as asking insightful questions of the living system.  You need to think about what you’re asking, and you need to think about what you’re going to do for positive and negative controls. Your script is the experimental protocol. The amount of thought you will find yourself putting into many different experiments and controls vastly outweighs the time it will take you to learn to write Perl.
And they are both right. The transformation of biology over the last thirty years or so into a data-intensive (especially sequencing-data-intensive) and information-intensive discipline means that applying the tools of information technology, math and computing to biological problems is no longer optional. You either use the tools yourself or you collaborate with someone who knows how to use them. Collaboration has of course also become much easier in the last thirty years, but if your goal is simply to write some small but useful scripts for manipulating data, it's easiest to do it yourself. R and either Python or Perl are usually necessary, and also sufficient, for such purposes. Both Eddy and Brown also raise important questions about including such tools in the traditional education of undergraduate and graduate students in biology, something that's not yet standard practice.
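To make Eddy's point about scripts and controls concrete, here is a minimal sketch of the kind of small script he has in mind, written in Python. It is purely illustrative: the function names `parse_fasta` and `gc_content` and the two example sequences are made up for this post, not taken from Eddy or Brown. The script asks one narrow question of some sequence data (what fraction of each sequence is G or C?) and checks a positive and a negative control before trusting the answer.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the '>' marker.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current record.
            records[name] += line.upper()
    return records


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Controls with known answers, checked before trusting any real result.
assert gc_content("GGCC") == 1.0  # positive control: all G/C
assert gc_content("ATAT") == 0.0  # negative control: no G/C

# Hypothetical example data; in practice this would be read from a file.
example = """>seq1 made-up record
ATGCGC
GGAT
>seq2 another made-up record
ATATATAA
"""

for name, seq in parse_fasta(example).items():
    print(f"{name}\t{len(seq)}\t{gc_content(seq):.2f}")
```

The two `assert` lines play the role of the controls in a bench protocol: if either fails, the script stops before reporting anything about the real data.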
What I find interesting and rather ironic is that for most of its existence biology was a highly descriptive and experimental science that seemed to have nothing to do with mathematical analysis; it looked like the opposite of math. This changed partially with the advent of genetic analysis in the early twentieth century, but the change did not really reach the molecular basis of biology until the growth of sequencing tools.
Another irony is that chemistry is usually considered to be more data-intensive and mathematical than biology, especially in the form of subfields like physical chemistry. And yet today we don't really find the average organic or inorganic chemist facing a need to learn and apply scripting or statistical analysis, although computational chemists have of course long since recognized the value of cheminformatics and data analysis. By comparison, the average biologist has become much more dependent on such tools.
However, I think that this scenario in chemistry is changing. Organic synthesis may not be information-intensive right now, but automated organic synthesis - which I personally think is going to be revolutionary at some point in the future - will undoubtedly lead to large amounts of data that will have to be analyzed and made sense of statistically. One specific area of chemistry where I think this will almost certainly happen is the application of 'accelerated serendipity', which involves the screening of catalysts and the correlation of molecular properties with catalytic activity. Some chemists are also looking at network theory to calculate optimal synthetic paths. At some point, to use these approaches effectively, the average chemist will have to borrow tools from the computational chemist, and perhaps chemistry will then become as data-intensive as biology. It's never too early to pick up that Python book or look at that R course.
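As a rough illustration of what "calculating an optimal synthetic path" could look like in a script, here is a minimal sketch in Python. It is not the method any of these chemists actually use; it simply applies a standard shortest-path algorithm (Dijkstra's) to a toy network. The compounds "A" to "D", the step costs, and the function name `cheapest_route` are all hypothetical.

```python
import heapq


def cheapest_route(network, start, target):
    """Dijkstra's algorithm: return (total cost, path), or (inf, []) if unreachable."""
    queue = [(0.0, start, [start])]
    best = {}  # lowest cost at which each compound has been expanded
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already expanded this compound at an equal or lower cost
        best[node] = cost
        for neighbor, step_cost in network.get(node, {}).items():
            heapq.heappush(queue, (cost + step_cost, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical network: compound -> {product: cost of that step}.
# The costs are made-up numbers standing in for, say, step count or reagent price.
network = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"C": 1.0, "D": 6.0},
    "C": {"D": 2.0},
    "D": {},
}

cost, path = cheapest_route(network, "A", "D")
print(cost, " -> ".join(path))
```

The hard part in a real setting would be assigning meaningful costs to each step, which is exactly where the data and the statistics come in.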

4 comments:

  1. Curious Wavefunction: I am a graduate student of synthetic organic chemistry, and have been learning more about the impact computers have and will have on scientific pursuits. I am seriously considering a switch away from organic hood chemistry to cheminformatics as a more future-valid career. Do you have any advice or speculation on the subject beyond the contents of the last paragraph?

    1. I would definitely recommend R and Python along with a cheminformatics toolkit like OEChem or RDKit. That combined with good basic statistical understanding should take care of 90% of cheminformatics problems I believe. Pipeline Pilot also has some great ways to combine cheminformatics functionality.

  2. It's not a binary "computers or bench" issue. Don't limit yourself: you can certainly learn both. I'm doing this, it's not easy but some things are worth the trouble. Math and computation can be beautiful, but so can "doing stuff with your hands." Just focus on finding what you're passionate about and then solve that with a combination of tools that works for you.

  3. Absolutely! It's just that this view which you are expounding is not yet as ingrained as it should be, especially in undergraduate and graduate education.

