Want to know if you are depressed? Don't ask Siri just yet.

"Tell me more about your baseline calibration, Siri"
There's no dearth of articles claiming that the "wearables revolution" is around the corner and that we aren't far from the day when every aspect of our health is recorded every second, analyzed and sent to the doctor for rapid diagnosis and treatment. That's why it was especially interesting for me to read this new analysis from computer scientists at Berkeley and Penn that should temper the soaring enthusiasm that riddles pretty much all things "AI" these days.

The authors are asking a very simple question in the context of machine learning (ML) algorithms that claim to predict your mood - and by proxy mental health issues like depression - based on GPS and other data. What's this simple question? It's one about baselines. When any computer algorithm makes a prediction, one of the key questions is how much better this prediction is compared to some baseline. Another name for baselines is "null models". Yet another is "controls", although controls themselves can be artificially inflated. 

In this case the baseline can be of two kinds: personal baselines (self-reported individual moods) or population baselines (the mood of a population). What the study finds is not too pretty. The authors analyze a variety of literature on mood-reporting ML algorithms and find that in about 77% of cases the studies use meaningless baselines that overestimate the performance of the ML models with respect to predicting mood swings. The reason is that the baselines used in most studies are population baselines rather than the more relevant personal baselines. The population baseline assumes a single constant average state shared by all individuals, while the personal baseline assumes a constant average state for each individual, with different averages for different individuals.
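To make the distinction concrete, here is a minimal sketch in Python. The mood scores, the user names and the helper `mse` are all made up for illustration and are not taken from the paper; the personal means are also computed on the same data they are scored against, which a real evaluation would avoid by using held-out data.

```python
from statistics import mean

# Hypothetical self-reported mood scores (say, on a 1-10 scale) for three users.
moods = {
    "ann": [2.0, 3.0, 2.0, 3.0],
    "bob": [5.0, 6.0, 5.0, 6.0],
    "cat": [8.0, 9.0, 8.0, 9.0],
}

def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

# Population baseline: one constant guess for everybody.
all_scores = [s for scores in moods.values() for s in scores]
population_mean = mean(all_scores)

y_true, population_pred, personal_pred = [], [], []
for user, scores in moods.items():
    personal_mean = mean(scores)  # personal baseline: one constant guess per user
    for s in scores:
        y_true.append(s)
        population_pred.append(population_mean)
        personal_pred.append(personal_mean)

population_mse = mse(y_true, population_pred)
personal_mse = mse(y_true, personal_pred)

# Fraction of the population-level variance "explained" by guessing
# each user's own average mood and nothing else.
variance_explained = 1 - personal_mse / population_mse

print(f"population baseline MSE: {population_mse:.2f}")
print(f"personal baseline MSE:   {personal_mse:.2f}")
print(f"variance explained by personal means: {variance_explained:.0%}")
```

In this toy data the users differ a lot from each other but barely change from day to day, so the do-nothing personal baseline already looks impressive when measured against the population baseline.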

Clearly doing better than the population baseline is not very useful for tracking individual mood changes, and this is especially true since the authors find greater errors for population baselines compared to individual ones; these larger errors can simply obscure model performance. The paper also considers two datasets and tries to figure out how to improve the performance of models on these datasets using a metric which the authors call "user lift" that determines how much better the model is compared to the baseline.
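A rough sketch of what a per-user comparison of this kind could look like follows. The exact definition of "user lift" is in the paper; the version here (personal-baseline error minus model error, computed separately for each user) is my assumption, and the function names, scores and model predictions are all hypothetical.

```python
from statistics import mean

def mean_abs_error(y_true, y_pred):
    """Mean absolute error between two equal-length lists."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])

def user_lift(train_scores, test_scores, model_preds):
    """Assumed definition: how much lower the model's error is than the
    error of always guessing the user's own average mood from past data.
    Positive means the model beats the personal baseline for this user."""
    personal_mean = mean(train_scores)
    baseline_preds = [personal_mean] * len(test_scores)
    baseline_err = mean_abs_error(test_scores, baseline_preds)
    model_err = mean_abs_error(test_scores, model_preds)
    return baseline_err - model_err

# Hypothetical user for whom the model helps.
lift_a = user_lift(
    train_scores=[4.0, 6.0, 5.0],
    test_scores=[4.0, 6.0],
    model_preds=[4.5, 5.5],
)

# Hypothetical user for whom the model is worse than the user's own average.
lift_b = user_lift(
    train_scores=[2.0, 2.0, 2.0],
    test_scores=[2.0, 3.0],
    model_preds=[3.0, 4.0],
)

print(lift_a, lift_b)
```

Looking at the lift user by user, rather than at one pooled score, shows for how many people the model actually adds something beyond "your mood is like it usually is".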

I will let the abstract speak for itself:

"A new trend in medicine is the use of algorithms to analyze big datasets, e.g. using everything your phone measures about you for diagnostics or monitoring. However, these algorithms are commonly compared against weak baselines, which may contribute to excessive optimism. To assess how well an algorithm works, scientists typically ask how well its output correlates with medically assigned scores. Here we perform a meta-analysis to quantify how the literature evaluates their algorithms for monitoring mental wellbeing. We find that the bulk of the literature (∼77%) uses meaningless comparisons that ignore patient baseline state. For example, having an algorithm that uses phone data to diagnose mood disorders would be useful. However, it is possible to explain over 80% of the variance of some mood measures in the population by simply guessing that each patient has their own average mood - the patient-specific baseline. Thus, an algorithm that just predicts that our mood is like it usually is can explain the majority of variance, but is, obviously, entirely useless. Comparing to the wrong (population) baseline has a massive effect on the perceived quality of algorithms and produces baseless optimism in the field. To solve this problem we propose “user lift” that reduces these systematic errors in the evaluation of personalized medical monitoring."

That statement about being able to explain over 80% of the variance of some mood measures simply by guessing each individual's own average mood should stand out. It means that a trivial guess based on a person's average "feeling" can look nearly as good as a model on paper, and yet it is eminently useless: it predicts no variability within a person and is therefore of little practical utility.

I find this paper important because it should put a dent in what is often inflated enthusiasm about wearables these days. It also illustrates the dangers of what is called "technological solutionism": simply because you can strap a watch or other device onto your body to measure various parameters, and simply because you have enough computing power to analyze the resulting stream of data, does not mean the results will be significant. You record because you can, you analyze because you can, you conclude because you can. What the authors find about tracking moods can apply to tracking other kinds of important variables like blood pressure and sleep duration. Every time the question must be: am I using the right baseline for comparison? And am I doing better than the baseline? Hopefully the authors can use larger and more diverse datasets and find out similar facts about other such metrics.

I also found this study interesting because it reminds me of a whole lot of valid criticism in the field of molecular modeling that we have seen over the last few years. One of the most important questions there is about null models. Whenever your latest and greatest FEP/MD/informatics/docking study is claimed to have done exceptionally well on a dataset, the first question should be: is it better than the null model? And have you defined the null model correctly to begin with? Is your model doing better than a simpler method? And if it's not, why use it, and why assign a causal connection between your technique and the relevant result?

In science there are seldom absolutes. Studies like this show us that every new method needs to be compared with what came before it. When old facts have already paved the way, new facts are compelled to do better. Otherwise they can create the illusion of doing well.
