Imagine being an amateur endurance athlete who competes. In this day and age, you might feel encouraged to monitor your performance and other metrics to work out how to improve and get over the line first. Below is data from a recent workout of mine (aggregated every kilometre), reporting averaged physical quantities X and Y for each data point. X is what you want to improve to get over the line first, or as fast as possible. I could tell you what X and Y stand for, but that would ruin the fun!
Putting this data through a linear regression (the hammer of data analysis) gives us a correlation coefficient of −0.60, meaning they are negatively correlated, and a coefficient of determination, also known as "R squared", of 0.36, meaning Y explains a decent chunk of the variation in X, but not all of it.^{1}
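If you want to reproduce that kind of analysis yourself, here is a minimal sketch in Python. The per-kilometre arrays below are made-up stand-ins, since the real workout table isn't reproduced here; the numbers are only there so the code runs.

```python
import numpy as np

# Hypothetical per-kilometre averages standing in for the real workout data:
# x plays the role of X, y the role of Y.
x = np.array([28.0, 31.5, 24.0, 22.5, 35.0, 21.0, 26.5, 33.0])
y = np.array([180.0, 160.0, 230.0, 250.0, 140.0, 260.0, 200.0, 150.0])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2           # R squared for a simple linear regression

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
```

With real data you would of course get the figures quoted above; with these invented numbers you just get a strongly negative r, since I built them to slope the same way.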
So our conclusion based on this data analysis would be: to improve X, we should be focussing on reducing Y. X represents cycling speed (in km/h) and Y represents power output (in watts), so we are saying that to improve cycling speed, we should be focussing on reducing power output. That isn't right at all; in fact, rudimentary physics tells us more power means more speed. So how did we end up with this conclusion?
There are no nefarious forces at work: I did not change bikes, apply the brakes beyond crossings, or traverse different road surfaces, nor did I cherry-pick this workout or synthesize it by combining multiple ones. In fact, I would say you would see this in a significant number of outdoor road bike workouts on Strava that have power meter data. We ended up here because we did not take into account confounding variables like the gradient (up/down hills) and wind.
You might think that this is a toy example, that this does not happen in real life, or that peer review would catch it, but that is not (always) the case. If you simplify the above data into 3 buckets (flat, uphill, downhill), it becomes an example of Simpson's paradox, "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined", and it does sometimes sneak through; see for example this article (the kidney stone treatment example was my first exposure to this phenomenon).
For amateur endurance athletes, bad use of statistics can have real consequences when data from races and/or workouts are incorrectly interpreted and fed into decisions, but that's for another time! Be wary next time your mate has "proved" that common sense/physics is wrong using a linear regression, or you see a headline that says "X causes cancer": often it is not as simple as that, or in fact the reverse is true… Statistics: here be dragons.

Later, when changing the table to a scatter plot, I realized there were some outliers (which, if you read on, makes sense, given where the data comes from!). Excluding the outliers (X outside [20, 30]), one arrives at a correlation coefficient of 0.079 and an R squared of 0.006, so more of a "neutral" result than an "opposite" one. However, one should not simply throw out outliers without knowing the context, and given the context, it would then make no sense to proceed, so the point still stands. ↩
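For the curious, the exclusion step is just a boolean mask; the arrays here are placeholders for the real data, so the printed r will not match the 0.079 quoted above.

```python
import numpy as np

# Placeholder per-kilometre values; a couple of points fall outside [20, 30].
x = np.array([25.0, 27.0, 48.0, 22.0, 11.0, 29.0])     # hypothetical speeds
y = np.array([190.0, 170.0, 95.0, 210.0, 280.0, 185.0])  # hypothetical powers

# Keep only the points whose X lies in [20, 30], then recompute r.
mask = (x >= 20) & (x <= 30)
x_kept, y_kept = x[mask], y[mask]
r = np.corrcoef(x_kept, y_kept)[0, 1]

print(f"kept {int(mask.sum())} of {x.size} points, r = {r:.3f}")
```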