Field Notes

When Not to Use Machine Learning in Hydrology

Figure (animation): a rainfall–runoff LSTM tracks the observations in-sample, then diverges past the end of the record. This is the post’s central point.

Most of what you read about machine learning in hydrology is a celebration. A new LSTM beats a conceptual model on a benchmark, a transformer fills a gap, and the conclusion writes itself: the field is being transformed. I find this strange, because I do both. I train rainfall–runoff LSTMs, and I build Markov-chain stochastic streamflow simulators. The two live on the same laptop. So let me write the post I never see — not why to use ML, but when not to.

This isn’t an argument against machine learning. It’s an argument against using it as the default. ML is a tool with a shape, and water problems have shapes too. When they don’t match, the elegant thing is to reach for something else.

1. When you have to extrapolate into the tail

The floods that matter are, by definition, the ones you haven’t seen. A 100-year or 500-year design flood lives in the tail of a distribution, often beyond the largest event in the record. Machine learning interpolates beautifully and extrapolates terribly: an LSTM has no notion of a physical upper bound or of how a heavy tail behaves outside its training range. A fitted frequency distribution (GEV, Log-Pearson III, or a generalised Pareto fit to peaks over threshold) is built for that tail. It makes the tail behaviour an explicit, checkable assumption precisely where you have no data, and for the GEV and generalised Pareto that assumption is backed by extreme-value theory. For design flows, low-flow statistics like the 7Q10, or any quantile past the record, I will take a defensible distribution over a neural network every time.
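To make the tail argument concrete, here is a minimal sketch of a return-level calculation. It swaps in the Gumbel distribution (the GEV with shape parameter zero) fitted by the method of moments, purely because that fits in a few lines of standard-library Python; a real study would more likely fit a full GEV or Log-Pearson III and report confidence intervals. The function names and the peak-flow values are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(annual_maxima):
    """Method-of-moments Gumbel fit; returns (location, scale)."""
    mean = statistics.fmean(annual_maxima)
    stdev = statistics.stdev(annual_maxima)
    scale = stdev * math.sqrt(6.0) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def return_level(location, scale, return_period_years):
    """Flow with annual non-exceedance probability 1 - 1/T (requires T > 1)."""
    p = 1.0 - 1.0 / return_period_years
    return location - scale * math.log(-math.log(p))


# Hypothetical annual peak flows in m^3/s (illustrative, not real gauge data).
peaks = [312.0, 455.0, 298.0, 610.0, 377.0, 520.0, 289.0, 701.0,
         433.0, 366.0, 480.0, 345.0, 590.0, 410.0, 330.0]

loc, scale = fit_gumbel_moments(peaks)
for T in (10, 100, 500):
    print(f"{T:>3}-year flood: {return_level(loc, scale, T):.0f} m^3/s")
```

The point is the shape of the calculation: the 500-year estimate comes from a stated distributional assumption that can be checked and argued over, not from whatever a network happens to output beyond its training range.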

2. When the record is short or non-stationary

ML is data-hungry. A great many gauges have fewer than twenty years of record, and the interesting ones are often non-stationary on top of that. A model with tens of thousands of parameters will happily memorise a short, drifting series and tell you nothing about the process. This is exactly where stochastic methods earn their keep: they encode structure (autocorrelation, seasonality, regime transitions, heavy-tailed pulses) and generate ensembles from limited records. My own streamflow work clusters flow patterns into states, walks a Markov chain over them, and adds kappa/GEV pulses so that a thousand synthetic realisations reproduce the extremes. You can’t get a thousand plausible futures out of a single deterministic forecast.
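The sketch below shows the general idea of a state-based Markov-chain generator; it is not the simulator described above. As simplifying assumptions, states are plain quantile bins rather than clustered patterns, flows are resampled from the observations in each state, and there are no kappa/GEV pulses. All names and the synthetic "observed" record are hypothetical.

```python
import math
import random
from bisect import bisect_right


def assign_states(flows, n_states=3):
    """Label each flow with a quantile-bin state (0 = lowest flows)."""
    ranked = sorted(flows)
    cuts = [ranked[(k * len(ranked)) // n_states] for k in range(1, n_states)]
    return [bisect_right(cuts, q) for q in flows]


def fit_transitions(states, n_states):
    """Estimate first-order transition probabilities from state counts."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    # Fallback for states with no observed outgoing transition.
    marginal = [states.count(k) / len(states) for k in range(n_states)]
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total for c in row] if total else marginal)
    return probs


def simulate_ensemble(flows, n_steps, n_traces, n_states=3, seed=0):
    """Generate synthetic traces by walking the chain and resampling flows."""
    rng = random.Random(seed)
    states = assign_states(flows, n_states)
    probs = fit_transitions(states, n_states)
    pools = [[q for q, s in zip(flows, states) if s == k] for k in range(n_states)]
    traces = []
    for _ in range(n_traces):
        s = rng.choice(states)  # start from an observed state
        trace = []
        for _ in range(n_steps):
            trace.append(rng.choice(pools[s]))
            s = rng.choices(range(n_states), weights=probs[s])[0]
        traces.append(trace)
    return traces


# Hypothetical 20-year monthly record with a seasonal cycle (not real data).
data_rng = random.Random(7)
observed = [data_rng.lognormvariate(3.0 + 0.5 * math.sin(2 * math.pi * m / 12), 0.4)
            for m in range(240)]
ensemble = simulate_ensemble(observed, n_steps=120, n_traces=1000)
print(len(ensemble), "synthetic traces of", len(ensemble[0]), "months each")
```

Because this version only resamples observed values, it can never produce a flow larger than the record maximum. That limitation is the reason a serious generator adds a parametric tail component.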

3. When the answer has to be defensible

A flood study that goes to a regulator, a utility, or a courtroom needs a mechanism a human can defend line by line. “The network said so” is not a defence. When conceptual and statistical models fail, they fail in legible ways: you can point at the routing, the loss function, the distribution, the assumption. Black boxes fail in illegible ones. If the deliverable is a decision someone will be held accountable for, interpretability isn’t a nice-to-have; it’s the product.

4. When the question is about risk, not a point

Ask yourself whether you actually want a number or a distribution. A vanilla LSTM gives you a point prediction; risk lives in the spread. You can get uncertainty out of ML — quantile regression, conformal prediction, deep ensembles — but it takes deliberate work, and many published pipelines skip it. A stochastic ensemble gives you the distribution natively: that’s the whole point of generating a thousand traces. If your decision is a probability of exceedance, start from a method that speaks in probabilities.
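As a small illustration of what "speaks in probabilities" means in practice, the sketch below reads an exceedance probability and a quantile directly off an ensemble. The ensemble here is a hypothetical stand-in drawn from a lognormal distribution; in a real workflow it would come from a stochastic generator. The function names and the threshold are assumptions made for the example.

```python
import math
import random


def exceedance_probability(traces, threshold):
    """Fraction of traces whose peak exceeds the threshold."""
    peaks = [max(trace) for trace in traces]
    return sum(peak > threshold for peak in peaks) / len(peaks)


def empirical_quantile(values, q):
    """Nearest-rank empirical quantile for 0 <= q <= 1."""
    ranked = sorted(values)
    index = min(len(ranked) - 1, max(0, math.ceil(q * len(ranked)) - 1))
    return ranked[index]


# Hypothetical ensemble: 1000 traces of 30 time steps each (not real data).
rng = random.Random(42)
traces = [[rng.lognormvariate(3.0, 0.8) for _ in range(30)] for _ in range(1000)]

threshold = 150.0  # hypothetical flow of concern, same units as the traces
peaks = [max(trace) for trace in traces]
print("P(peak > threshold):", exceedance_probability(traces, threshold))
print("95th percentile of trace peaks:", round(empirical_quantile(peaks, 0.95), 1))
```

Nothing here needs a separate uncertainty method: once the ensemble exists, the probability is a count.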

5. When a cheap physical model already knows the answer

Mass balance, hydraulic routing, a unit hydrograph — sometimes the physics is cheap, well understood, and sufficient. Throwing a GPU at a problem that a conservation equation solves in closed form is cost without insight.
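For a sense of how cheap "cheap" is, here is a minimal sketch of unit-hydrograph routing as a discrete convolution. It assumes the effective rainfall and the unit-hydrograph ordinates share one time step and that the ordinates sum to 1; the function name and the numbers are hypothetical.

```python
def route_unit_hydrograph(excess_rain, unit_hydrograph):
    """Convolve effective rainfall with unit-hydrograph ordinates."""
    n_out = len(excess_rain) + len(unit_hydrograph) - 1
    runoff = [0.0] * n_out
    for i, rain in enumerate(excess_rain):
        for j, ordinate in enumerate(unit_hydrograph):
            # Rain falling at step i leaves the catchment at step i + j.
            runoff[i + j] += rain * ordinate
    return runoff


# Hypothetical storm: effective rainfall depth per time step, in mm.
excess_rain = [10.0, 20.0, 5.0]
# Hypothetical unit hydrograph: fraction of each pulse leaving per step.
unit_hydrograph = [0.2, 0.5, 0.3]

runoff = route_unit_hydrograph(excess_rain, unit_hydrograph)
print("Runoff per step (mm):", runoff)
print("Volume in:", sum(excess_rain), "volume out:", sum(runoff))
```

Mass balance holds by construction whenever the ordinates sum to 1, and every term in the output can be traced back to a specific rainfall pulse.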

So when is it right?

Plenty of times, and I use it for them. Short-horizon forecasting and nowcasting, where there is signal and recent data. Spatial gap-filling and downscaling. Surrogate models that stand in for an expensive simulation you need to run a million times. And regionalisation across many catchments, where a single model can learn shared behaviour: my season-adaptive mixture-of-experts work routes 242 Indian catchments and thousands of drought events through specialised predictors, because that is a pattern-rich, data-rich problem with the right shape.

The honest position is hybrid. Use ML where it is strong and statistics or physics where they are strong, and be specific about which is which. The useful question is never “ML or not.” It is: what is the smallest model that answers this question with uncertainty I can defend? Often that is a neural network. Often it is a probability distribution that has been doing the job since before any of us were born. Knowing the difference is the skill.