Why we avoided ML for behavioral anomaly detection

At MVP scale, per-household ML models would be underfit and hard to interpret. Why statistical baselines were the right call for a safety-critical eldercare system.

Anomaly Detection · Production AI · Systems Design
2025-06-01·3 min read

When building Perch, a passive behavioral monitoring platform for elderly care, one of the earliest architectural decisions was whether to reach for machine learning.

The answer was no. Here's why that was the right call, and when that answer changes.

The temptation

The problem looks like an ML problem on the surface. You have time-series sensor data, you want to detect anomalies, and anomaly detection is a well-studied ML domain. LSTM autoencoders, isolation forests, and one-class SVMs exist precisely for this kind of task.

But framing a problem as an ML problem because ML techniques exist for it is a category error.

The constraints that changed the answer

Low per-household data volume. At MVP scale, each household generates a sparse stream of motion and door events. You might see dozens of events per day: enough to build a statistical picture, but not enough to train a model that generalizes.

High variance between households. A retired person who sleeps until 9am looks nothing like a retired person who's up at 5am. Any model trained across households would need to disentangle between-person variance from within-person anomaly. That's a harder problem than it looks, and it requires more labeled data than we had.

Interpretability is a product requirement, not a nice-to-have. When an alert fires (for example, "no motion detected for 6 hours"), a caregiver needs to understand why. A deviation from a rolling baseline can be explained in plain language: this hour usually has activity, and today it had none. A neural network's reconstruction error cannot. In safety-critical systems, interpretability isn't optional.

False negatives carry real cost. An underfit model trained on sparse data will miss real anomalies. In eldercare, a missed inactivity event can mean that someone who needs help doesn't get it in time. Statistical baselines with explicit thresholds give you direct control over sensitivity in a way that an opaque model score doesn't.
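To make "direct control over sensitivity" concrete, here is a minimal sketch of how a single threshold changes what gets flagged. It assumes a z-score style rule (deviation from the expected count, divided by the standard deviation); the post does not specify the rule, and the function name and all numbers are hypothetical.

```python
def flagged(counts, expected, std, z_threshold):
    """Return the observed counts that deviate by at least z_threshold
    standard deviations from the expected value (assumed z-score rule)."""
    return [c for c in counts if abs(c - expected) / std >= z_threshold]


# Hypothetical hourly motion counts and a hypothetical baseline for that hour.
observed = [0, 1, 3, 4, 5, 9]
expected, std = 4.0, 1.0

# A lower threshold flags more hours (fewer misses, more false alarms);
# a higher threshold flags fewer.
for z in (1.0, 3.0, 5.0):
    print(z, flagged(observed, expected, std, z))
```

The trade-off is visible and adjustable with one number, which is hard to say about a learned anomaly score.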

What we built instead

Each household builds its own baseline from a rolling 14-day window. The baseline captures expected motion density per hour-of-day bucket, average wake time, average sleep time, and door event frequency.
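As an illustration, the hour-of-day part of such a baseline might be computed along these lines. This is a sketch under stated assumptions, not Perch's actual code: the function name, the input format (a list of motion event timestamps), and the choice of mean plus population standard deviation are all hypothetical. Only the 14-day window and the hour-of-day bucketing come from the description above.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev

WINDOW_DAYS = 14  # rolling window length described in the text


def hourly_baseline(event_times, now, window_days=WINDOW_DAYS):
    """Return {hour: (mean, std)} of motion events per hour-of-day.

    Uses the `window_days` complete calendar days before `now`.
    A day with no events in a given hour contributes a zero.
    """
    today = now.date()
    days = [today - timedelta(days=d) for d in range(1, window_days + 1)]

    counts = defaultdict(int)  # (date, hour) -> number of events
    for t in event_times:
        counts[(t.date(), t.hour)] += 1

    baseline = {}
    for hour in range(24):
        per_day = [counts[(day, hour)] for day in days]
        baseline[hour] = (mean(per_day), pstdev(per_day))
    return baseline


# Hypothetical usage: two motion events shortly after 08:00 on each of
# the 14 days before the evaluation time.
now = datetime(2025, 6, 15, 12, 0)
events = [
    datetime(2025, 6, 15, 8, 0) - timedelta(days=d) + timedelta(minutes=m)
    for d in range(1, WINDOW_DAYS + 1)
    for m in (0, 10)
]
baseline = hourly_baseline(events, now)
```

Wake time, sleep time, and door event frequency could be added to the same structure; they are left out here to keep the sketch short.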

The evaluator runs every 5 minutes and flags deviations beyond a configurable threshold. Every alert is accompanied by an evidence snapshot: the raw sensor events that triggered it.
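A rough sketch of one evaluation step is shown below. The names (`Alert`, `evaluate_hour`), the z-score rule, the `min_std` floor, and the event format are assumptions made for illustration; the post only states that deviations beyond a configurable threshold are flagged and that alerts carry the raw events as evidence.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    reason: str                                   # plain-language explanation
    evidence: list = field(default_factory=list)  # raw events in the window
    last_seen: tuple = None                       # last event before the window


def evaluate_hour(events, baseline, hour_start, z_threshold=3.0, min_std=0.5):
    """Compare one completed hour against the baseline.

    events:   list of (timestamp, sensor_id) tuples
    baseline: {hour: (mean, std)} of expected events per hour-of-day
    Returns an Alert if the deviation reaches z_threshold, else None.
    """
    hour_end = hour_start + timedelta(hours=1)
    in_hour = [e for e in events if hour_start <= e[0] < hour_end]
    expected, std = baseline[hour_start.hour]

    # Floor the std so a perfectly regular hour does not divide by zero.
    deviation = (len(in_hour) - expected) / max(std, min_std)
    if abs(deviation) < z_threshold:
        return None

    prior = [e for e in events if e[0] < hour_start]
    reason = (
        f"{len(in_hour)} motion events between {hour_start:%H:%M} and "
        f"{hour_end:%H:%M}; the baseline expects about {expected:.1f}"
    )
    return Alert(reason=reason, evidence=in_hour,
                 last_seen=max(prior, default=None))


# Hypothetical usage: a household that normally shows about 4 events per hour.
baseline = {h: (4.0, 1.0) for h in range(24)}
events = [(datetime(2025, 6, 1, 7, 30), "hallway")]
alert = evaluate_hour(events, baseline, datetime(2025, 6, 1, 9, 0))
```

For an inactivity alert the evidence list is empty by definition, which is why this sketch also records the last event seen before the window. A scheduler would call something like this every 5 minutes.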

This is deliberately simple. The simplicity is the point.

When ML becomes the right answer

Statistical baselines aren't the right answer forever. ML becomes appropriate when:

  • Per-household data volume grows to a point where individual models can be properly fit
  • Cross-household patterns become learnable and generalization is validated
  • The product has evaluation infrastructure to safely validate model behavior before deployment
  • The interpretability requirement can be met through model-specific techniques like SHAP or attention visualization

The mistake isn't using ML. The mistake is using ML before the data and evaluation infrastructure can support it.

The broader pattern

Production AI systems are often more about data quality, reliability, and evaluation than models. The model is usually the easiest part. The hard parts are: getting clean data, handling edge cases, building evaluation that catches regressions, and making the system trustworthy enough for real-world use.

In safety-critical domains especially, start with the simplest thing that works and build the evaluation infrastructure to know when you can safely add complexity.