Where generalist AI struggles
False negatives
Pools in deep shadow, or partially concealed by overhanging vegetation or roof structures, led to false negatives: Gemini 3.1 Pro and both Claude models still missed them. Nearmap AI Gen 6 caught them, because it is trained on purpose-captured aerial imagery across millions of properties. That training calibrates the model even for these edge cases.
False positives
Lawns, certain pool covers and light-coloured paving led to false positives: Gemini 3.1 Pro and both Claude models flagged several as pools. Claude Fable 5, Anthropic’s most powerful model, showed the highest false positive rate of any model tested. Nearmap AI Gen 6 did not flag them. The visual signatures that distinguish these surfaces in aerial imagery are precise, and they are learned from millions of labelled examples captured from a consistent vantage point. They can’t be inferred from the general-purpose training that the other models rely on.
The root cause is the same in both directions. Foundation models reason flexibly across an enormous range of inputs, and that flexibility is their standout feature. But it also means they haven’t been calibrated for the specific visual vocabulary of aerial imagery at 5–7.5 cm resolution.
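For readers who want to pin down the two error types discussed above, here is a minimal sketch of how false negative and false positive rates are conventionally computed from labelled results. The function names and the counts in the usage lines are hypothetical and purely illustrative; they are not figures from our testing.

```python
def false_negative_rate(missed_pools: int, actual_pools: int) -> float:
    """Share of real pools the model failed to detect: FN / (FN + TP)."""
    return missed_pools / actual_pools


def false_positive_rate(false_alarms: int, actual_non_pools: int) -> float:
    """Share of pool-free properties wrongly flagged: FP / (FP + TN)."""
    return false_alarms / actual_non_pools


# Hypothetical counts, chosen only to show the arithmetic.
fnr = false_negative_rate(missed_pools=8, actual_pools=200)
fpr = false_positive_rate(false_alarms=12, actual_non_pools=800)
print(f"False negative rate: {fnr:.1%}")
print(f"False positive rate: {fpr:.1%}")
```

The two rates use different denominators, which is why a model can do well on one and poorly on the other.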
The throughput reality
Accuracy alone doesn’t determine whether AI is operationally viable; speed matters too. In our testing, Claude Opus 4.8 averaged 23 seconds per property. Claude Fable 5 averaged 49 seconds, more than double, due to its default reasoning mode. Gemini 3.1 Pro, the most accurate generalist model, was also the slowest at 98 seconds per property. All are rate-limited to a handful of properties per second at scale.
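To put those per-property averages in context, the rough sketch below converts them into wall-clock time for a whole portfolio. It assumes properties are processed strictly one at a time, which is a simplifying baseline rather than how a production pipeline would run, and the portfolio size of 100,000 is a hypothetical figure chosen only for illustration.

```python
# Average seconds per property, as reported in the testing above.
LATENCY_S = {
    "Claude Opus 4.8": 23,
    "Claude Fable 5": 49,
    "Gemini 3.1 Pro": 98,
}


def sequential_hours(n_properties: int, seconds_per_property: float) -> float:
    """Wall-clock hours to process n_properties one after another."""
    return n_properties * seconds_per_property / 3600


# Hypothetical portfolio size (assumption, not from the testing).
PORTFOLIO = 100_000

for model, latency in LATENCY_S.items():
    hours = sequential_hours(PORTFOLIO, latency)
    print(f"{model}: {hours:,.0f} hours (~{hours / 24:,.0f} days)")
```

Under these assumptions, the sequential estimate ranges from roughly 27 days for the fastest model to roughly 113 days for the slowest. Running requests in parallel shortens this, but only up to the provider's rate limit.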