Nearmap AI vs Claude Fable: Benchmark Results

Jun 2026

Dr. Michael Bewley, Vice President, AI

As foundation AI models have matured, a fair question has emerged across the property intelligence industry: if Gemini, GPT, and Claude can reason about almost anything, do purpose-built AI models still compare?

Nearmap ran the test.

The benchmark: pool detection as a real-world proxy

Pool detection is a real-world property intelligence task, one used across industries to make decisions about risk, compliance, and asset value.

Nearmap AI Data Layers gif of US swimming pool detections

Detecting a specific object class at scale is challenging given the diversity of aerial conditions: varying light, shadow, occlusion, and differing surface materials. We tested four models on the same aerial imagery dataset and, in June 2026, added two more when Anthropic released Claude Opus 4.8 and Claude Fable 5. We evaluated performance using F1 score, a standard metric that balances precision (avoiding false positives) and recall (avoiding false negatives) by taking their harmonic mean.

Both errors carry real-world cost. A false negative is a missed feature: a gap in the data on which a decision is made. A false positive is a misleading one: wasted time, inaccurate outputs, or eroded trust in the model.
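To make the metric concrete, here is a minimal Python sketch of how precision, recall, and F1 could be computed from per-property pool labels. The function names and the example labels are hypothetical and chosen for illustration only; they are not the evaluation code or data used in the Nearmap study.

```python
def count_outcomes(truth: list[bool], predicted: list[bool]) -> tuple[int, int, int]:
    """Count true positives, false positives, and false negatives."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must be the same length")
    tp = sum(t and p for t, p in zip(truth, predicted))        # pool present, pool detected
    fp = sum((not t) and p for t, p in zip(truth, predicted))  # no pool, pool detected
    fn = sum(t and (not p) for t, p in zip(truth, predicted))  # pool present, pool missed
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical labels for five properties (True = pool present).
truth = [True, True, True, False, False]
predicted = [True, False, True, True, False]

tp, fp, fn = count_outcomes(truth, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a model cannot score well by trading one error type for the other: a high false positive rate pulls precision down, and a high false negative rate pulls recall down, and either drags the combined score with it.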

The results

All generalist models were tested on high-resolution Nearmap imagery, with maximum thinking mode enabled and a detailed prompt-level definition of a swimming pool. Same conditions, same dataset, same prompt. The gap held across every model.

F1 score on pool presence

Methodology note: Results are based on Nearmap internal testing. The original study was conducted in March 2026 across 2,500 US residential properties. In June 2026, the study was updated under an identical protocol to include Claude Opus 4.8 and Claude Fable 5. Fable 5 was evaluated during its availability window prior to suspension under a US government export control directive. Performance may vary based on imagery conditions, dataset composition, and model version.

Where generalist AI struggles

False negatives

We found that pools in deep shadow, or partially concealed by overhanging vegetation or roof structures, led to false negatives: Gemini 3.1 Pro and both Claude models missed them. Nearmap AI Gen 6 caught them because it is trained on purpose-captured aerial imagery across millions of properties, which helps calibrate it even for edge cases.

False positives

Lawns, certain pool covers, and light-coloured paving: Gemini 3.1 Pro and both Claude models flagged several of these as pools. Claude Fable 5, Anthropic’s most powerful model, showed the highest false positive rate of any model tested. Nearmap AI Gen 6 did not flag them. The visual signatures that distinguish these surfaces in aerial imagery are precise, and they are learned from millions of labelled examples captured from a consistent vantage point rather than inferred from the general language training that the other models rely on.

The root cause is the same in both directions. Foundation models reason flexibly across an enormous range of inputs. That flexibility is their standout feature, but it also means they haven’t been calibrated for the specific visual vocabulary of aerial imagery at 5–7.5cm resolution.

The throughput reality

Accuracy alone doesn’t determine whether AI is operationally viable; speed matters too. In our testing, Claude Opus 4.8 averaged 23 seconds per property. Claude Fable 5 averaged 49 seconds, more than double, due to its default reasoning mode. Gemini 3.1 Pro, the most accurate generalist model, was also the slowest at 98 seconds per property. All are rate-limited to a handful of properties per second at scale.

Nearmap AI Gen 6 operates at sub-second inference, processing more than 1,000 properties per second.
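As a rough illustration of what those per-property latencies imply, the sketch below estimates how long a portfolio the size of the original study (2,500 properties) would take. It assumes strictly one-at-a-time processing for the generalist models, which is a simplification introduced here: real deployments can issue requests in parallel up to the provider's rate limit, so actual wall-clock time would differ. The function and variable names are hypothetical.

```python
PROPERTIES = 2_500  # size of the original study, per the methodology note

# Average seconds per property reported in the testing above.
SECONDS_PER_PROPERTY = {
    "Claude Opus 4.8": 23,
    "Claude Fable 5": 49,
    "Gemini 3.1 Pro": 98,
}

SPECIALIST_PROPERTIES_PER_SECOND = 1_000  # "more than 1,000", so treat as a floor


def sequential_hours(seconds_per_property: float, n_properties: int) -> float:
    """Hours needed if properties are processed one at a time (assumption)."""
    return seconds_per_property * n_properties / 3600


for model, seconds in SECONDS_PER_PROPERTY.items():
    print(f"{model}: {sequential_hours(seconds, PROPERTIES):.1f} h")

# At a floor of 1,000 properties per second, this is an upper bound in seconds.
specialist_upper_bound_seconds = PROPERTIES / SPECIALIST_PROPERTIES_PER_SECOND
print(f"Nearmap AI Gen 6: at most {specialist_upper_bound_seconds:.1f} s")
```

Under the sequential assumption this works out to roughly 16, 34, and 68 hours for the three generalist models, against a few seconds for the specialist model. Parallel requests would shorten the generalist figures, but only as far as the rate limit allows.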

What does that mean for your industry?

AI for Insurance

For property insurers, pool detection accuracy is a risk pricing issue. A missed pool is a gap in your risk view, or a policy priced without full property information. A falsely flagged one triggers a wasted inspection workflow or an inaccurate quote sent to a customer. Nearmap AI Gen 6’s 98.5% F1 score, at more than 1,000 properties per second, means more accurate renewals at portfolio scale, with less manual review overhead.

Specialist vs. generalist: a question of fit

Foundation models are powerful, and we see them improving daily. For tasks requiring flexible reasoning across unstructured inputs, like document extraction, open-ended analysis, and multimodal interpretation of varied content, general-purpose AI is often the right tool.

For defined property intelligence tasks, the ceiling for specialist models is set by three things:

The clarity of the task definition

The quality of the training data

The stability of the model over time

Nearmap owns and controls all three. Nearmap AI is trained on proprietary imagery captured specifically for property intelligence, with consistent resolution, consistent geometry, and multiple captures per year. It updates on a controlled release cycle, and each release is tested against property intelligence benchmarks before it ships.

A version change for a foundation model optimised for general capability can shift performance on a specific task significantly, in either direction, with no changelog entry covering your use case. Teams building workflows on general-purpose AI inherit that unpredictability.

In June 2026, this stopped being theoretical. Claude Fable 5 was released and suspended within 72 hours under a US government directive. Any workflow built on that model had no warning, no changelog entry, and no recourse. Purpose-built AI on a controlled release cycle doesn’t carry that risk.

What to ask when evaluating AI for property intelligence

If you’re evaluating AI for property intelligence, the numbers above are the conversation worth having. Here’s a checklist to help you procure the solution that works in line with your strategy:

Ask for F1 scores (with precision and recall) against a validated ground truth, and ask which model ranked highest

Ask for demonstrated throughput at portfolio scale

Ask what the training data looked like, and whether it was captured for this purpose or assembled from general sources

Ask if it can identify the class of assets that matter most to your work

The answers will tell you whether you’re looking at purpose-built property intelligence, or a capable general model working outside its area of calibration. For decisions that affect underwriting accuracy, risk assessment, and CAT response, that difference is worth understanding before you build it into your workflow.

Disclaimer: This article is for informational purposes. Results reflect Nearmap internal testing and may vary by use case, dataset, and imagery conditions. Readers are encouraged to conduct their own evaluation before making procurement decisions.
