Nearmap AI vs Gemini: What the benchmark data shows
Mar 2026
Dr. Michael Bewley, Vice President, AI
As foundation AI models have matured, a fair question has emerged across the property intelligence industry: if Gemini, GPT, and Claude can reason about almost anything, do purpose-built AI models still matter?
Nearmap ran the test.
The benchmark: pool detection as a real-world proxy
Pool detection is a real-world property intelligence task. Insurers use it to assess risk and price renewals accurately. Local governments use it to identify undeclared pools, verify permits, and ensure properties are assessed at the right rate.
Detecting a specific object class at scale is challenging given the diversity of aerial conditions: varying light, shadow, occlusion, and differing surface materials. We tested three models on the same aerial imagery dataset, evaluating performance using F1 score: a standard metric that balances precision (avoiding false positives) and recall (avoiding false negatives).
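As a concrete reference, the sketch below shows how precision, recall, and F1 combine for a binary detection task like pool presence. The counts are illustrative placeholders, not results from this benchmark.

    # Minimal sketch: precision, recall, and F1 for binary detection.
    # Counts are illustrative placeholders, not benchmark results.
    def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
        precision = true_positives / (true_positives + false_positives)  # avoid false positives
        recall = true_positives / (true_positives + false_negatives)     # avoid false negatives
        return 2 * precision * recall / (precision + recall)

    # Example: 90 pools found correctly, 10 non-pools flagged, 20 pools missed
    print(round(f1_score(90, 10, 20), 3))  # 0.857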
Both errors carry cost for a range of industries:
In property insurance, a missed pool is a gap in your risk view. A falsely flagged one is a wasted inspection workflow, or an inaccurate quote.
For local government, a missed development is a gap in your compliance picture. A falsely flagged one is a wasted site investigation or an inaccurate rates assessment.
The results
Gemini was tested using high-resolution Nearmap imagery, with maximum thinking mode enabled and a detailed prompt-level definition of a swimming pool. No time pressure, and the gap still held.
F1 score on pool presence
Methodology note: Results are based on Nearmap internal testing conducted March 2026, using an aerial imagery dataset of 2,500 residential properties from the USA, hand-labelled and reviewed by human experts, containing 158 swimming pools. All models were tested against the same dataset under the conditions described. This testing was conducted for internal evaluation purposes. Performance may vary based on imagery conditions, dataset composition, and model version.
Where generalist AI struggles
False negatives
We found that pools in deep shadow, or partially concealed by overhanging vegetation or roof structures, produced false negatives: Gemini missed them. Nearmap AI Gen 6 caught them, because it is trained on purpose-captured aerial imagery across millions of properties, which calibrates it for exactly these edge cases.
False positives
Lawns, certain pool covers, light-coloured paving: Gemini flagged several as pools. Nearmap AI Gen 6 did not. The visual signatures that distinguish these surfaces in aerial imagery are precise, learned from millions of labelled examples captured from a consistent vantage point. They can't be inferred from the general-purpose training that foundation models rely on.
The root cause is the same in both directions. Foundation models reason flexibly across an enormous range of inputs. That flexibility is their standout feature. But it also means they haven't been calibrated for the specific visual vocabulary of aerial imagery at 5–7.5 cm resolution.
The throughput reality
Accuracy alone doesn't determine whether AI is operationally viable; speed matters too. In our testing, Gemini took more than seven seconds to process each property, and at scale it is rate-limited to a handful of images per second.
Nearmap AI Gen 6 operates at sub-second inference, processing more than 1,000 properties per second.
For an insurer assessing a portfolio of 100,000 properties, that is not a trivial difference. Accuracy that can't operate at portfolio scale isn't property intelligence; it's a proof of concept.
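To make the scale concrete, here is a rough back-of-envelope timing sketch using the figures above. The ~5 images-per-second rate limit is an assumption standing in for "a handful of images per second", and real throughput will depend on batching and infrastructure.

    # Back-of-envelope portfolio timing using the figures quoted above.
    # The 5 images/second rate limit is an assumed stand-in value.
    PORTFOLIO = 100_000  # properties

    sequential_hours = PORTFOLIO * 7 / 3600     # ~7 s per property, one at a time
    rate_limited_hours = PORTFOLIO / 5 / 3600   # ~5 images/s under a rate limit
    specialist_seconds = PORTFOLIO / 1000       # ~1,000 properties/s

    print(f"Sequential generalist:   {sequential_hours:,.0f} hours")    # ~194 hours
    print(f"Rate-limited generalist: {rate_limited_hours:,.1f} hours")  # ~5.6 hours
    print(f"Specialist model:        {specialist_seconds:,.0f} seconds")  # ~100 seconds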
Specialist vs. generalist: a question of fit
Foundation models are powerful, and we see them improving daily. For tasks requiring flexible reasoning across unstructured inputs, like document extraction, open-ended analysis, and multimodal interpretation of varied content, general-purpose AI is often the right tool.
For defined property intelligence tasks, the ceiling for specialist models is set by three things:
1. The clarity of the task definition
2. The quality of the training data
3. The stability of the model over time
Nearmap owns and controls all three. Nearmap AI is trained on proprietary imagery captured specifically for property intelligence, with a consistent resolution, consistent geometry, and multiple captures per year. It updates on a controlled release cycle tested against property intelligence benchmarks before anything ships.
A version change for a foundation model optimized for general capability can shift performance on a specific task significantly, in either direction, with no changelog entry covering your use case.
Teams building workflows on general-purpose AI inherit that unpredictability. The performance of a specialist model is consistent because the scope is defined.
What to ask when evaluating AI for property intelligence
If you're evaluating AI for property intelligence, the numbers above frame the conversation worth having. Here's a checklist to help you procure a solution that aligns with your strategy:
The answers will tell you whether you’re looking at purpose-built property intelligence, or a capable general model working outside its area of calibration. For decisions that affect underwriting accuracy, risk assessment, and CAT response, that difference is worth understanding before you build it into your workflow.
Disclaimer: This article is for informational purposes. Results reflect Nearmap internal testing and may vary by use case, dataset, and imagery conditions. Readers are encouraged to conduct their own evaluation before making procurement decisions.