| Metric | Nearmap AI Gen 6 | Claude Opus 4.8 | Claude Fable 5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Roof area (mean absolute error) | 12 sqm | 66 sqm | 70 sqm | 45 sqm |
| Roof count (exact match) | 80% | 62% | 64% | 70% |
| Roof condition (RSI mean absolute error) | reference | 18 pts | 15 pts | 12 pts |
| Pool presence (F1) | 98.5% | 80% | 77% | 86% |
| Model | Pool detection variation | Roof count variation |
|---|---|---|
| Nearmap AI Gen 6 | 0% | 0% |
| Claude Opus 4.8 | 0.6% | 7.6% |
| Claude Fable 5 | 3.0% | 18.1% |
| Gemini 3.1 Pro | 3.0% | 20.8% |
Can you provide F1 scores against a validated ground truth dataset specific to the feature class we need to detect? What are the precision and recall figures separately?
In this study: 98.5% (Nearmap AI Gen 6) vs. 87% (Gemini 3.1 Pro), 80% (Claude Opus 4.8), and 77% (Claude Fable 5) on the same dataset. (see Section 3.1)
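For readers who want to check a vendor's separately reported precision and recall against a quoted F1, the relationship can be sketched as below. This is a minimal illustration using the standard definitions; the function name `precision_recall_f1` and the counts are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts for one feature class."""
    # Precision: of everything the model flagged, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually present, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not data from the study).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a high precision can mask a much lower recall (or the reverse), which is why the question asks for both figures separately.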
Was your training data purpose-captured for aerial property intelligence, or assembled from general image sources? How consistent are the imaging geometry, resolution, and sensor characteristics across your training set?
Purpose-captured imagery at consistent resolution drove the edge-case performance gap. (see Section 4.1)
What are your demonstrated processing rates at portfolio scale, not sample scale? At what point do throughput constraints or rate limits become operationally prohibitive?
1,000+ properties per second (Nearmap AI Gen 6) vs. 23 seconds per property (Claude Opus 4.8), 49 seconds (Claude Fable 5), and 98 seconds (Gemini 3.1 Pro). At 100,000 properties, the difference is orders of magnitude. (see Section 3.4)
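The scale of that gap can be worked through with a few lines of arithmetic. The sketch below uses the per-property figures quoted above and makes two assumptions that are not from the study: "1,000+ properties per second" is treated as exactly 1,000 per second, and each model processes properties strictly one after another with no parallelism or rate-limit delays.

```python
PORTFOLIO_SIZE = 100_000

# Seconds per property, from the figures quoted above.
# Assumption: "1,000+ per second" is taken as exactly 1,000 per second.
SECONDS_PER_PROPERTY = {
    "Nearmap AI Gen 6": 1 / 1000,
    "Claude Opus 4.8": 23,
    "Claude Fable 5": 49,
    "Gemini 3.1 Pro": 98,
}


def portfolio_seconds(seconds_per_property: float, n: int = PORTFOLIO_SIZE) -> float:
    """Total processing time in seconds, assuming strictly sequential processing."""
    return seconds_per_property * n


for model, spp in SECONDS_PER_PROPERTY.items():
    total = portfolio_seconds(spp)
    print(f"{model}: {total:,.0f} s (~{total / 86_400:.1f} days)")
```

Under these assumptions, 100,000 properties take about 100 seconds at 1,000 per second, versus roughly 27, 57, and 113 days at 23, 49, and 98 seconds per property. Parallel requests would shorten the latter figures, subject to whatever rate limits apply.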
How are model updates managed and tested against property intelligence benchmarks before release? How are performance changes communicated, and what recourse exists if an update degrades performance on our specific use case?
Foundation model updates may degrade task-specific performance without appearing in a relevant changelog. (see Section 4.2)
What is the model’s performance on edge cases relevant to our dataset: shadow, occlusion, surface variation, and regional imagery characteristics? Can you provide error distribution data, not just aggregate scores?
False negatives concentrated in shadow, occlusion, and non-standard finishes. (see Section 3.2)
Does the model guarantee identical outputs for identical inputs across repeated runs?
Commercial VLMs do not: the same prompt and the same image can produce different results. (see Section 3.5)
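One way a buyer could test this claim is to submit the same inputs several times and count how many produce more than one distinct output. The sketch below is an assumption about how such a check might be set up, not the methodology used in the study; `variation_rate`, `deterministic_model`, and `flaky_model` are hypothetical names, and the two stand-in models simply simulate stable and unstable behavior.

```python
import itertools
from typing import Callable, Hashable, Iterable


def variation_rate(
    model: Callable[[str], Hashable],
    inputs: Iterable[str],
    runs: int = 5,
) -> float:
    """Fraction of inputs whose output differs at least once across repeated runs."""
    items = list(inputs)
    if not items:
        return 0.0
    unstable = 0
    for item in items:
        # Collect the distinct outputs seen for this one input.
        outputs = {model(item) for _ in range(runs)}
        if len(outputs) > 1:
            unstable += 1
    return unstable / len(items)


# Hypothetical stand-ins for a real detection call.
def deterministic_model(image_id: str) -> bool:
    return image_id.endswith("pool")


_flip = itertools.cycle([True, False])


def flaky_model(image_id: str) -> bool:
    # Alternates its answer regardless of input, to simulate non-determinism.
    return next(_flip)


images = ["a_pool", "b_roof", "c_pool"]
print(variation_rate(deterministic_model, images))
print(variation_rate(flaky_model, images))
```

In practice the `model` argument would wrap the vendor's API call with a fixed prompt and fixed image, and the output would be whatever value feeds the downstream decision (for example, a pool flag or a roof count).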
What are the cost implications of model updates, new feature attributes, or vendor pricing changes over time? How does the total cost of ownership compare when accuracy errors (missed features, false detections) are factored into downstream operational costs?
At portfolio scale, low aggregate error rates accumulate into material underwriting exposure and operational cost. (see Section 5.1)
What audit trail does the model provide for its outputs? For general-purpose foundation models, how is accountability established when a detection error affects a regulated decision? What documentation exists to support compliance or dispute resolution?
Output consistency and model versioning directly affect auditability. (see Sections 3.5, 4.2)