Model Rankings

Leaderboard

Comprehensive rankings of AI models based on their performance in geographic location and temporal prediction tasks.

| # | Model | Provider | Loc. Score | Year Score | Year MAE (yrs) | Avg. Dist. (km) | Perfect Loc. | Perfect Year | Avg. Time (s) | Images |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gemini-2.5-pro | google | 9587 | 9925 | 1.8 | 8.6 | 14 | 4 | 19.96 | 30 |
| 2 | gpt-5 | openai | 9364 | 9834 | 2.6 | 11.3 | 13 | 2 | 43.02 | 30 |
| 3 | gpt-4o | openai | 9241 | 9767 | 2.4 | 154.3 | 13 | 1 | 9.22 | 30 |
| 4 | gpt-5-mini | openai | 9015 | 9796 | 2.8 | 568.3 | 10 | 1 | 20.30 | 30 |
| 5 | mistral-medium-3.1 | mistralai | 8779 | 9321 | 2.8 | 754.8 | 9 | 0 | 11.83 | 28 |
| 6 | claude-opus-4 | anthropic | 8718 | 9219 | 4.0 | 236.9 | 9 | 0 | 13.76 | 29 |
| 7 | qwen2.5-vl-72b-instruct (open source) | qwen | 8716 | 9562 | 3.8 | 708.8 | 14 | 1 | 19.55 | 27 |
| 8 | gemma-3-27b-it (open source) | google | 8698 | 9241 | 3.3 | 537.8 | 7 | 0 | 7.95 | 30 |
| 9 | gemini-2.5-flash-lite | google | 8596 | 9447 | 3.8 | 650.3 | 7 | 0 | 5.04 | 30 |
| 10 | claude-sonnet-4 | anthropic | 8583 | 9155 | 3.6 | 589.6 | 12 | 0 | 11.35 | 30 |
| 11 | gemini-2.0-flash-001 | google | 8532 | 9005 | 3.6 | 678.4 | 10 | 0 | 4.49 | 30 |
| 12 | llama-4-maverick (open source) | meta-llama | 8349 | 8900 | 5.3 | 715.4 | 8 | 0 | 5.48 | 30 |
| 13 | gpt-5-nano | openai | 8162 | 9200 | 5.7 | 955.8 | 7 | 0 | 17.11 | 30 |
| 14 | mistral-small (open source) | mistral | 8137 | 8752 | 5.8 | 737.0 | 6 | 0 | 6.99 | 30 |

Performance Analysis

Data Visualization
- Top Models Average (Descending): average location and year scores of top models (stacked)
- Location vs Year Accuracy: trade-offs between geographic and temporal prediction accuracy
- Accuracy vs Speed: trade-off between model accuracy and generation time
- Year Prediction Errors: distribution of year prediction accuracy across models
- Multi-Metric Comparison (Top 5 Models): comprehensive performance analysis across all key metrics
- Perfect Predictions: comparison of perfect location and year predictions
- Generation Time Analysis: average processing time across different models
Key Insights
Notable patterns from the benchmark results

Top Performers

gemini-2.5-pro leads in both geographic accuracy (location score 9587) and temporal precision (year score 9925, with a mean year error of only 1.8 years), demonstrating superior multimodal reasoning capabilities.

Open Source Models

qwen2.5-vl-72b-instruct is the strongest open-source entry, ranking #7 overall with a location score of 8716, while remaining fully accessible to researchers.

Performance Gap

There is a gap of roughly 1,450 points between the best (9587) and worst (8137) location scores, highlighting how widely spatial-temporal reasoning capability varies across current models.

Evaluation Notes
Important considerations for interpreting results

Methodology

All models evaluated using identical prompts and test conditions. Results represent average performance across 2,500 diverse images from 150 countries.
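As a minimal sketch of how per-image results roll up into the averages reported in the table above (the record format, field names, and values here are hypothetical; the benchmark's actual scoring code is not shown on this page):

```python
from statistics import mean

# Hypothetical per-image records for a single model; fields and values
# are illustrative only, not taken from the actual benchmark.
results = [
    {"loc_score": 9800, "pred_year": 1994, "true_year": 1992},
    {"loc_score": 9100, "pred_year": 2005, "true_year": 2005},
    {"loc_score": 8700, "pred_year": 1978, "true_year": 1983},
]

# Average location score across all evaluated images.
avg_loc_score = mean(r["loc_score"] for r in results)

# Mean absolute error of the predicted year, in years.
year_mae = mean(abs(r["pred_year"] - r["true_year"]) for r in results)
```

Each leaderboard row is then just these per-model aggregates placed side by side.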

Statistical Significance

Performance differences between adjacent ranks are statistically significant (p < 0.05) using paired t-tests with Bonferroni correction.
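A minimal sketch of that test, assuming per-image scores are available for each model (the data below is synthetic and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel


def adjacent_ranks_significant(scores_a, scores_b, n_comparisons, alpha=0.05):
    """Paired t-test on two models' per-image scores, compared against
    a Bonferroni-corrected threshold of alpha / n_comparisons."""
    _, p_value = ttest_rel(scores_a, scores_b)
    return p_value, p_value < alpha / n_comparisons


# Synthetic per-image scores: model A sits consistently ~150 points above B.
rng = np.random.default_rng(42)
a = rng.normal(9500, 200, size=30)
b = a - 150 + rng.normal(0, 50, size=30)

# A 14-model leaderboard has 13 adjacent-rank comparisons to correct for.
p, significant = adjacent_ranks_significant(a, b, n_comparisons=13)
```

The pairing matters: both models see the same images, so testing the per-image score differences removes image difficulty as a confound.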

Regular Updates

The leaderboard is updated monthly with new model releases and improved evaluation protocols. Last updated: December 2024.

Explore Model Comparisons

Dive deeper into individual model performance with our interactive comparison tool. Analyze predictions on specific images and understand model reasoning patterns.