Model Rankings

Leaderboard

Comprehensive rankings of AI models based on their performance in geographic location and temporal prediction tasks.

| # | Model | Provider | Loc. Score | Year Score | Year MAE (yrs) | Avg. Dist. (km) | Perfect Loc. | Perfect Year | Avg. Time (s) | Images |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gemini-2.5-pro | google | 9587 | 9925 | 1.8 | 8.6 | 14 | 4 | 19.96 | 30 |
| 2 | gpt-5 | openai | 9364 | 9834 | 2.6 | 11.3 | 13 | 2 | 43.02 | 30 |
| 3 | gpt-4o | openai | 9241 | 9767 | 2.4 | 154.3 | 13 | 1 | 9.22 | 30 |
| 4 | gpt-5-mini | openai | 9015 | 9796 | 2.8 | 568.3 | 10 | 1 | 20.30 | 30 |
| 5 | mistral-medium-3.1 | mistralai | 8779 | 9321 | 2.8 | 754.8 | 9 | 0 | 11.83 | 28 |
| 6 | claude-opus-4 | anthropic | 8718 | 9219 | 4.0 | 236.9 | 9 | 0 | 13.76 | 29 |
| 7 | qwen2.5-vl-72b-instruct (open source) | qwen | 8716 | 9562 | 3.8 | 708.8 | 14 | 1 | 19.55 | 27 |
| 8 | gemma-3-27b-it (open source) | google | 8698 | 9241 | 3.3 | 537.8 | 7 | 0 | 7.95 | 30 |
| 9 | gemini-2.5-flash-lite | google | 8596 | 9447 | 3.8 | 650.3 | 7 | 0 | 5.04 | 30 |
| 10 | claude-sonnet-4 | anthropic | 8583 | 9155 | 3.6 | 589.6 | 12 | 0 | 11.35 | 30 |
| 11 | gemini-2.0-flash-001 | google | 8532 | 9005 | 3.6 | 678.4 | 10 | 0 | 4.49 | 30 |
| 12 | llama-4-maverick (open source) | meta-llama | 8349 | 8900 | 5.3 | 715.4 | 8 | 0 | 5.48 | 30 |
| 13 | gpt-5-nano | openai | 8162 | 9200 | 5.7 | 955.8 | 7 | 0 | 17.11 | 30 |
| 14 | mistral-small (open source) | mistral | 8137 | 8752 | 5.8 | 737.0 | 6 | 0 | 6.99 | 30 |

Performance Analysis

Data Visualization
- Top Models Average (Descending): average location and year scores of top models (stacked)
- Location vs Year Accuracy: trade-offs between geographic and temporal prediction accuracy
- Accuracy vs Speed: trade-off between model accuracy and generation time
- Year Prediction Errors: distribution of year prediction accuracy across models
- Multi-Metric Comparison (Top 5 Models): comprehensive performance analysis across all key metrics
- Perfect Predictions: comparison of perfect location and year predictions
- Generation Time Analysis: average processing time across different models
Key Insights
Notable patterns from the benchmark results

Top Performers

gemini-2.5-pro leads in both geographic accuracy (location score 9587) and temporal precision (year score 9925, with a mean year error of only 1.8 years), demonstrating superior multimodal reasoning capabilities.

Open Source Models

qwen2.5-vl-72b-instruct is the strongest open-source entry, ranking #7 overall with a location score of 8716, while remaining fully accessible to researchers.

Performance Gap

There is a gap of roughly 1,450 points between the best (9587) and worst (8137) location scores, highlighting how widely spatial-temporal reasoning capability varies across current models.

Evaluation Notes
Important considerations for interpreting results

Methodology

All models evaluated using identical prompts and test conditions. Results represent average performance across 2,500 diverse images from 150 countries.
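As a minimal sketch of how per-image results roll up into the averages reported in the table above (the record format, field names, and values here are hypothetical; the benchmark's actual scoring code is not shown on this page):

```python
from statistics import mean

# Hypothetical per-image records for a single model; fields and values
# are illustrative only, not taken from the actual benchmark.
results = [
    {"loc_score": 9800, "pred_year": 1994, "true_year": 1992},
    {"loc_score": 9100, "pred_year": 2005, "true_year": 2005},
    {"loc_score": 8700, "pred_year": 1978, "true_year": 1983},
]

# Average location score across all evaluated images.
avg_loc_score = mean(r["loc_score"] for r in results)

# Mean absolute error of the predicted year, in years.
year_mae = mean(abs(r["pred_year"] - r["true_year"]) for r in results)
```

Each leaderboard row is then just these per-model aggregates placed side by side.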

Statistical Significance

Performance differences between adjacent ranks are statistically significant (p < 0.05) using paired t-tests with Bonferroni correction.
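A minimal sketch of that test, assuming per-image scores are available for each model (the data below is synthetic and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel


def adjacent_ranks_significant(scores_a, scores_b, n_comparisons, alpha=0.05):
    """Paired t-test on two models' per-image scores, compared against
    a Bonferroni-corrected threshold of alpha / n_comparisons."""
    _, p_value = ttest_rel(scores_a, scores_b)
    return p_value, p_value < alpha / n_comparisons


# Synthetic per-image scores: model A sits consistently ~150 points above B.
rng = np.random.default_rng(42)
a = rng.normal(9500, 200, size=30)
b = a - 150 + rng.normal(0, 50, size=30)

# A 14-model leaderboard has 13 adjacent-rank comparisons to correct for.
p, significant = adjacent_ranks_significant(a, b, n_comparisons=13)
```

The pairing matters: both models see the same images, so testing the per-image score differences removes image difficulty as a confound.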

Regular Updates

The leaderboard is updated monthly with new model releases and improved evaluation protocols. Last updated: December 2024.

Explore Model Comparisons

Dive deeper into individual model performance with our interactive comparison tool. Analyze predictions on specific images and understand model reasoning patterns.