Research Methodology

TimeGuessr AI Benchmark Methodology

A comprehensive benchmark tool for evaluating vision-capable LLMs on historical image analysis, testing their ability to infer geographic location and temporal context from TimeGuessr dataset imagery.

Benchmark Overview

The TimeGuessr AI Benchmark evaluates vision-capable large language models on historical image analysis using the TimeGuessr dataset. Models are tasked with predicting both the year and geographic location where historical photographs were captured, requiring integration of visual, cultural, and historical knowledge.

The benchmark uses the official TimeGuessr scoring algorithm with a maximum of 10,000 points per image (5,000 for location accuracy, 5,000 for temporal accuracy). This dual-task evaluation provides insights into models' spatial reasoning, cultural understanding, and temporal inference capabilities.

Built with the Vercel AI SDK, the benchmark tool supports parallel model evaluation with real-time progress tracking through a live CLI interface, enabling comprehensive comparison of multiple LLMs simultaneously.

Task Definitions

Geographic Localization
Predicting the geographic location where an image was captured

Task Formulation

Given a historical image I, predict the geographic coordinates (latitude, longitude) where the image was captured. Models receive no contextual information beyond the visual content.

Visual Cues

  • Architectural styles and building materials
  • Vegetation and landscape features
  • License plates and signage
  • Cultural and social indicators

Temporal Estimation
Estimating the year or time period when an image was captured

Task Formulation

Given a historical image I, predict the year Y when the image was captured (range: 1800-2024). Models analyze visual temporal indicators without additional metadata.

Temporal Indicators

  • Technology and device appearances
  • Fashion and clothing styles
  • Vehicle models and designs
  • Image quality and color characteristics

Official TimeGuessr Scoring System

The benchmark uses the exact scoring algorithm from the original TimeGuessr game to ensure compatibility and meaningful comparison with human performance. This scoring system rewards both temporal and spatial accuracy with a maximum of 10,000 points per image.

Location Scoring (Maximum 5,000 Points)

Location accuracy is scored using distance-based thresholds with the Haversine formula:

≤50m: 5,000 points
50m-1km: 5,000 - (distance × 0.02)
1km-5km: 4,980 - (distance × 0.016)
5km-100km: 4,900 - (distance × 0.004)
100km-1000km: 4,500 - (distance × 0.001)
1000km-2000km: 3,500 - (distance × 0.0005)
2000km-3000km: 2,500 - (distance × 0.00033333)
3000km-6000km: 1,500 - (distance × 0.0002)
>6000km: 12 points (minimum score)
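The distance computation and threshold table above can be sketched as follows. This is an illustrative TypeScript implementation, not the tool's exact source: the function names are assumptions, and the piecewise formulas apply the table verbatim with distance measured in meters (the interpretation under which adjacent brackets join continuously, e.g. 5,000 − 1,000 × 0.02 = 4,980 at the 1 km boundary).

```typescript
// Great-circle distance between two coordinates (Haversine formula).
function haversineKm(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const R = 6371; // mean Earth radius in km
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Location score per the threshold table above; distance converted to meters.
function scoreLocation(distanceKm: number): number {
  const m = distanceKm * 1000;
  if (m <= 50) return 5000;
  if (m <= 1_000) return 5000 - m * 0.02;
  if (m <= 5_000) return 4980 - m * 0.016;
  if (m <= 100_000) return 4900 - m * 0.004;
  if (m <= 1_000_000) return 4500 - m * 0.001;
  if (m <= 2_000_000) return 3500 - m * 0.0005;
  if (m <= 3_000_000) return 2500 - m * 0.00033333;
  if (m <= 6_000_000) return 1500 - m * 0.0002;
  return 12; // minimum score beyond 6,000 km
}
```

For example, a prediction 500 m from the true location scores 5,000 − 500 × 0.02 = 4,990 points.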

Temporal Scoring (Maximum 5,000 Points)

Year predictions are scored based on absolute difference from actual year:

Exact year: 5,000 points
1 year off: 4,950 points
2 years off: 4,800 points
3 years off: 4,600 points
4 years off: 4,300 points
5 years off: 3,900 points
6-7 years off: 3,400 points
8-10 years off: 2,500 points
11-15 years off: 2,000 points
16-20 years off: 1,000 points
>20 years off: 0 points

Total Score = Location Score + Year Score (Maximum: 10,000 points)
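The temporal table maps absolute year error to a fixed point value, which makes it natural to express as a lookup. A minimal sketch (function name illustrative, values taken directly from the table above):

```typescript
// Year score per the table above: 5,000 for an exact match,
// stepping down with absolute error, 0 beyond 20 years.
function scoreYear(predicted: number, actual: number): number {
  const diff = Math.abs(predicted - actual);
  // Each entry is [maximum absolute error, points awarded].
  const table: Array<[number, number]> = [
    [0, 5000], [1, 4950], [2, 4800], [3, 4600], [4, 4300],
    [5, 3900], [7, 3400], [10, 2500], [15, 2000], [20, 1000],
  ];
  for (const [maxDiff, points] of table) {
    if (diff <= maxDiff) return points;
  }
  return 0; // more than 20 years off
}
```

A guess of 1957 for a 1950 photograph is 7 years off and scores 3,400 points; anything more than 20 years off scores 0.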

Model Evaluation Protocol

Standardized Prompting System
Consistent prompt engineering across all evaluated models using structured JSON responses

System Prompt

"You are an expert historian and geographer analyzing historical photographs. Your task is to predict the year and location where each photograph was taken based on visual clues including clothing styles, architecture, vehicles, technology, street furniture, signage, and cultural indicators."

User Prompt Template

"Analyze this historical photograph and predict: 1) The year it was taken (as precisely as possible), 2) The geographic location where it was taken (latitude and longitude). Look carefully at all visual clues including clothing, architecture, vehicles, technology, signs, and other temporal or geographical indicators."

Response Schema (JSON)

{
  "year": number (1800-2024),
  "location": {
    "lat": number (-90 to 90),
    "lng": number (-180 to 180)
  },
  "confidence": number (0-1),
  "reasoning": "Detailed explanation of visual clues"
}
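The benchmark validates responses against this schema with Zod. As a self-contained sketch of the same bounds checks, here is an equivalent plain-TypeScript validator (the interface and function names are illustrative, not the tool's actual identifiers):

```typescript
interface Prediction {
  year: number;
  location: { lat: number; lng: number };
  confidence: number;
  reasoning: string;
}

// Validate a parsed model response against the schema above,
// enforcing the year range and coordinate bounds. Returns the
// typed prediction or throws naming the failing field.
function validatePrediction(raw: unknown): Prediction {
  const p = raw as Partial<Prediction>;
  if (typeof p?.year !== "number" || p.year < 1800 || p.year > 2024)
    throw new Error("year must be a number in 1800-2024");
  const loc = p.location;
  if (typeof loc?.lat !== "number" || loc.lat < -90 || loc.lat > 90)
    throw new Error("location.lat must be in -90..90");
  if (typeof loc?.lng !== "number" || loc.lng < -180 || loc.lng > 180)
    throw new Error("location.lng must be in -180..180");
  if (typeof p.confidence !== "number" || p.confidence < 0 || p.confidence > 1)
    throw new Error("confidence must be in 0..1");
  if (typeof p.reasoning !== "string")
    throw new Error("reasoning must be a string");
  return p as Prediction;
}
```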

Technical Implementation
Built with Vercel AI SDK for consistent model interaction and parallel processing

Parallel Model Evaluation

  • Concurrent API calls to multiple LLM providers
  • Real-time progress tracking with live CLI interface
  • Individual model completion callbacks
  • Configurable concurrency limits for rate limiting
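The concurrency-limited evaluation with per-completion callbacks can be sketched with plain Promises. This is an illustrative pattern, not the tool's actual code (which drives requests through the Vercel AI SDK); the function name and signature are assumptions:

```typescript
// Run async evaluation tasks with a bounded number in flight,
// invoking a callback as each one settles. A fixed pool of
// workers pulls from a shared cursor, so at most `limit` tasks
// run concurrently while results keep their original order.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
  onDone: (index: number, result: T) => void,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // safe: single-threaded event loop
      results[i] = await tasks[i]();
      onDone(i, results[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker),
  );
  return results;
}
```

The `limit` argument plays the role of the CLI's configurable concurrency setting, throttling how many provider API calls are in flight at once.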

Response Processing & Validation

  • Zod schema validation for structured JSON responses
  • Automatic text repair for malformed JSON outputs
  • Geographic coordinate boundary validation
  • Temporal prediction bounds checking (1800-2024)
  • Generation time measurement for performance analysis

Error Handling & Reliability

  • Optional skip-errors mode for unattended runs
  • Individual model failure isolation
  • Comprehensive error logging and reporting
  • Partial result preservation on failures

Performance Analytics & Metrics

Aggregate Statistics

Comprehensive performance metrics generated for each model:

  • Average and median TimeGuessr scores
  • Mean absolute error (MAE) for year predictions
  • Average and median distance error (km)
  • Perfect predictions count (exact year, ≤50m location)
  • Generation time analysis (mean, median, total)

Dataset Integration

Direct integration with TimeGuessr scraped dataset:

  • Historical images with verified year and location metadata
  • Country and description information for context analysis
  • Configurable sample sizes for different evaluation scales
  • Random sampling to ensure representative testing
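The random-sampling step can be implemented as a partial Fisher-Yates shuffle, which draws without replacement in one pass. A sketch for illustration (the function name is an assumption, not the tool's actual API):

```typescript
// Draw a uniform random sample of n items without replacement
// using a partial Fisher-Yates shuffle. The input array is
// copied, so the full dataset is left untouched.
function sampleImages<T>(items: T[], n: number): T[] {
  const pool = [...items];
  const k = Math.min(n, pool.length);
  for (let i = 0; i < k; i++) {
    // Pick a random index from the not-yet-chosen tail and swap it in.
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}
```

This is what a `--max-images 50` run conceptually does before evaluation: select 50 images uniformly at random from the scraped dataset.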

Benchmark Configuration

CLI Interface Features

  • Real-time progress visualization with Ink library
  • Live model completion status tracking
  • Configurable concurrency for API rate limiting
  • Error handling with optional skip-errors mode
  • JSON export for detailed result analysis

Supported Models

  • OpenAI GPT-4 Vision variants
  • Google Gemini Pro and Flash models
  • Anthropic Claude with vision capabilities
  • Extensible model configuration system
  • Provider-agnostic evaluation framework

Implementation & Reproducibility

Open Source Components

  • Complete benchmark implementation source code
  • CLI tool with Vercel AI SDK integration
  • TimeGuessr scoring algorithm implementation
  • Model configuration and prompt templates
  • Real-time progress tracking interface

Data & Results

  • Structured JSON output with complete predictions
  • Performance summaries and aggregate statistics
  • Individual model scores and reasoning
  • Generation time and error analysis
  • Compatible with TimeGuessr dataset format

Benchmark Usage Example

# Run benchmark with 50 images across multiple models
bun run benchmark run --max-images 50 --models gpt4v,claude3,gemini --concurrency 3 --output results.json