WeirdBench

Unconventional LLM benchmarks.

WeirdBench tests modern LLMs on the weird corners other evals skip.

Unconventional tasks, clear score direction, and a dead-simple score table. The benchmark definitions and code are written locally and published openly.

Recent benchmarks:

Semantic Diversity

Generate exactly 20 English words that are maximally semantically unrelated to each other, then score the average pairwise semantic similarity. Lower is better.
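The scoring step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: it assumes each generated word has already been mapped to an embedding vector (the toy vectors below are stand-ins for real model embeddings), then averages cosine similarity over all pairs.

```python
from itertools import combinations
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(vectors):
    # Average cosine similarity over all unordered pairs.
    # Lower means the set is more semantically spread out.
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy stand-in embeddings; in practice each of the model's
# 20 words would be embedded with a real embedding model.
vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(mean_pairwise_similarity(vectors))  # orthogonal vectors → 0.0
```

With real embeddings, a model that emits near-synonyms scores high (bad), while a genuinely diverse word list pushes the average toward zero.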

Top Models

Model                        Score
anthropic/claude-opus-4.6    0.216
anthropic/claude-haiku-4.5   0.228
x-ai/grok-4.1-fast           0.232

Lower score is better.