WeirdBench

Unconventional LLM benchmarks.

WeirdBench tests modern LLMs on the weird corners other evals skip.

Unconventional tasks, clear score direction, and a dead-simple score table. The benchmark definitions and code are written locally and published openly.

Recent benchmarks:

Semantic Diversity

Generate exactly 20 English words that are maximally semantically unrelated to each other, then score the average pairwise semantic similarity. Lower is better.
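The scoring step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: it assumes each generated word has already been mapped to an embedding vector (the toy vectors below are stand-ins for real model embeddings), then averages cosine similarity over all pairs.

```python
from itertools import combinations
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(vectors):
    # Average cosine similarity over all unordered pairs.
    # Lower means the set is more semantically spread out.
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy stand-in embeddings; in practice each of the model's
# 20 words would be embedded with a real embedding model.
vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(mean_pairwise_similarity(vectors))  # orthogonal vectors → 0.0
```

With real embeddings, a model that emits near-synonyms scores high (bad), while a genuinely diverse word list pushes the average toward zero.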

Top Models

Model                        Score
anthropic/claude-opus-4.6    0.216
anthropic/claude-haiku-4.5   0.228
x-ai/grok-4.1-fast           0.232

Lower score is better.