LLM Benchmark Explorer
80+ models. 12 benchmarks. One chart. Compare parameter counts against real performance — and see why the biggest model isn’t always the best.
LLM Benchmark vs. Parameter Count
56 models plotted
Hollow rings = undisclosed parameter counts (community estimates). Green zone = practical “good enough” threshold for general daily use. MoE models have far fewer active params per token than total; toggle “Active” in the sidebar to see compute-normalized positions.
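For readers who want to reproduce the “Active” toggle offline, here is a minimal sketch of how a model's x-axis position could be chosen. The `Model` class, its field names, and `plotted_params` are hypothetical and not taken from this page's code. The Mixtral 8x7B figures (46.7B total, 12.9B active per token) are the publicly stated ones, and the dense model is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    """Hypothetical record for one plotted model (not the page's real schema)."""
    name: str
    total_params_b: float                    # total parameters, in billions
    active_params_b: Optional[float] = None  # active params per token (MoE only)
    disclosed: bool = True                   # False would be drawn as a hollow ring


def plotted_params(model: Model, use_active: bool) -> float:
    """Return the parameter count (in billions) used for the x-axis.

    With the toggle on, MoE models move to their active-parameter count;
    dense models have no separate active count and stay where they are.
    """
    if use_active and model.active_params_b is not None:
        return model.active_params_b
    return model.total_params_b


# Mixtral 8x7B: 46.7B total, 12.9B active per token (publicly stated figures).
mixtral = Model("Mixtral 8x7B", total_params_b=46.7, active_params_b=12.9)
# Placeholder dense model, for illustration only.
dense = Model("hypothetical dense 13B", total_params_b=13.0)

for m in (mixtral, dense):
    print(m.name, plotted_params(m, use_active=False), plotted_params(m, use_active=True))
```

Under these assumptions, switching the toggle moves only the MoE model along the x-axis, which is why compute-normalized positions can shift a large sparse model left, toward the range of much smaller dense models.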
GPQA Diamond · Reasoning · introduced 2023
Graduate-Level Google-Proof Q&A: the full GPQA set has 448 expert-written multiple-choice questions in biology, chemistry, and physics, and the Diamond subset keeps the 198 questions judged highest quality. Questions are deliberately designed so that searching the web does not help; even PhD-level domain experts score only ~65–70%. It remains one of the most discriminating benchmarks for separating frontier models as of 2026.
7B–14B “General Use” threshold — everyday tasks, consumer hardware, interactive latency