LLM Benchmark Explorer

80+ models. 12 benchmarks. One chart. Compare parameter counts against real performance — and see why the biggest model isn’t always the best.

LLM Benchmark vs. Parameter Count

56 models plotted

Hollow rings = undisclosed parameter counts (community estimates). Green zone = practical “good enough” threshold for general daily use. MoE models have far fewer active params per token than total — toggle “Active” in the sidebar to see compute-normalized positions.
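As a rough illustration of the "Active" toggle described above, the sketch below picks a model's x-axis value from either its total or its active parameter count. The `ModelPoint` record, its field names, and the `x_position` helper are hypothetical and not taken from the explorer's code; the Mixtral and DeepSeek-V3 figures are the publicly reported totals and per-token active counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelPoint:
    """Hypothetical record for one plotted model (parameter counts in billions)."""
    name: str
    total_params_b: float
    active_params_b: Optional[float] = None  # None for dense models
    params_confirmed: bool = True            # False would be drawn as a hollow ring


def x_position(model: ModelPoint, view: str = "total") -> float:
    """Return the parameter count used for the x-axis under the given view.

    In the "active" view, MoE models move to their per-token active count;
    dense models have no separate active count and stay where they are.
    """
    if view == "active" and model.active_params_b is not None:
        return model.active_params_b
    return model.total_params_b


# Publicly reported figures: total vs. active parameters per token.
mixtral = ModelPoint("Mixtral 8x7B", total_params_b=46.7, active_params_b=12.9)
deepseek_v3 = ModelPoint("DeepSeek-V3", total_params_b=671.0, active_params_b=37.0)
dense_8b = ModelPoint("hypothetical dense 8B", total_params_b=8.0)

for m in (mixtral, deepseek_v3, dense_8b):
    print(m.name, x_position(m, "total"), x_position(m, "active"))
```

Under this assumption, switching views moves only the MoE points leftward, which is why a model with a very large total count can land near much smaller dense models once compute per token is what is being compared.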

Sidebar controls: Benchmark · MoE Parameter View · Closed Models · Parameter Range · Filter by family

Dot styles: Confirmed params (open-weight) · Estimated params (closed model)
GPQA Diamond · Reasoning · introduced 2023

Graduate-Level Google-Proof Q&A: expert-written multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains 198 questions drawn from the 448-question main set. Questions are designed to resist web search; even PhD-level domain experts score only ~65–70%. Presented here as the most discriminating benchmark still separating frontier models as of 2026.

Still Active
7B–14B “General Use” threshold — everyday tasks, consumer hardware, interactive latency
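To make the "consumer hardware" part of the threshold concrete, here is a back-of-the-envelope sketch of how much memory the weights alone need at a given precision. The function names and the weights-only simplification are assumptions for illustration: real usage also includes the KV cache, activations, and runtime overhead, so actual requirements are higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 weights * (bits_per_weight / 8) bytes each,
    divided by 1e9 bytes per GB, simplifies to the expression below.
    """
    return params_billions * bits_per_weight / 8


def in_general_use_band(params_billions: float,
                        low: float = 7.0, high: float = 14.0) -> bool:
    """True if the model falls inside the 7B-14B band (bounds inclusive by assumption)."""
    return low <= params_billions <= high


for size in (7, 14):
    for bits in (16, 4):
        print(f"{size}B at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB of weights")
```

By this estimate the band spans roughly 3.5 GB (7B at 4-bit) to 28 GB (14B at 16-bit) of weights, which is the range where quantization decides whether a model fits on a single consumer GPU.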