
The MMLU Multiverse

MMLU is the AI industry's most-cited benchmark. We mapped its actual geometric structure and found that the 57 subject labels bear little resemblance to what embedding models actually see—and the gaps have direct consequences for any system built on top of those embeddings.

New to Pulsar? Read our intro post first for background on the Thema algorithm.

What is MMLU?

In 2020, Hendrycks et al. introduced MMLU — the Massive Multitask Language Understanding benchmark. The premise was simple: a truly capable language model should be able to answer questions across the full breadth of human knowledge. They assembled approximately 14,000 multiple-choice questions spanning 57 academic subjects, from elementary mathematics and world history to professional medicine and jurisprudence.

MMLU quickly became the benchmark. Every major AI lab reports an MMLU accuracy on release day—OpenAI with GPT-4, Google with Gemini, Anthropic with Claude, Meta with Llama. Leaderboards rank models by a single percentage. The press picks it up. Investors cite it. MMLU became the closest thing the field has to a standardized test for language models.

But a single number hides everything. A model scoring 85% on MMLU might be near-perfect on formal logic yet catastrophically wrong on moral reasoning—and the headline figure reveals nothing about that. Worse, MMLU's 57 subject labels are administrative categories chosen by the benchmark authors. They aren't a map of how the questions actually relate to each other in embedding space, and they aren't a map of where models succeed or fail. So the obvious question: what is the real structure, and what does it reveal that aggregate scores can't?

Why One Embedding Isn't Enough

Any single embedding model is just one perspective on your data, shaped by architecture choices, training dynamics, and random initialization. Relying on a single view means inheriting its biases—structure that appears in one model's embedding space might be an artifact of that model, not a genuine property of the data.

The alternative is to generate many representations and extract only the structure that remains stable across all of them. This is the core idea behind the multiverse analysis framework developed in co-founder Jeremy Wayland's work on representation learning (ICML 2024). Thema, the algorithm underlying Pulsar, operationalizes this: systematically generate hundreds of representations, build a local topological graph for each one, then fuse them into a single cosmic graph where edge weights reflect how consistently two data points share a neighborhood across all views.

Structure that persists in the cosmic graph is structure you can trust. Structure that appears in only a few views is noise. This distinction is invisible to any single-model evaluation.
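The fusion step can be sketched in a few lines. This is an illustrative reconstruction, not Thema's actual implementation: build a k-nearest-neighbor graph per embedding view, then weight each edge by the fraction of views in which the two points are neighbors. The function name `fuse_views` and its parameters are invented for this sketch.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def fuse_views(embeddings_list, k=10):
    """Fuse kNN graphs from many embedding views into one graph.

    Edge weight = fraction of views in which two points share a
    k-nearest-neighbor relationship. Illustrative sketch, not Thema itself.
    """
    n = embeddings_list[0].shape[0]
    fused = np.zeros((n, n))
    for X in embeddings_list:
        A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
        A = np.maximum(A, A.T)           # symmetrize: neighbor in either direction
        fused += A
    return fused / len(embeddings_list)  # weights in [0, 1]

# Toy usage: three random "views" of the same 50 points
rng = np.random.default_rng(0)
views = [rng.normal(size=(50, 8)) for _ in range(3)]
G = fuse_views(views, k=5)
```

An edge with weight 1.0 persists in every view; an edge near 0 appeared in only a few views and is treated as noise.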

The Experiment

We applied Pulsar's multiverse analysis to MMLU, fusing 100 distinct representations of the same 4,970 questions to extract structure that no single embedding could reliably identify on its own:

  • 10 embedding models: BAAI BGE (×5), Qwen3-Embedding, nomic-embed, sentence-transformers (×3)
  • 10 variants per model: different prefixes, max lengths, and normalization settings
  • 10,500 graphs fused: 100 representations × 105 graphs each

Spectral clustering on the resulting cosmic graph—using silhouette-driven k selection with zero manual tuning—identified the dataset's natural topological breakpoints.
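Silhouette-driven k selection over a precomputed affinity matrix can be written in a few lines with scikit-learn. The function name and the 1 − affinity distance proxy are assumptions for illustration; Pulsar's actual selection procedure is not shown here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(affinity, k_range=range(2, 8), seed=0):
    """Choose k for spectral clustering by maximizing silhouette score.

    `affinity` is a precomputed similarity matrix (e.g. fused edge weights).
    Silhouette is scored on a distance proxy (1 - affinity). Sketch only.
    """
    dist = 1.0 - affinity
    np.fill_diagonal(dist, 0.0)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = SpectralClustering(
            n_clusters=k, affinity="precomputed", random_state=seed
        ).fit_predict(affinity)
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Toy affinity: two obvious blocks of 20 points each
A = np.full((40, 40), 0.05)
A[:20, :20] = 0.9
A[20:, 20:] = 0.9
np.fill_diagonal(A, 1.0)
k, labels = pick_k_by_silhouette(A)
```

On the toy block matrix, the sweep settles on k = 2; on the fused MMLU graph the same procedure would sweep a wider range.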

What Pulsar Found

The Real Structure: 15 Regions, Not 57 Subjects

The cosmic graph resolves into 15 topological regions. Agreement between these regions and MMLU's 57 subject labels is weak: a normalized mutual information of just 0.470 and an adjusted Rand index of 0.208. Questions from wildly different subjects—astronomy and philosophy, economics and biology—group together when they share structural properties like sentence length, syntactic patterns, or quantitative reasoning demands. Meanwhile, single subjects split across multiple regions because they contain fundamentally different types of questions.

The largest regions blend questions from 30–50+ subjects (Region 3 alone spans 51 subjects with no single subject exceeding 16%). These are the broad thematic neighborhoods—cross-domain structure that subject labels miss entirely.
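The agreement scores above are standard clustering metrics. For reference, this is how NMI and ARI compare two labelings of the same items (toy labels here, not the MMLU data):

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Two labelings of the same six questions: benchmark subject vs.
# topological region. Toy values for illustration only.
subjects = ["algebra", "algebra", "law", "law", "ethics", "ethics"]
regions  = [0, 0, 1, 1, 1, 2]

nmi = normalized_mutual_info_score(subjects, regions)  # 1.0 = identical partitions
ari = adjusted_rand_score(subjects, regions)           # 0 = chance-level agreement
```

Both metrics are label-permutation invariant, so they compare partition structure rather than label names, which is what makes them suitable for comparing subject labels against discovered regions.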

[Figure: Sankey diagram mapping 10 of MMLU's 57 subjects (plus 47 more, grouped) to the 15 topological regions. The professional_law islands (R4, R11) and the moral_scenarios island (R8, n = 318) map cleanly; the other flows mix many-to-many, e.g. R3 (n = 964) spans 51 subjects.]

Pure island mappings are exact. Mixed flows illustrate the many-to-many pattern across the 12 non-island regions.

The Moral Scenarios Artifact

Region 8 is a geometric island: 318 questions, 100% from the “moral_scenarios” subject, completely isolated from the rest of MMLU. At first glance, this might seem like evidence that moral reasoning is semantically distinct. It isn't.

Every question in moral_scenarios shares an identical ~40-token prompt template prefix. The models aren't isolating these questions because of their content—they're isolating them because of their formatting. The embedding models are responding to a syntactic artifact, not performing moral reasoning.
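A cheap way to screen your own data for this kind of template artifact is to measure the longest common token prefix within each cluster. The helper below is a sketch; the example questions are invented paraphrases, not the actual moral_scenarios template.

```python
from os.path import commonprefix

def shared_prefix_tokens(questions):
    """Longest common token prefix across a group of questions.

    A long shared prefix (like moral_scenarios' ~40-token template) is a
    strong hint that clustering may be driven by formatting, not content.
    """
    token_lists = [q.split() for q in questions]
    # commonprefix compares any sequences element-wise, not just strings
    return commonprefix(token_lists)

qs = [
    "For which of these two scenarios does the main character do something wrong? Scenario 1: I fed my dog.",
    "For which of these two scenarios does the main character do something wrong? Scenario 1: I took a nap.",
]
prefix = shared_prefix_tokens(qs)  # 16 shared leading tokens
```

If the shared prefix is a large fraction of the average question length, cluster separation is likely a formatting artifact rather than a semantic signal.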

This is exactly the kind of blind spot that aggregate accuracy hides. Multiple models show their single largest performance deviation in Region 8—swings of 18–34 percentage points from their overall mean. A leaderboard number absorbs that into the average. Topology makes it visible.

This pattern isn't unique to MMLU. Any dataset with repeated boilerplate—legal disclaimers, form headers, templated survey questions—will produce the same effect: embedding models will cluster by format, and retrieval systems built on those embeddings will inherit the blind spot.

[Figure: Region 8 as an isolated island: 318 questions sharing the same ~40-token prefix, geometrically separated from the rest of MMLU.]

The Law Island

Professional law questions form two nearly pure geometric regions—Region 4 (99.6% professional_law, n=226) and Region 11 (98.8%, n=257). Legal text has a distinctive syntactic fingerprint: dense conditional clauses, domain-specific vocabulary, citation-heavy structure. Models can score well on these regions by exploiting narrow structural cues rather than demonstrating broad legal reasoning—another case where high accuracy on a benchmark may not reflect the capability it claims to measure.

Model Failures Are Geometric, Not Random

When we scored five LLMs against Pulsar's regions, every model showed strong region-specific deviations from its own mean accuracy. The pattern is consistent: failures cluster in specific geometric neighborhoods, not randomly across the benchmark.

Model                   Overall   Best Region   Worst Region
gemini-3.1-flash-lite   89.0%     R7 (+4.6%)    R12 (-17.6%)
grok-fast               81.8%     R6 (+18.3%)   R8 (-18.2%)
gpt-4.1-mini            81.5%     R2 (+8.5%)    R8 (-18.3%)
claude-3-haiku          65.6%     R6 (+34.4%)   R8 (-34.5%)

Region 8 (the moral_scenarios artifact) is a consistent failure zone across multiple models—further evidence that the benchmark is testing template recognition, not reasoning.
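Computing these per-region deviations is straightforward once each answer carries a region label. A sketch with toy data (the real numbers come from Pulsar's full evaluation):

```python
import pandas as pd

# Per-question results for one model: region label + correct/incorrect.
# Toy data for illustration only.
df = pd.DataFrame({
    "region":  ["R6", "R6", "R8", "R8", "R8", "R2"],
    "correct": [1,     1,    0,    0,    1,    1],
})

overall = df["correct"].mean()
by_region = df.groupby("region")["correct"].mean()
deviation = (by_region - overall) * 100   # percentage-point swing per region
```

A large positive swing in one region and a large negative swing in another, with the same overall accuracy, is exactly the pattern the leaderboard number hides.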

Explore the Topology

UMAP projection of the fused cosmic graph. Colors indicate topological regions (k=15). Hover for details.

Why This Matters Beyond Benchmarks

Benchmarks like MMLU don't exist in isolation—they shape how models are selected, how retrieval systems are built, and how organizations decide what's “good enough” for production. The structural blind spots we found in MMLU propagate directly into real-world systems.

RAG pipelines inherit embedding geometry. If your embedding model geometrically isolates legal text by syntactic pattern rather than semantic content, your retrieval system will retrieve based on formatting cues, not meaning. If it clusters moral reasoning questions by a shared template prefix, it will do the same to any domain data with similar structural repetition. Hallucinated retrievals aren't random—they follow the contours of the embedding space. Understanding that geometry is the first step toward fixing it.

Random sampling misses structural minorities. MMLU's 15 topological regions are vastly unequal in size—one region contains 900+ questions, another contains 5. Random evaluation sampling consistently misses small, structurally unique data neighborhoods. Our analysis shows that random sampling requires roughly 3× more data to achieve the same geometric coverage as topology-aware sampling. For enterprise teams running thousands of evaluation calls per release cycle, that's a direct cost multiplier.

Aggregate metrics mask deployment risk. A model with 85% overall accuracy and a 46-point drop in one geometric region is not a model you understand well enough to deploy confidently. Topology-aware evaluation surfaces these failure modes before they reach production—not as vague warnings about “edge cases,” but as specific, measurable geometric regions where a given model underperforms.

What's Next

MMLU was a starting point—a well-understood, publicly available benchmark where we could validate Pulsar's findings against known structure. But the same methodology applies anywhere embeddings are used to organize, retrieve, or evaluate data.

  • More benchmarks: We're running Pulsar against other widely-used evaluation suites to map where their administrative labels diverge from actual geometric structure.
  • Enterprise data: The same multiverse analysis that found MMLU's blind spots can map the topology of proprietary corpora—identifying retrieval gaps, redundancies, and structural risks before they surface as production failures.
  • Topology-aware evaluation as a practice: Instead of sampling randomly and hoping for coverage, teams can sample deterministically—one point per topological region—guaranteeing full geometric coverage at a fraction of the compute cost.
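The one-point-per-region idea is ordinary stratified sampling over region labels. A minimal sketch (the function name and interface are invented for illustration):

```python
import random

def topology_aware_sample(regions, per_region=1, seed=0):
    """Deterministic stratified sample: pick `per_region` points from each
    topological region, guaranteeing every region is covered.

    `regions` maps question index -> region id. Sketch of the idea only.
    """
    rng = random.Random(seed)
    by_region = {}
    for idx, r in regions.items():
        by_region.setdefault(r, []).append(idx)
    sample = []
    for r, members in sorted(by_region.items()):
        sample.extend(rng.sample(members, min(per_region, len(members))))
    return sample

# 10 questions spread over 3 regions; every region appears in the sample
regions = {i: i % 3 for i in range(10)}
picked = topology_aware_sample(regions, per_region=2)
```

Unlike uniform random sampling, this guarantees coverage of the 5-question region as surely as the 900-question one, with a fixed, predictable evaluation budget.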

Reproduce This Analysis

Both notebooks are available on GitHub:

Resources

  • Original topology demo with one embedding model. Includes calibration details and helper scripts.
  • Full 100-representation analysis: 10 models × 10 variants fused into the cosmic graph.
  • Official documentation for the MMLU topology demo with setup instructions.
  • Background on Pulsar and the Thema algorithm. Start here if you're new.
  • Source code, issues, and releases. Blazing-fast Rust implementation.
  • Original publication introducing robust topology via parameter sweeps.