Multi-dimensional ability

A scalar ability $\theta_h$ assumes every race tests the same skill. Most real contests don't: horses run different distances on different surfaces; players have grass games and clay games; LLMs that win on factual recall lose on long-context reasoning. Forcing one number per item and then averaging across races throws away the signal that those races are measuring different things.

The fix is to give each item a small latent vector, and each race a small latent direction that selects which combination of the item's skills matters for that race. Everything else — the Thurstone forward physics, the lattice machinery, the curves — is unchanged.

See it in 2D

Five horses in a (speed, stamina) plane. The arrow is the race's skill ray $v_r$. Slide the ray from horizontal (pure sprint) to vertical (pure stayer): the favourites shuffle because each horse's effective ability is its projection onto the ray. Same 5 horses, completely different rankings depending on what the race is asking for.

[Interactive figure: a slider sets the race ray angle, starting at 0°.]

The model

Pick a small dimension $d \ge 1$ (typically 2 or 3). Each item $h$ has a latent embedding $z_h \in \mathbb{R}^d$. Each race (or more generally, each condition) $r$ has a unit-norm direction $v_r \in \mathbb{R}^d$ and a scalar bias $b_r$. The effective ability of item $h$ in race $r$ is $$ a_{r,h} \;=\; b_r \;+\; \langle v_r,\, z_h \rangle. $$ In words: project the item's vector onto the race's direction, shift by the race's bias. The projected scalar is what feeds into the Thurstone contest for that race — exactly as if it were a scalar ability in the 1D model.

Reading the dimensions:

$d = 1$, $v_r \equiv 1$: collapses to the scalar global model $a_{r,h} = b_r + \theta_h$.
$d = 2$: items live on a plane; each race is a unit vector picking out a direction. The classic example is sprinters vs stayers — one axis for raw speed, another for stamina, and the race's ray says how it weights the two.
$d = 3$: a third axis (going, surface, concentration…). Still cheap to fit; the linear-algebra cost per inner step is $O(d^3)$ per item.
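The projection formula can be tried out in a few lines. The sketch below is a minimal illustration in plain JavaScript, not part of the library: the five embeddings are made-up numbers, and the helper names (`effectiveAbility`, `rayAtAngle`, `ranking`) are hypothetical. It evaluates $a_{r,h} = b_r + \langle v_r, z_h \rangle$ for a pure-speed ray and a pure-stamina ray and ranks the same horses under each.

```javascript
// Hypothetical (speed, stamina) embeddings for five horses. The numbers are
// illustrative only and are not fitted to any data.
const horses = {
  A: [1.2, -0.4],
  B: [0.8, 0.3],
  C: [0.1, 1.1],
  D: [-0.5, 0.9],
  E: [-0.9, -0.6],
};

const dot = (u, w) => u.reduce((s, ui, i) => s + ui * w[i], 0);

// Effective ability a_{r,h} = b_r + <v_r, z_h>.
function effectiveAbility(bias, ray, z) {
  return bias + dot(ray, z);
}

// Unit-norm race ray at a given angle: 0 deg = pure speed, 90 deg = pure stamina.
function rayAtAngle(deg) {
  const t = (deg * Math.PI) / 180;
  return [Math.cos(t), Math.sin(t)];
}

// Horse ids sorted by effective ability, best first.
function ranking(bias, ray) {
  return Object.keys(horses).sort(
    (p, q) =>
      effectiveAbility(bias, ray, horses[q]) -
      effectiveAbility(bias, ray, horses[p])
  );
}

const sprintOrder = ranking(0, rayAtAngle(0));
const stayerOrder = ranking(0, rayAtAngle(90));
console.log("sprint:", sprintOrder.join(" > ")); // sprint: A > B > C > D > E
console.log("stayer:", stayerOrder.join(" > ")); // stayer: C > D > B > A > E
```

The bias $b_r$ shifts every horse in a race by the same amount, so it never changes the ordering within that race; only the ray does.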

Why it's still a Thurstone race

Given $z$, $v$, $b$ we get the projected scalar ability $a_{r,h}$ for every item in every race. From that point the model is identical to the 1D case: shift the base density by $a_{r,h}$, build the field, read off state prices via multiplicity-aware payoffs. The contest physics doesn't care that the ability was constructed by a dot product instead of given directly.

What multi-dimensional structure buys is coupling across races. Two races whose directions $v_r$, $v_{r'}$ are not orthogonal share information about every item that runs in both. A strong run at a short distance constrains the horse's $z_h$, and that constraint carries over, perhaps with a different sign, to its projected ability at a long distance. The model explains all races with a shared geometry rather than a fresh ability per (item, condition).

Fitting: block Gauss-Newton on the cached curves

The same trick that made the scalar global fit cheap works here. For each race $r$ and item $h$, tabulate the ability$\to$price curve $g_{r,h}(\mu)$ once on a grid (using the single-race machinery). Then alternate two block updates on the joint objective $$ L(z, v, b) \;=\; \tfrac12 \sum_r \sum_{h \in H_r} w_{r,h}\, \bigl( g_{r,h}(b_r + v_r^{\!\top} z_h) - p_{r,h} \bigr)^2. $$
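As a rough sketch of how the objective can be evaluated from cached curves: the code below tabulates a curve once on a grid, evaluates it by linear interpolation, and sums the weighted squared price errors. Everything here is an assumption made for illustration. In particular `standInCurve` is a logistic placeholder, not the Thurstone ability-to-price curve that the single-race machinery would produce, and the data layout (`races`, `entries`) is hypothetical.

```javascript
// ASSUMPTION: a logistic stand-in for g_{r,h}. The real curve comes from the
// single-race Thurstone machinery, not from this formula.
const standInCurve = (mu) => 1 / (1 + Math.exp(-mu));

// Tabulate a curve once on a uniform grid over [lo, hi] with n points.
function tabulate(fn, lo, hi, n) {
  const step = (hi - lo) / (n - 1);
  const xs = Array.from({ length: n }, (_, i) => lo + i * step);
  return { lo, step, xs, ys: xs.map(fn) };
}

// Evaluate the cached table by linear interpolation, clamping outside the grid.
function evalCurve(tab, mu) {
  const n = tab.xs.length;
  const t = Math.min(Math.max((mu - tab.lo) / tab.step, 0), n - 1);
  const i = Math.min(Math.floor(t), n - 2);
  const f = t - i;
  return tab.ys[i] * (1 - f) + tab.ys[i + 1] * f;
}

const dot = (u, w) => u.reduce((s, ui, i) => s + ui * w[i], 0);

// L(z, v, b) = 1/2 * sum_r sum_h w_{r,h} * (g_{r,h}(b_r + v_r . z_h) - p_{r,h})^2
function loss(races, z) {
  let total = 0;
  for (const race of races) {
    for (const entry of race.entries) {
      const a = race.b + dot(race.v, z[entry.item]);
      const resid = evalCurve(entry.curve, a) - entry.price;
      total += 0.5 * entry.weight * resid * resid;
    }
  }
  return total;
}

// Toy data: prices are generated from the same curve, so the loss is zero at
// the generating parameters and positive once an embedding is perturbed.
const curve = tabulate(standInCurve, -6, 6, 241);
const z = { A: [1.0, -0.5], B: [-0.2, 0.8] };
const races = [
  {
    b: 0.1,
    v: [1, 0],
    entries: [
      { item: "A", price: evalCurve(curve, 0.1 + 1.0), weight: 1, curve },
      { item: "B", price: evalCurve(curve, 0.1 - 0.2), weight: 1, curve },
    ],
  },
];
const lossAtTruth = loss(races, z);
const lossPerturbed = loss(races, { ...z, A: [0.5, -0.5] });
console.log(lossAtTruth, lossPerturbed);
```

Because the curve is only looked up, never recomputed, each loss evaluation costs one interpolation per (race, item) pair.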

Race-side step (with $z$ fixed): linearise $g_{r,h}$ in $(b_r, v_r)$, define the pseudo-response $y_{r,h} = -e_{r,h}/s_{r,h}$ (with the residual $e_{r,h} = g_{r,h}(b_r + v_r^{\!\top} z_h) - p_{r,h}$ and the slope $s_{r,h} = g'_{r,h}$ from the cached curve, both evaluated at the current projected ability), and solve a $(1 + d)$-dimensional ridge regression per race: $$ \min_{\delta b_r,\,\delta v_r}\, \sum_{h \in H_r} \bigl( \delta b_r + \delta v_r^{\!\top} z_h - y_{r,h} \bigr)^2 + \lambda_v \|\delta v_r\|^2. $$ Update $b_r$ and $v_r$ by the solution scaled by a step size, then renormalise $\|v_r\|$ to 1.
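Where the pseudo-response comes from, as a short derivation sketch. The symbol $\delta a_{r,h}$ is introduced here for illustration and denotes the change in projected ability, $\delta a_{r,h} = \delta b_r + \delta v_r^{\!\top} z_h$ on the race side. To first order around the current projected ability $a_{r,h} = b_r + v_r^{\!\top} z_h$, $$ g_{r,h}(a_{r,h} + \delta a_{r,h}) - p_{r,h} \;\approx\; e_{r,h} + s_{r,h}\,\delta a_{r,h}, $$ so each term of the objective becomes $$ \tfrac12\, w_{r,h}\,\bigl(e_{r,h} + s_{r,h}\,\delta a_{r,h}\bigr)^2 \;=\; \tfrac12\, w_{r,h}\, s_{r,h}^2\,\bigl(\delta a_{r,h} - y_{r,h}\bigr)^2, \qquad y_{r,h} = -\,e_{r,h}/s_{r,h}. $$ Read literally, the linearised terms carry the factor $w_{r,h}\, s_{r,h}^2$; the ridge problem as displayed is the unit-weight form of the same regression.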

Item-side step (with $b, v$ fixed): for each item, linearise across the races it ran in, reuse the pseudo-response $y_{r,h}$ defined as in the race-side step, and solve a $d$-dimensional ridge problem: $$ \min_{\delta z_h}\, \sum_{r \in \mathcal{R}(h)} \bigl( v_r^{\!\top} \delta z_h - y_{r,h} \bigr)^2 + \lambda_z \|\delta z_h\|^2. $$ Then update $z_h$ by $\delta z_h$.
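The two block updates above can be sketched with one small ridge solver. This is an illustrative, dependency-free version and not the code in MultiRayGlobalCalibrator: the function names (`solve`, `ridge`, `raceStep`, `itemStep`, `pseudo`) and argument layouts are hypothetical, it uses unit weights as in the displayed ridge problems, and it assumes the same step size is applied on the item side.

```javascript
// Solve the small dense system A x = rhs by Gaussian elimination with partial pivoting.
function solve(A, rhs) {
  const n = rhs.length;
  const M = A.map((row, i) => [...row, rhs[i]]);
  for (let c = 0; c < n; c++) {
    let p = c;
    for (let r = c + 1; r < n; r++) {
      if (Math.abs(M[r][c]) > Math.abs(M[p][c])) p = r;
    }
    const tmp = M[c]; M[c] = M[p]; M[p] = tmp;
    for (let r = c + 1; r < n; r++) {
      const f = M[r][c] / M[c][c];
      for (let k = c; k <= n; k++) M[r][k] -= f * M[c][k];
    }
  }
  const x = new Array(n).fill(0);
  for (let r = n - 1; r >= 0; r--) {
    let s = M[r][n];
    for (let k = r + 1; k < n; k++) s -= M[r][k] * x[k];
    x[r] = s / M[r][r];
  }
  return x;
}

// Ridge least squares: minimise sum_i (rows[i] . beta - ys[i])^2 + sum_j pen[j] * beta[j]^2
// through the normal equations (X^T X + diag(pen)) beta = X^T y.
function ridge(rows, ys, pen) {
  const n = pen.length;
  const A = Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (i === j ? pen[i] : 0)));
  const rhs = new Array(n).fill(0);
  rows.forEach((x, i) => {
    for (let j = 0; j < n; j++) {
      rhs[j] += x[j] * ys[i];
      for (let k = 0; k < n; k++) A[j][k] += x[j] * x[k];
    }
  });
  return solve(A, rhs);
}

// Race-side step: regress y on [1, z_h]; only delta v is penalised, not delta b.
function raceStep(race, zs, ys, lambdaV, step) {
  const d = race.v.length;
  const rows = zs.map((z) => [1, ...z]);
  const [db, ...dv] = ridge(rows, ys, [0, ...new Array(d).fill(lambdaV)]);
  const raw = race.v.map((vi, k) => vi + step * dv[k]);
  const norm = Math.hypot(...raw);
  return { b: race.b + step * db, v: raw.map((x) => x / norm) }; // back to unit length
}

// Item-side step: regress y on the rays v_r of the races this item ran in.
function itemStep(z, rays, ys, lambdaZ, step) {
  const dz = ridge(rays, ys, new Array(z.length).fill(lambdaZ));
  return z.map((zk, k) => zk + step * dz[k]);
}

// Pseudo-responses y = -e / s from residuals e = g(a) - p and cached slopes s = g'(a).
const pseudo = (es, ss) => es.map((e, i) => -e / ss[i]);

// Toy race: residuals are built so the unpenalised solution is db = 0.3, dv = [0.2, -0.1].
const zs = [[1, 0], [0, 1], [-1, 0.5], [0.5, -1]];
const slopes = [0.25, 0.2, 0.3, 0.15];
const resid = zs.map((z, i) => -slopes[i] * (0.3 + 0.2 * z[0] - 0.1 * z[1]));
const race = raceStep({ b: 0, v: [1, 0] }, zs, pseudo(resid, slopes), 0, 1);

// Toy item: two races with orthogonal unit rays and lambda_z = 1, so each
// coordinate of delta z is y / (1 + lambda_z).
const zNew = itemStep([0, 0], [[1, 0], [0, 1]], [0.4, -0.2], 1, 1);
console.log(race, zNew);
```

Sharing one `ridge` helper makes the symmetry visible: the race side regresses on item embeddings plus an intercept, the item side regresses on race rays with no intercept.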

Every few inner iterations, rebuild the cached curves $g_{r,h}$ around the current projected abilities. This is what MultiRayGlobalCalibrator does (in both Python and the JS port); the per-step cost is dominated by curve rebuilds, not the linear algebra.

Identifiability — what the data can't pin down

A latent-embedding model has gauges. Three of them are unobservable from the prices alone:

Translation. Shifting all $z_h$ by the same vector $t$ and absorbing the change into each bias, $b_r \to b_r - \langle v_r, t \rangle$, leaves every projected ability unchanged.
Scale. Rescaling all $z_h$ by $\alpha$ and all $v_r$ by $1/\alpha$ leaves the dot products invariant.
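Rotation. Any orthogonal transform $Q$ applied jointly to the embeddings and the rays, $z_h \to Q z_h$ and $v_r \to Q v_r$, leaves all projections invariant, since $\langle Q v_r, Q z_h \rangle = \langle v_r, z_h \rangle$.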

We fix the first two by (1) centring $\{z_h\}$ to mean zero each outer iteration and (2) constraining each $\|v_r\| = 1$ after every update. The third — rotation — is left intact. Comparisons $\langle v_r, z_h \rangle$ are invariant under it, but the raw coordinates of $z_h$ are not interpretable in isolation: don't report “item $h$ has speed-component 0.42” without nailing down a reference frame first.
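The gauges can be checked numerically. The sketch below is illustrative only (hypothetical helper names and made-up numbers) and assumes that centring absorbs the removed mean into each race's bias. It verifies that centring leaves every projected ability unchanged, and that rotating all embeddings and all rays by the same angle does too, even though the raw coordinates change.

```javascript
const dot = (u, w) => u.reduce((s, ui, i) => s + ui * w[i], 0);

// Projected abilities a_{r,h} = b_r + <v_r, z_h> for every (race, item) pair.
function abilities(zs, races) {
  return races.map((r) => zs.map((z) => r.b + dot(r.v, z)));
}

// Translation gauge: subtract the mean embedding and absorb <v_r, mean> into b_r.
function centre(zs, races) {
  const d = zs[0].length;
  const mean = Array.from({ length: d }, (_, k) =>
    zs.reduce((s, z) => s + z[k], 0) / zs.length);
  return {
    zs: zs.map((z) => z.map((zk, k) => zk - mean[k])),
    races: races.map((r) => ({ b: r.b + dot(r.v, mean), v: r.v })),
  };
}

// Rotation gauge for d = 2: rotate a vector by the angle theta.
function rotate2d(vec, theta) {
  const c = Math.cos(theta);
  const s = Math.sin(theta);
  return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]];
}

// Largest absolute difference between two ability tables of the same shape.
function maxDiff(a, b) {
  let m = 0;
  a.forEach((row, i) => row.forEach((x, j) => { m = Math.max(m, Math.abs(x - b[i][j])); }));
  return m;
}

// Made-up embeddings and unit-norm rays.
const zs = [[1.2, -0.4], [0.8, 0.3], [0.1, 1.1], [-0.5, 0.9]];
const races = [
  { b: 0.2, v: [1, 0] },
  { b: -0.1, v: [0.6, 0.8] },
];

const before = abilities(zs, races);

const centred = centre(zs, races);
const afterCentre = abilities(centred.zs, centred.races);

const theta = 0.7;
const rotatedZs = zs.map((z) => rotate2d(z, theta));
const rotatedRaces = races.map((r) => ({ b: r.b, v: rotate2d(r.v, theta) }));
const afterRotate = abilities(rotatedZs, rotatedRaces);

console.log("centring changes abilities by", maxDiff(before, afterCentre));
console.log("rotation changes abilities by", maxDiff(before, afterRotate));
console.log("first coordinate of item 0:", zs[0][0], "->", rotatedZs[0][0]);
```

The last line is the practical warning in miniature: the projections agree to rounding error, while the "speed" coordinate of the same item differs between the two frames.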

Worked use cases

Horses across distances and surfaces. Two-dimensional embedding; sprint races have $v_r$ near one axis, stayers' races near the other; the bias absorbs surface-day-going.
Players across opponents or formats. Tennis (clay vs grass), chess (classical vs blitz), e-sports (different maps). Each format is a condition with its own $v_r$.
LLM benchmarks. Items are models, conditions are benchmarks. The embedding $z_h$ describes a model's skill profile; the ray $v_r$ describes which skills a benchmark exercises. Predicting performance on a held-out benchmark is just $g_{r,h}(b_r + v_r^{\!\top} z_h)$.

Code

// JS
import {
  UniformLattice, Density, AbilityCalibrator,
  MultiRayGlobalCalibrator,
} from "../js/thurstone/index.js";

const lat   = new UniformLattice(1000, 0.05);
const base  = Density.skewNormal(lat, { loc: 0, scale: 1, a: 0 });
const items = ["A", "B", "C", "D", "E"];
const mr    = new MultiRayGlobalCalibrator(items, { dim: 2 });

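// `conditions` is not defined in this snippet: it is assumed to be an array of
// { id, ids, prices } objects, one per race/condition, supplied by the caller.
for (const c of conditions) {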
  const cal = new AbilityCalibrator(base, { nIter: 3 });
  cal.solveFromPrices(c.prices);
  mr.addCondition({ condId: c.id, calibrator: cal,
                    itemIds: c.ids, prices: c.prices });
}
mr.fitWithRebuild({ outer: 3, inner: 10 });

const predictedC1 = mr.predictCondition("c1");

The Python equivalent is in thurstone/multiray.py; both implementations track the same gauge fixes and Gauss-Newton inner steps. See also Chapter 13 of the book for the geometric derivation.

Where to next

Rating systems — the scalar (one ability per item) story, which this page generalises.
2D (loc, scale) inversion demo — a different kind of two-dimensional fit (per-runner scale as consistency), not the same thing as latent embeddings but related in spirit.
thurstone/multiray.py — the reference implementation.