Rating systems across races

A single race tells you about relative abilities among the runners present, up to a translation. To build a rating system that compares runners across many races we have to stitch these per-race estimates together coherently. This page describes the three flavours thurstone supports: 1D global, 1D dynamic (over time), and multi-dimensional (over conditions).

One ability per item: curve-based global calibration

Assume each item $i$ has a scalar latent ability $\theta_i$. In race $r$ with intercept $b_r$ (the race-specific translation), runner $i$ has effective location $\mu_{r,i} = \theta_i + b_r$. The right way to fit $\{\theta_i, b_r\}$ jointly is to keep the original market prices in view at all times and match them in probability space, not in ability space. That is what GlobalAbilityCalibrator does in the Python package (and Chapter 11 of the book): for each race, tabulate the smooth, monotone ability→price curve $g_{r,i}(\mu)$ once on a grid, then run Gauss–Newton on $$L(\theta, b) = \tfrac12 \sum_{r} \sum_{i \in H_r} w_{r,i} \bigl(g_{r,i}(\theta_i + b_r) - p_{r,i}\bigr)^2,$$ where $H_r$ is the set of items running in race $r$, $p_{r,i}$ is the market price of item $i$ in that race, and $w_{r,i}$ is a per-observation weight. The updates alternate in blocks: one ridge step per race for $(b_r, v_r)$, one per item for $\theta_i$, with the curves rebuilt every few outer iterations. Because each $g_{r,i}$ is a one-dimensional lookup, each Gauss–Newton step costs only interpolation and a tiny linear solve. The fit stays fully consistent with the Thurstone forward model and recovers the centered $\theta^*$ almost perfectly even when items are sparsely connected across races.
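The per-item update can be sketched in a few lines. This is a minimal illustration, not the GlobalAbilityCalibrator API: the function names are hypothetical, performances are assumed to be unit-variance normal with the highest performance winning, and the rivals and $b_r$ are held fixed so that only one $\theta_i$ moves.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)
QUAD = np.linspace(-8.0, 8.0, 801)  # quadrature nodes for the performance integral


def norm_cdf(x):
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))


def norm_pdf(x):
    return np.exp(-0.5 * np.asarray(x, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)


def win_prob(mu_i, mu_others):
    """P(runner i posts the highest performance); unit-variance normal noise (assumed)."""
    integrand = norm_pdf(QUAD - mu_i)
    for m in mu_others:
        integrand = integrand * norm_cdf(QUAD - m)
    # Trapezoid rule written out by hand.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(QUAD)))


def tabulate_curve(mu_others, grid):
    """Tabulate the ability -> price curve g(mu) once, with the rivals held fixed."""
    return np.array([win_prob(m, mu_others) for m in grid])


def gauss_newton_step(theta_i, b_r, grid, curve, price, ridge=1e-6):
    """One ridge-damped Gauss-Newton step on 0.5 * (g(theta_i + b_r) - price)**2."""
    mu = theta_i + b_r
    g = np.interp(mu, grid, curve)                          # curve lookup
    slope = np.interp(mu, grid, np.gradient(curve, grid))   # numerical dg/dmu
    return theta_i - slope * (g - price) / (slope * slope + ridge)


grid = np.linspace(-3.0, 3.0, 121)
rivals = [0.0, -0.5]                  # hypothetical rival locations
curve = tabulate_curve(rivals, grid)
price = win_prob(0.5, rivals)         # synthetic "market price" generated at mu = 0.5

theta = 0.0
for _ in range(20):
    theta = gauss_newton_step(theta, 0.0, grid, curve, price)
print(theta)  # should settle close to 0.5, the location that generated the price
```

After the one-off tabulation, every step is two interpolations and a scalar division, which is why the per-step cost is so small.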

A fast baseline: relative-then-LS stitch

For comparison and as a quick starting point, the GlobalLSCalibrator (the only calibrator currently exposed in the JavaScript port, and the one powering the multi-race stitching demo) shortcuts the whole loop. It inverts each race independently to get per-race locations $\hat\mu_{r,i}$, centers each race by subtracting its median $\mathrm{med}_r$ (which removes $b_r$), and takes a slope-weighted average of the centered locations per item: $$\hat\theta_i = \frac{\sum_{r \ni i} w_{r,i}\,(\hat\mu_{r,i} - \mathrm{med}_r)}{\sum_{r \ni i} w_{r,i}}, \qquad w_{r,i} = \bigl| \partial p_{r,i} / \partial \mu \bigr|.$$ This is fast, diagonal by construction, and robust. But it never revisits the original prices, so all of the Thurstone structure has to flow through the noisy intermediate $\hat\mu_{r,i}$. Empirically, this gives noticeable regression toward the mean for items that appear in only a few races. Use it as a baseline, a warm start, or when you need a one-pass approximation at scale.
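The stitching formula is short enough to write out directly. The sketch below assumes the per-race inversion has already been done and that each race arrives as a mapping from item to a pair $(\hat\mu_{r,i}, w_{r,i})$. The function name and input layout are hypothetical, not the GlobalLSCalibrator interface.

```python
import numpy as np


def ls_stitch(races):
    """Slope-weighted average of median-centered per-race locations.

    races: list of dicts mapping item -> (mu_hat, weight).
    Returns a dict mapping item -> theta_hat.
    """
    num, den = {}, {}
    for race in races:
        # Centering on the race median removes the race intercept b_r.
        med = float(np.median([mu for mu, _ in race.values()]))
        for item, (mu, w) in race.items():
            num[item] = num.get(item, 0.0) + w * (mu - med)
            den[item] = den.get(item, 0.0) + w
    return {item: num[item] / den[item] for item in num}


# Two races whose intercepts differ by 5; items "a" and "b" appear in both.
races = [
    {"a": (1.0, 0.3), "b": (0.0, 0.4), "c": (-1.0, 0.3)},
    {"a": (6.0, 0.2), "b": (5.0, 0.4), "d": (4.0, 0.2)},
]
theta_hat = ls_stitch(races)
print(theta_hat)
```

In this toy input the two races agree once centered, so "a" lands at 1, "b" at 0, and "c" and "d" at -1 regardless of the weights.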

One ability per item, evolving in time

For real horses / players / models, ability drifts. KalmanAbilityTracker in the Python package treats per-race observations as noisy measurements of a slowly-varying state and runs a univariate Kalman filter per item. This is straightforward to bolt on top of the LS approach above — the JS port focuses on the static case, but the same idea applies.
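A univariate Kalman filter of this kind fits in a dozen lines. The sketch below assumes a random-walk state model with one drift step between consecutive observations of the same item. The names, the default variances, and the prior are illustrative assumptions, not the KalmanAbilityTracker interface.

```python
def kalman_update(mean, var, obs, obs_var, drift_var):
    """One predict-then-update step of a scalar random-walk Kalman filter."""
    var_pred = var + drift_var                 # predict: ability may have drifted
    gain = var_pred / (var_pred + obs_var)     # Kalman gain in [0, 1]
    new_mean = mean + gain * (obs - mean)      # move toward the new observation
    new_var = (1.0 - gain) * var_pred          # posterior uncertainty shrinks
    return new_mean, new_var


def track(observations, obs_var=1.0, drift_var=0.1, prior_mean=0.0, prior_var=1.0):
    """Run an independent filter per item over (item, observed_ability) pairs."""
    state = {}
    for item, obs in observations:
        mean, var = state.get(item, (prior_mean, prior_var))
        state[item] = kalman_update(mean, var, obs, obs_var, drift_var)
    return state


# Hypothetical centered per-race ability estimates, in time order.
obs = [("a", 1.0), ("b", -0.5), ("a", 1.2), ("a", 0.9)]
state = track(obs)
print(state)
```

The observations fed in here would be the centered per-race locations from the LS stitch, with `obs_var` ideally reflecting how noisy each race's estimate is.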

Beyond one ability per item

When races test different skills (sprint vs stayer, clay vs grass, factual recall vs long-context reasoning), collapsing each item to a single scalar throws away the signal. The natural generalisation gives each item $h$ a small latent vector $z_h$ (here $h$ indexes items, as $i$ does above) and each race a unit direction $v_r$ that selects which mixture matters; the effective ability is then $a_{r,h} = b_r + \langle v_r, z_h \rangle$. The fast ability transform still does the Thurstone race for each condition; only the parameterisation of the projected scalar changes. See the multi-dimensional page →.
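To make the projection concrete, here is a small sketch of the effective-ability formula. The helper name and the two-dimensional "sprint" and "staying" axes are illustrative assumptions.

```python
import numpy as np


def effective_ability(b_r, v_r, z):
    """a_{r,h} = b_r + <v_r, z_h> for every item h (rows of z)."""
    v = np.asarray(v_r, dtype=float)
    v = v / np.linalg.norm(v)                  # enforce the unit-direction constraint
    return b_r + np.asarray(z, dtype=float) @ v


# Three items with hypothetical (sprint, staying) latent coordinates.
z = np.array([[1.0, 0.0],    # pure sprinter
              [0.0, 1.0],    # pure stayer
              [0.6, 0.8]])   # all-rounder

sprint_race = effective_ability(0.0, [1.0, 0.0], z)   # [1.0, 0.0, 0.6]
staying_race = effective_ability(0.5, [0.0, 2.0], z)  # [0.5, 1.5, 1.3]
print(sprint_race, staying_race)
```

The same three items are ranked differently in the two races, which is exactly the signal a single scalar per item cannot represent.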

Connection to Elo, Glicko, and friends

Elo and Glicko are practical rating systems derived from Bradley–Terry-style logistic updates: each game gives a signed update proportional to actual minus expected score. They are online approximations to the pairwise maximum-likelihood ability estimate. The thurstone-based LS / Kalman approaches above generalise these to multi-entrant contests with non-trivial fields and dead-heats — while remaining cheap enough to run on a laptop or, indeed, in your browser.
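For reference, the standard Elo update for a single two-player game looks like this. The K-factor of 32 and the 400-point logistic scale are the conventional choices, used here purely for illustration.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B on the usual 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Signed update proportional to actual minus expected score.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta


new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Each update touches only the two players in the game, which is what makes the scheme online and cheap, and also what limits it to pairwise contests.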

Connection to LLM preference learning

LLM alignment objectives such as DPO are the Bradley–Terry of the LLM world: efficient, but pairwise and logit-based. Listwise methods (LiPO, 2024) move toward distributional/field objectives, closer in spirit to what this package does for races. Permutation self-consistency (Tang et al., 2023) restores set-invariance by averaging over input orderings, which is the same property the lattice algorithm has by construction.

Where to next