Out of curiosity, is there a reason why you are using Elo proper, rather than one of the Elo variants that don't make assumptions about the distribution of results? E.g.:
Hey! We actually did a lot of research into Elo consistency, i.e. checking whether or not the NxN pairwise matrix followed the Elo model. It was a long road that's probably grounds for an entirely separate blog post, but the TL;DR is that we observe the following:
For each document, there is a hidden score `s`, the "fundamental relevance according to the LLM". Then, when we sample a (q, d1, d2) comparison from the LLM, the LLM follows this statistical process (sketched in code right after the list):
- The "fundamental hidden preference" is `pref = s_{d1} - s_{d2}`, usually ranging between -4 and 4.
- The LLM samples from a normal distribution centered on `pref` with stddev ~0.2; this is some "inner noise" that the LLM experiences before coming to a judgement.
- The noisy preference is passed through a sigmoid to get a `sampled_score` \in [0, 1].
- There is an additional 2% uniform noise, i.e. `0.98 * sampled_score + 0.02 * random.random()`.
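Here's a minimal sketch of that generative process (the function and parameter names are just illustrative, not from our actual codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_judgement(s_d1, s_d2, inner_noise_std=0.2, mix=0.02):
    # Fundamental hidden preference, usually ranging between -4 and 4.
    pref = s_d1 - s_d2
    # "Inner noise" the LLM experiences before coming to a judgement.
    noisy_pref = pref + rng.normal(0.0, inner_noise_std)
    # Squash through the sigmoid into [0, 1].
    sampled_score = sigmoid(noisy_pref)
    # Additional 2% uniform noise.
    return (1.0 - mix) * sampled_score + mix * rng.random()
```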
When we use Maximum Likelihood Estimation to find the most likely hidden scores \hat{s} for each document, and then sample pairwise matrices according to `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0, 0.02) ) + Uniform(0, 0.02)`, we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices.
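For concreteness, the MLE step looks roughly like this (a simplified sketch that fits a plain Bradley-Terry-style likelihood to the sampled scores and ignores the two noise terms; the real fit would model those too):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_hidden_scores(pairs, scores, n_docs):
    """pairs: array of (d1, d2) index pairs; scores: the observed
    sampled_score values in [0, 1] for each pair."""
    pairs = np.asarray(pairs)
    scores = np.asarray(scores)

    def neg_log_likelihood(s):
        p = sigmoid(s[pairs[:, 0]] - s[pairs[:, 1]])
        eps = 1e-9
        # Soft cross-entropy: treat each score as a fractional "win".
        return -np.sum(scores * np.log(p + eps)
                       + (1.0 - scores) * np.log(1.0 - p + eps))

    res = minimize(neg_log_likelihood, np.zeros(n_docs), method="L-BFGS-B")
    return res.x - res.x.mean()  # scores are only identified up to a constant
```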
Now I'm even more confused:
1) `0.02 * random.random() != N(0, 0.02)`: the uniform noise in your bullet list doesn't match the normal noise in your resampling formula.
2) "The LLM samples from a normal distribution": the absolute scale doesn't matter in either Bradley-Terry or Elo; it only depends on your `c` parameter. So quoting a ±4 range, as if the LLM reasons in standard-normal units, is meaningless (see the sketch after this list).
3) > then we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices.

Did you ask yourselves: if the sampled pairwise matrix is "statistically identical" to the observed pairwise matrix, why bother with the model at all? You could simply use the observed pairwise matrix...
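On point 2, to make the scale-invariance concrete (a toy check; `bt_prob`, the scores, and the factor `a` are all made up for illustration):

```python
import numpy as np

def bt_prob(s_i, s_j, c=1.0):
    # Bradley-Terry / Elo win probability with scale parameter c.
    return 1.0 / (1.0 + np.exp(-(s_i - s_j) / c))

s1, s2 = 2.0, -2.0   # hidden scores on some arbitrary scale
a = 100.0            # rescale the scores and c by the same factor

assert np.isclose(bt_prob(s1, s2, c=1.0),
                  bt_prob(a * s1, a * s2, c=a))
# Identical predictions, so a "±4" range carries no absolute meaning.
```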