Reward-Free Comparison Methods for Language Models

Background: DeepSeek R1 and RL for Reasoning

DeepSeek's R1 paper demonstrates that cold-starting reinforcement learning (RL) enables models to develop reasoning capabilities because RL incentivizes exploration. However, this approach has a critical limitation: models may hallucinate or produce incorrect outputs due to initial jitter during training.

The momentum problem: Research from Anthropic shows that if a model accidentally outputs a wrong token, it's likely to continue along that flawed chain of thought due to generation momentum (Arxiv). This creates a compound error effect where early mistakes propagate through the entire response.

Best-of-N Sampling

Best-of-N sampling is a simple but powerful technique: generate multiple responses (N candidates) and select the best one based on some evaluation criteria. The challenge lies in determining what "best" means.

Selection Criteria Options

External Critic Models

One approach is to train separate models specifically to evaluate and rank the generated responses based on quality criteria.

Key insight: Generation is harder than classification. It's far easier to train a model to distinguish between good and bad outputs than to teach it to consistently generate high-quality outputs.

Why is this true?

\[\textbf{Classification}\] \[p(y \mid x), \quad |\mathcal{Y}| = K\] \[\text{Pick 1 label out of } K\] \[\textbf{Generation}\] \[p(y_1,\dots,y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid \cdot)\] \[|\mathcal{Y}| = |\mathcal{V}|^T\] \[\text{Pick a whole sequence, step by step}\] \[K \quad \ll \quad |\mathcal{V}|^T\] \[\text{Classification: one choice}\] \[\text{Generation: a path through an exponential tree}\]

Types of Critic Models:

Process Reward Models (PRMs)

Evaluate the reasoning process step-by-step
Score the quality and relevance of intermediate thinking
Help identify where reasoning went wrong

Outcome Reward Models (ORMs)

Evaluate only the final answer
Simpler to train but less granular feedback

Example:

response1 = "<think>...and so I think the answer is 42</think><final>42</final>"
Critic(response1.final) = 0.7 

response2 = "<think>...and so I think the answer is 32</think><final>32</final>"
Critic(response2.final) = 0.1

Problems with the Critic Model Approach

High computational cost: Training critic models is expensive, and then you must run them during inference alongside the base model, effectively doubling inference costs.
Reward hacking: Critic models are imperfect proxies for true quality, and base models can learn to exploit their weaknesses.
- Example: If a critic learns that longer answers correlate with higher quality (more detailed reasoning, less ambiguity), the base model might generate verbose, fluffy responses that score highly despite lacking substance.

Solution: Critic-Free Comparison Methods

Instead of training separate critic models, we can use intrinsic properties of the model's own outputs to rank responses. These methods are reward-free because they don't require external reward models.

For Closed-Ended Questions: Self-Consistency

Self-consistency is beautifully simple: the answer that appears most frequently across multiple samples is likely correct.

Example:

response 1: 1 + 1 = 2
response 2: 1 + 1 = 4  
response 3: 1 + 1 = 2

2 appears in the majority of responses, so it's selected as the final answer.

Limitation: This approach breaks down for open-ended tasks like code generation, creative writing, or problems with many valid solutions.

For Open-Ended Questions: Probability-Based Metrics

When self-consistency isn't applicable, we can leverage the model's internal probability distributions.

Average Log Probability (AvgLogP)

This metric measures the average probability the model assigned to each token it actually generated:

\[\begin{aligned} \text{Step 1: } & \log P(x_1) \\ \text{Step 2: } & \log P(x_2 \mid x_1) \\ \text{Step 3: } & \log P(x_3 \mid x_1, x_2) \\ & \;\;\vdots \\ \text{Step T: } & \log P(x_T \mid x_1, \dots, x_{T-1}) \\ \text{AvgLogP: } & \frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}) \end{aligned}\]

Higher AvgLogP means the model consistently chose high-probability tokens, indicating greater confidence in the generation.

Self-Certainty (KL-Based Metric)

Average log probability only considers the chosen tokens. But what about the shape of the entire probability distribution at each step?

Self-certainty measures how "peaked" the model's token distributions are using KL divergence:

\[\text{Self-Certainty} = - \frac{1}{n_V} \sum_{i=1}^{n_X} \sum_{j=1}^{V_X} \log \Big( V \cdot p\big(j \mid x, y \le i\big) \Big)\]

where

\[p(j \mid x, y_{\le i}) \;=\; \text{full token distribution at step } i\] \[V=\text{Vocabulary Size}\] \[p(j)=\tfrac{1}{V} \;\Rightarrow\; \log(V p(j)) = 0 \quad (\text{uniform} \;\Rightarrow\; \text{uncertain})\] \[\text{peaked } p \;\Rightarrow\; \log(V p(j)) > 0 \quad (\text{confident})\] \[\text{We divide by } \frac{1}{n_v} \text{to make the metric independent of sequence length} \\ \text{where } n_v \text{is the number of tokens produced}\]

Intuitively:

Flat distribution → uncertain model → lower self-certainty
Peaked distribution → confident model → higher self-certainty

Why this works: KL divergence measures how far the probability distribution deviates from uniform. A confident model places most probability mass on a few tokens (high KL), while an uncertain model spreads probability across many tokens (low KL).

The most uncertain model gives the uniform distribution

Research shows that self-certainty more effectively distinguishes correct from incorrect samples, especially at higher N values, compared to simple AvgLogP.

Enhanced Methods: Combining Self-Certainty with Voting

For closed-ended questions like math problems, we can combine self-certainty scoring with voting mechanisms:

Borda Count Voting:

Generate N responses
Rank all responses by their self-certainty scores
Assign weighted votes based on rank (highest-ranked gets most votes)
Select the answer with the most total votes

This approach leverages both the model's confidence (via self-certainty) and consensus (via voting), providing a robust selection mechanism without requiring external reward models.

Empirical Results

Research demonstrates that KL-divergence-inspired distributional confidence metrics:

More effectively distinguish correct samples from incorrect ones
Achieve superior accuracy at higher N values
Outperform simple token-probability metrics

Conclusion

Reward-free comparison methods offer a compelling alternative to traditional critic-model approaches:

Advantages:

No need to train expensive critic models
No inference-time overhead from running separate models
Immune to reward hacking
Leverage the base model's own uncertainty estimates

Based on: https://arxiv.org/pdf/2502.18581

My Naive implementation: Google Colab

TCombinator

Review: Scalable Best-of-N Selection for Large Language Models via Self-Certainty