From PPO to Search-R1: The Design Space of Reasoning and Agentic RL

Preface

The major advances in reasoning LLMs over the past year — o1, DeepSeek-R1, Search-R1 — share a common RL foundation, but each makes different choices at a few critical design points. Laying these choices side by side makes the design space much clearer than reading any single paper on its own.

That is what this post does. The main thread is Search-R1¹: it is a clean instance of Agentic RL, and it is simple enough to naturally tie together PPO, GRPO, verifiers, and token masking.

Readers are assumed to be familiar with transformers, SFT, and basic deep learning. Familiarity with RL internals is not assumed.

Not covered: RLHF preference-data construction, inference-time scaling (best-of-N, tree search), multi-agent, low-level hardware optimizations.

1. Problem Setup: Evolution of the Post-training Stack

Moving from a pretrained LLM to an agent that can autonomously call tools goes through several technical jumps. Each jump solves a limitation of the previous one:

Pretrain
   │ knows a lot, but doesn't follow instructions
   ↓
SFT (Supervised Fine-Tuning)
   │ imitates only; can't actively suppress wrong behavior
   ↓
RLHF (PPO + Reward Model)
   │ RM gets hacked; preference data is expensive; weak reasoning signal
   ↓
Reasoning RL (GRPO + Verifier)
   │ reward only comes from the model's own outputs; no external interaction
   ↓
Agentic RL (+ Tool / Environment)

The technical core of each jump:

SFT: cross entropy + supervised learning
RLHF: PPO + neural reward model
Reasoning RL: GRPO + rule-based verifier; drops the reward model
Agentic RL: on top of Reasoning RL, embeds environment interaction into the rollout

“Agentic RL” is not a standalone new algorithm but a training paradigm that “introduces environment interaction into the rollout.” The algorithmic core is still PPO or GRPO. The key change is extension of the environment: tool calls are inserted mid-rollout, and execution / retrieval results are injected back into the sequence, becoming context for the next generation.

Search-R1 is the cleanest instance of this paradigm: it uses a search engine as the environment and lets the LLM autonomously decide when and what to search during reasoning.

Off this main thread there is a side branch: DPO turns RLHF directly into supervised learning — no RL rollout, no reward model. §7 discusses it separately.

The following sections unfold each component in turn.

2. From Cross Entropy to Policy Gradient

This section establishes the equivalence between RL loss and SFT loss, as groundwork for understanding PPO later.

2.1 Simplifying Cross Entropy under Next-Token Prediction

General definition:

\[H(p, q) = -\sum_i p_i \log q_i\]

In the next-token setting, the ground-truth label \(p\) is one-hot with \(p_{y_t} = 1\) and zero elsewhere. Summing over the vocabulary leaves a single term:

\[H(p, q) = -\log \pi_\theta(y_t \mid y_{<t})\]

SFT sums over the whole sequence:

\[\mathcal{L}^{SFT} = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid y_{<t})\]

2.2 The Subtle Distinction Between CE and KL

The identity:

\[H(p, q) = H(p) + D_{KL}(p \| q)\]

In supervised learning \(p\) is the fixed ground truth, so \(H(p)\) is a constant with respect to \(\theta\) — meaning optimizing CE and optimizing KL have identical gradients. We default to CE only because it is cheaper.

In RL this equivalence breaks down: both sides of the KL are models (\(\pi_\theta\) vs \(\pi_{ref}\)), neither is fixed. The distinction between CE and KL becomes genuinely important here. We will use this below when discussing PPO’s KL penalty.

2.3 From SFT to REINFORCE

The SFT loss unconditionally pushes up the probability of every token — every token in the demo data is assumed to be good. That assumption is fine for imitation learning, but RL wants more: push up good behaviors, and actively suppress bad ones.

REINFORCE multiplies the SFT loss by a weight \(A_t\):

\[\mathcal{L}^{REINFORCE} = -\mathbb{E}\Big[\sum_t \log \pi_\theta(y_t \mid y_{<t}) \cdot A_t\Big]\]

\(A_t > 0\): good token, push probability up
\(A_t < 0\): bad token, push probability down
The larger \(\lvert A_t \rvert\) is, the larger the adjustment

The three losses in unified form:

Loss	Per-token form	Meaning of the weight
SFT	\(-\log \pi_\theta(y_t)\)	1 (push up every token)
REINFORCE	\(-\log \pi_\theta(y_t) \cdot A_t\)	advantage-weighted
PPO	\(-\min(r_t A_t, \text{clip}(r_t) A_t)\)	ratio replaces log (next section)

Key insight: the RL loss is structurally “signed, weighted cross entropy.” With \(A_t \equiv 1\), SFT and RL coincide in gradient form. Note this is only a formal analogy — SFT samples from demo data while RL samples from \(\pi_\theta\); the sample distributions are different.

2.4 Code Comparison

# SFT loss
def sft_loss(logits, actions, mask):
    log_probs = F.log_softmax(logits, dim=-1)
    token_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(token_logprobs * mask).sum() / mask.sum()

# REINFORCE loss: only differs by * advantages
def reinforce_loss(logits, actions, advantages, mask):
    log_probs = F.log_softmax(logits, dim=-1)
    token_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(token_logprobs * advantages * mask).sum() / mask.sum()

The only difference is one * advantages. That one multiplication opens the gap between SFT and RL: RL can suppress bad behavior with a negative weight; SFT cannot.

3. PPO: Design Motivation and the Role of Each Component

PPO² is the de-facto standard for LLM RL training today. This section breaks it down component by component, ordered as “component → what problem does it solve.”

Below is the LLM-RLHF version of PPO — it adds a \(\pi_{ref}\) KL penalty that the original PPO paper did not have. This addition is an engineering convention inherited from InstructGPT³, not part of the original PPO:

\[\mathcal{L}^{PPO}(\theta) = -\mathbb{E}_t\Big[\min\big(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\Big] + \beta \cdot D_{KL}\big[\pi_\theta \| \pi_{ref}\big]\]

3.1 Starting Point: Two Pathologies of REINFORCE

Pathology 1: high variance; a single update may blow up. \(A_t\) is estimated from a single trajectory, which is high variance. Occasional high-reward samples can blow up \(A_t\) for certain tokens, and one gradient step can push \(\pi_\theta\) into a weird region.

Pathology 2: sample inefficiency. REINFORCE is strictly on-policy: once policy updates, the previous rollouts are useless. Unacceptable for LLMs where rollouts are extremely expensive.

Each component of PPO cures these two pathologies.

3.2 Importance Ratio: Reusing the Same Batch of Rollouts

\[r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{old}(y_t \mid y_{<t})}\]

\(\pi_{old}\) is a snapshot of the policy at the time this batch of rollouts was sampled. The training loop becomes:

for iteration in range(N):
    # snapshot current policy
    pi_old = copy(pi_theta).detach()

    # collect rollouts using pi_old
    rollouts = sample_rollouts(pi_old, batch_size=B)
    advantages = compute_gae(rollouts)

    # multiple gradient updates on the same rollouts
    for epoch in range(K):  # typically K = 2~4
        for minibatch in split(rollouts):
            loss = ppo_loss(pi_theta, pi_old, minibatch, advantages)
            loss.backward()
            optimizer.step()

The same batch is reused \(K\) times. By the \(k\)-th epoch \(\pi_\theta \neq \pi_{old}\) — that is off-policy. The importance ratio mathematically corrects this mismatch (standard application of importance sampling).

Notational pitfall: \(r_t\) here is the importance ratio, not the reward. Reward becomes \(A_t\) via GAE in PPO; the reward itself does not appear in the loss.

3.3 Clip: Single-step Magnitude Constraint

The ratio alone is not enough. If \(\pi_\theta\) drifts far from \(\pi_{old}\) (e.g. \(r_t = 5\)), importance sampling variance explodes. Clip hard-caps it:

\[\text{clip}(r_t, 1-\epsilon, 1+\epsilon)\]

Typically \(\epsilon = 0.2\), meaning the single-token probability ratio can change by at most ±20%.

But clip alone has a trap. Say \(A_t = +1\) (a good token) and \(r_t = 0.5\) (the policy has been pushing it down — an error state). If we directly used \(\text{clip}(r_t) \cdot A_t = 0.8\), the gradient gets artificially inflated, encouraging the policy to keep going the wrong way.

3.4 Min: Let Clip Only Brake, Not Block Corrections

The full policy term:

\[L_t^{CLIP} = \min\big(r_t A_t,\ \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \cdot A_t\big)\]

With \(\epsilon = 0.2\), four cases:

Case	\(A_t\)	\(r_t\)	\(r_t A_t\)	\(\text{clip}(r_t) A_t\)	min picks	Meaning
B	\(+1\)	\(1.5\) (above upper bound)	\(1.5\)	\(1.2\)	clip term	good token pushed too far, saturated
C	\(+1\)	\(0.5\) (below lower bound)	\(0.5\)	\(0.8\)	raw term	good token being pushed down, correction allowed
D	\(-1\)	\(0.5\) (below lower bound)	\(-0.5\)	\(-0.8\)	clip term	bad token already suppressed enough, saturated
E	\(-1\)	\(1.5\) (above upper bound)	\(-1.5\)	\(-1.2\)	raw term	bad token being pushed up, correction allowed

The pattern:

Going too far in the right direction (B, D) → min picks the clip term, gradient saturates
Going in the wrong direction (C, E) → min picks the raw term, gradient preserved, correction continues

Schulman calls this the pessimistic bound: take the smaller value as a conservative estimate. Two birds one stone: restrict single-step magnitude (B, D) without blocking error correction (C, E).

3.5 KL Penalty: Long-term Anchor Against Drift

Clip handles single-step stability. But even if every step stays within \([1-\epsilon, 1+\epsilon]\), accumulation over many iterations may still drift \(\pi_\theta\) to a completely different distribution. The classic failure mode is reward hacking: the model discovers a generation pattern that reliably gets high reward while language quality collapses; clip never triggers at any single step, but the cumulative drift is catastrophic.

PPO adds the KL term:

\[\beta \cdot D_{KL}\big[\pi_\theta(\cdot \mid s_t) \| \pi_{ref}(\cdot \mid s_t)\big]\]

\(\pi_{ref}\) is the policy at the start of training (usually the post-SFT model), frozen throughout RL. It is not the same as \(\pi_{old}\):

Dimension	Clip on \(r_t = \pi_\theta / \pi_{old}\)	KL on \(\pi_\theta \| \pi_{ref}\)
Comparison target	policy at last rollout	policy at training start
Time scale	single iteration	entire training run
Form	hard constraint (saturating clip)	soft constraint (continuous penalty)
Problem solved	over-aggressive single step	cumulative drift

The two are complementary, not redundant. Clip only, no KL → long-run reward hacking. KL only, no clip → a single big update can crash training.

The k3 estimator. Token-level KL by definition iterates over the whole vocabulary, which is expensive. In practice people use Schulman’s k3⁴:

\[\text{KL}_{k3} = \exp(\log\pi_{ref} - \log\pi_\theta) - (\log\pi_{ref} - \log\pi_\theta) - 1\]

Strictly speaking, k3 is a low-variance unbiased estimator of \(D_{KL}[p \| q]\) when samples come from \(p\). In PPO we actually sample from \(\pi_{old}\) (because the rollout uses \(\pi_{old}\)), so this is technically an approximation. But \(\pi_{old} \approx \pi_\theta\) under the clip constraint, and in practice the approximation is good enough.

One-liner in code:

log_ratio = ref_logprobs - new_logprobs
kl = torch.exp(log_ratio) - log_ratio - 1  # k3 estimator

3.6 Full Code

def compute_ppo_loss(
    logits,          # [B, T, V] current policy outputs
    old_logprobs,    # [B, T] log prob from pi_old (detached)
    ref_logprobs,    # [B, T] log prob from pi_ref (frozen)
    actions,         # [B, T] sampled tokens
    advantages,      # [B, T] from GAE (Section 4)
    action_mask,     # [B, T] 1 for policy-generated positions
    clip_eps=0.2,
    kl_coef=0.001,
):
    log_probs = F.log_softmax(logits, dim=-1)
    new_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # === importance ratio (3.2) ===
    ratio = torch.exp(new_logprobs - old_logprobs)

    # === clip + min (3.3, 3.4) ===
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2)

    # === KL penalty (3.5, k3 estimator) ===
    log_ratio_ref = ref_logprobs - new_logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    total = (policy_loss + kl_coef * kl) * action_mask
    return total.sum() / action_mask.sum()

3.7 Five Regularization Mechanisms at a Glance

Mechanism	Target	Problem it solves
Importance ratio	data / policy mismatch	allows rollouts to be reused
Clip on ratio	single-token single-step update	single step drifting too far from \(\pi_{old}\)
Min on clip	side effects of clip	preserves correction gradient in the wrong direction
KL on \(\pi_{ref}\)	cumulative drift	reward hacking
Value baseline	advantage estimation	variance reduction, credit assignment

PPO’s engineering stability is the sum of all five. Remove any one and some setting will start breaking.

4. Sparse Reward and Credit Assignment

4.1 Problem Definition

In LLM RL, reward typically only has a nonzero value at the end of the episode. A typical Search-R1 episode:

<think>...</think><search>...</search><information>...</information>
<think>...</think><search>...</search><information>...</information>
<think>...</think><answer>McComb, Mississippi</answer>   ← reward=1 only here

500 tokens, only the last one has a nonzero reward. Using the terminal reward as \(A_t\) for every token causes two problems: all tokens get equal weight (can’t distinguish the critical ones), and gradient variance is high (each episode contributes only one scalar signal).

This is the credit assignment problem.

4.2 Value Function

The standard fix introduces a value function \(V_\phi\):

\[V_\phi(s_t) = \mathbb{E}\Big[\sum_{k=0}^{T-t} \gamma^k r_{t+k}\Big]\]

The advantage is defined as:

\[A_t = R_t - V_\phi(s_t)\]

\(A_t > 0\): actual return exceeds expectation → good choice
\(A_t < 0\): below expectation → bad choice
\(A_t \approx 0\): not critical

The value function spreads the sparse terminal reward across every token position.

4.3 Monte Carlo vs TD

Two ways to estimate \(R_t\):

Method	Formula	Property
Monte Carlo	\(R_t^{MC} = \sum_{k=0}^{T-t} \gamma^k r_{t+k}\)	unbiased, high variance
TD	\(R_t^{TD} = r_t + \gamma V_\phi(s_{t+1})\)	biased, low variance

4.4 GAE: Exponential Interpolation of the Two

GAE⁵ uses \(\lambda \in [0, 1]\) to interpolate:

\[A_t^{GAE} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\]

Recursive form:

\[A_t^{GAE} = \delta_t + \gamma \lambda A_{t+1}^{GAE}\]

def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """
    rewards: [T] per-step rewards (mostly zero in LLM setting)
    values:  [T+1] value estimates; values[T] is terminal bootstrap
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0

    # iterate in reverse: GAE is an exponentially weighted sum of future deltas
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae

    returns = advantages + values[:T]
    return advantages, returns

Search-R1 uses \(\lambda = \gamma = 1\). Under this setting the GAE recursion degenerates to \(A_t = \sum_l \delta_{t+l}\), which telescopes to \(R_t - V(s_t)\) — exactly the MC advantage.

4.5 Why Sparse Reward Works on LLMs

Pretraining provides a strong prior: RL doesn’t start from scratch — it only does fine-grained adjustment
Causal token structure: early good choices correlate highly with final success
Smoothing from the value function: spreads terminal reward to each position
Batch-size statistical power: Search-R1 uses batch = 512, gradient is relatively stable

5. Reward Source: Reward Model vs Verifier

5.1 Two Orthogonal Dimensions

The design space of LLM RL has two orthogonal axes:

Axis 1: Where does reward come from (optimize what)

Neural Reward Model (RM)
Rule-based Verifier
Generative Verifier

Axis 2: How to do credit assignment (propagate how)

Critic + GAE (PPO)
Group Normalization (GRPO)
Offline data (DPO⁶)

Reward source	Credit Assignment	Representative work
Neural RM	Critic + GAE	Classic RLHF (InstructGPT)
Rule Verifier	Critic + GAE	Search-R1
Rule Verifier	Group Norm	DeepSeek-R1⁷

5.2 Reward Model vs Critic: Two Concepts Frequently Confused

Dimension	Reward Model	Critic (Value Model)
What it evaluates	the absolute quality of the full response	the expected future return of the current state
Input	(prompt, response)	current state \(s_t\)
Output	a single scalar \(r\)	one \(V_\phi(s_t)\) per position
Training signal	human preference data (or rules)	bootstrap from reward
When trained	before RL, frozen during RL	trained together with the policy

Common misconception: “with an RM you don’t need a Critic.” In fact the two have completely different jobs: the RM compresses (prompt, response) into a sparse scalar; the Critic spreads that scalar across each token. One tells you “good or bad overall,” the other tells you “good where, bad where.”

5.3 Why Reasoning Loves Rule-based Verifiers

Four reasons:

1. Hack-resistant. The classic failure of a neural RM: policy learns to “fool” the RM. A rule-based verifier is deterministic — it cannot be hacked.

2. Scalable. Training an RM needs a lot of annotation. Writing a verifier function takes a few lines:

def math_verifier(pred, gt):
    return sympy.simplify(parse(pred) - parse(gt)) == 0

def code_verifier(pred, tests):
    return all(run_test(pred, t) for t in tests)

3. Signal is more direct. An RM is a proxy indicator; the verifier is the true target. The more direct the training objective, the better RL converges.

4. Matches the new outcome-only reward paradigm. DeepSeek-R1 showed that outcome reward alone can elicit complex reasoning. Rule-based verifiers naturally fit this paradigm.

5.4 Limitations of Verifiers

Not every task admits a good verifier:

Task	Easy to write a verifier?
Math	Yes, easy
Code	Yes, easy
Short-answer QA	Yes, moderate
Long-form writing	No, hard
Dialogue quality	No, hard

This explains why the breakthroughs of reasoning LLMs are concentrated in math / code / QA — they have natural verifiers.

6. GRPO: The Trade-off of Dropping the Critic

GRPO⁸ is proposed by DeepSeek and is the RL algorithm used by DeepSeek-R1. Core innovation: drop the Critic.

6.1 Motivation

In LLM settings, the Critic is not cheap: it is typically the same scale as the policy and needs its own forward / backward, and its value estimates are inaccurate at cold-start. When reward is already outcome-level (one scalar per response), the token-level granularity a Critic provides may not be worth the compute.

6.2 Group Normalization

Sample \(G\) responses for the same prompt, and compute advantage via intra-group normalization:

\[\hat{A}_i = \frac{r_i - \text{mean}(r_1, ..., r_G)}{\text{std}(r_1, ..., r_G)}\]

def compute_grpo_advantages(rewards):
    """
    rewards: [G] rewards for G responses to the same prompt
    """
    mean = rewards.mean()
    std = rewards.std() + 1e-8
    return (rewards - mean) / std

Intuition: among \(G\) responses to the same prompt, a response is “good” if it scores above the group mean. You don’t need the Critic’s absolute estimate — relative comparison within the group is enough.

6.3 Trade-off

Benefits:

No Critic, about half the compute
Simpler training pipeline
Reasoning tasks already have outcome-level reward; token-level advantage isn’t needed

Costs:

Token-level information lost: can’t distinguish “which tokens are critical within the same response”
Depends on \(G\) being large enough: GRPO does not work at \(G=1\)
Reward collapse risk: late in training, when all \(G\) responses are correct (or all wrong), std → 0

Search-R1’s PPO vs GRPO comparison (Qwen2.5-7B):

Method	Convergence speed	Stability	Final avg EM
PPO	slow (Critic needs warmup)	stable	0.431
GRPO	fast	tends to collapse late	0.350–0.396

6.4 How to Choose

Reasoning + short training: GRPO. reward is already outcome-level
Long training + chasing peak performance: PPO. the Critic pays for itself
Compute-constrained: GRPO. saves a model

7. DPO: Skipping the Reward Model and RL Altogether

DPO⁶ (Direct Preference Optimization) takes a completely different path: no reward model, no RL rollout. Preference pairs \((x, y_w, y_l)\) (where \(y_w\) is the chosen response and \(y_l\) the rejected one) are treated as plain supervised-learning data.

7.1 Motivation

Classic RLHF has a long pipeline: preference data → train RM → PPO rollout → policy update. Each step has its own failure modes: RM gets hacked, PPO is unstable, rollouts are expensive. The DPO question is: can we short-circuit this chain?

7.2 Key Derivation: The RLHF Optimum Has a Closed Form

The RLHF objective:

\[\max_\pi \mathbb{E}_{x, y \sim \pi}[r(x, y)] - \beta D_{KL}[\pi \| \pi_{ref}]\]

This problem has a closed-form optimal solution:

\[\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{ref}(y \mid x) \exp\Big(\frac{1}{\beta} r(x, y)\Big)\]

where \(Z(x) = \sum_y \pi_{ref}(y \mid x) \exp(r(x, y)/\beta)\) is the normalization constant. Solving for the reward:

\[r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)\]

Key observation: reward can be written as a “log probability ratio.” This is the pivot of the entire DPO derivation.

7.3 From Preference Probability to the DPO Loss

Plug into the Bradley-Terry preference model:

\[P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)\]

The subtraction kills the intractable \(\log Z(x)\) term. What remains only involves log probability ratios of the policy and the reference:

\[\mathcal{L}^{DPO} = -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\Big)\Big]\]

This is just a binary classification loss (logistic regression on preference pairs) — no reward model, no rollout, no PPO.

7.4 Code

def compute_dpo_loss(
    policy_chosen_logps,    # [B] log prob of y_w under pi_theta (sequence-level)
    policy_rejected_logps,  # [B] log prob of y_l under pi_theta
    ref_chosen_logps,       # [B] log prob of y_w under pi_ref
    ref_rejected_logps,     # [B] log prob of y_l under pi_ref
    beta=0.1,
):
    # log ratio of policy to reference for both chosen and rejected
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # how much more the policy prefers y_w over y_l, scaled by beta
    logits = beta * (chosen_ratio - rejected_ratio)

    # binary classification: push chosen above rejected
    return -F.logsigmoid(logits).mean()

Implementation detail: logps is the sum of log probabilities over the whole response (sequence-level), not per-token. This is the fundamental difference from PPO’s token-level advantage.

7.5 DPO vs PPO: Trade-off

Dimension	PPO	DPO
Data source	online rollout	offline preference pairs
Reward model	required	not required
Training stability	need to tune clip, KL, GAE, etc.	almost as simple as SFT
Data efficiency	rollouts are expensive	preferences collected once
OOD generalization	good (rollout enables exploration)	bounded by preference coverage
Granularity	token-level	sequence-level
Best fit	verifier available or online interaction	static preference data available

Key insight: DPO reduces RL to supervised learning. The cost is losing “online exploration” — the policy can only learn from a static set of preference pairs and cannot self-improve via new rollouts. This is a genuine disadvantage for OOD generalization and reasoning tasks.

7.6 When to Choose DPO

Preference data already exists and rollouts are expensive: engineering-friendly substitute for classic RLHF
Tasks like dialogue style or writing preference where verifiers are hard to write
Engineering simplicity: no desire to maintain an RM + PPO stack

Not a good fit:

Reasoning tasks with rule verifiers: online exploration in PPO / GRPO is stronger
Agentic tasks requiring multi-turn environment interaction: DPO has no concept of rollout and cannot handle this

DPO is complementary to PPO / GRPO, not a replacement. With this in mind, when you get to Search-R1 you won’t ask “why not use DPO?” — Search-R1 fundamentally requires rollouts and environment interaction.

8. Search-R1: A Complete Instance of Agentic RL

All the pieces above come together in this section.

8.1 Task Formalization

A Search-R1 rollout is an alternating interaction between the LLM and a search engine:

<think>reasoning about what to search</think>
<search>query string</search>
<information>[retrieved passages injected by environment]</information>
<think>reasoning about retrieved info</think>
<search>follow-up query</search>
<information>[more retrieved passages]</information>
...
<answer>final answer</answer>

The template constrains only structure, not strategy — RL itself discovers the search policy.

8.2 Rollout Process

def rollout_with_search(model, tokenizer, search_engine, prompt, max_turns=4):
    """
    Search-R1 rollout: interleave LLM generation with search engine calls.
    """
    tokens = tokenizer.encode(prompt)
    action_mask = [0] * len(tokens)  # prompt tokens are not LLM-generated

    for turn in range(max_turns):
        new_tokens, stop = model.generate(
            tokens,
            stop_sequences=["</search>", "</answer>"],
            max_new_tokens=500,
        )
        tokens.extend(new_tokens)
        action_mask.extend([1] * len(new_tokens))       # LLM-generated

        if stop == "</answer>":
            break
        elif stop == "</search>":
            query = extract_search_query(tokens)
            retrieved = search_engine.retrieve(query, top_k=3)
            retrieved_text = f"<information>{format_passages(retrieved)}</information>"
            retrieved_tokens = tokenizer.encode(retrieved_text)
            tokens.extend(retrieved_tokens)
            action_mask.extend([0] * len(retrieved_tokens))  # NOT LLM-generated

    return tokens, action_mask

8.3 Core Technical Contribution: Retrieved Token Masking

Problem: the rollout sequence mixes in tokens not produced by the LLM (retrieved passages). If you don’t handle them, policy gradient will compute gradients on those tokens too — which is wrong: those tokens are not actions of the policy.

Worse, failing to mask lets the loss treat “the stylistic features of search-engine output” as something the policy should imitate. The policy ends up learning to mimic the retrieval text style instead of learning “how to search.”

Solution: only compute loss over LLM-generated tokens:

\[\mathcal{L}^{Search\text{-}R1} = \frac{\sum_t I(y_t) \cdot \mathcal{L}^{PPO}_t}{\sum_t I(y_t)}, \quad I(y_t) = \begin{cases} 1 & y_t \text{ is LLM-generated} \\ 0 & y_t \text{ is retrieved} \end{cases}\]

In implementation, action_mask just gets passed into the PPO loss:

def search_r1_loss(logits, old_logprobs, ref_logprobs, actions,
                   advantages, action_mask, clip_eps=0.2, kl_coef=0.001):
    """
    Identical to standard PPO loss, except action_mask now also
    excludes retrieved tokens in addition to padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    new_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    ratio = torch.exp(new_logprobs - old_logprobs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2)

    log_ratio_ref = ref_logprobs - new_logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # action_mask = 0 for padding OR retrieved tokens
    total = (policy_loss + kl_coef * kl) * action_mask
    return total.sum() / action_mask.sum()

Effect (Qwen2.5-7B, avg EM over 7 QA datasets):

Setting	Avg EM
Do not mask retrieved tokens	0.343
Mask retrieved tokens	0.431

This single change gives ~25% relative improvement. The right definition of the action space can have a larger effect on RL training than the algorithmic changes themselves.

More abstractly: RL should only optimize variables the agent controls. Retrieved content is the environment’s response — the agent doesn’t control it, so it shouldn’t be shaped by policy gradient. This is a basic RL principle; it is easy to overlook in “pure LLM generation” settings, and Agentic RL makes it matter again.

8.4 Experimental Observations

Main results (avg EM over 7 QA datasets, Qwen2.5-7B):

Method	Avg EM
Direct Inference	0.181
RAG	0.304
R1 (RL without search)	0.276
Rejection Sampling + SFT	0.348
Search-R1 (PPO)	0.431

Emergent behaviors: the trained model exhibits patterns not explicitly taught by the paper:

Self-verification: after reaching an answer, issue one more query to confirm
Query refinement: if the first search fails, rewrite the query and retry
Problem decomposition: complex questions get automatically decomposed into sub-queries

Table 9 of the paper shows a case: asking “In which city and state was the singer of the perfume Curious born?” Search-R1 proceeds as:

<search>Curious fragrance information</search> → finds it’s Britney Spears’s
<search>Britney Spears birthplace</search> → McComb, Mississippi
<search>McComb, Mississippi location</search> → verification
<answer>McComb, Mississippi</answer>

Compare R1 without search: it guesses “Houston” (wrongly attributing Curious to Beyoncé).

9. A Panoramic Summary of the Design Space

9.1 Two Orthogonal Dimensions

                    │ Critic + GAE      │ Group Norm         │
                    │ (token-level adv) │ (response-level)   │
────────────────────┼───────────────────┼────────────────────┤
  Neural RM         │ classic RLHF      │ uncommon           │
────────────────────┼───────────────────┼────────────────────┤
  Rule Verifier     │ Search-R1         │ DeepSeek-R1        │
────────────────────┼───────────────────┼────────────────────┤
  Gen Verifier      │ emerging          │ emerging           │

The horizontal axis governs cost and stability. Critic + GAE needs one more model but is more stable; Group Norm saves a model at the cost of coarser granularity.

The vertical axis governs applicable scope. Neural RM covers more tasks but can be hacked; rule verifiers are hack-resistant but cover less; generative verifiers sit between the two.

9.2 The Essence of Agentic RL

Agentic RL is not a new algorithm — it is an extension of the environment. The environment of traditional RL for LLM has only prompt and terminal reward; the environment of Agentic RL also includes tools, intermediate feedback, and state changes.

The algorithm layer is still PPO or GRPO. The key change is in the rollout — no longer a one-shot generation but alternating interaction between policy and environment. This brings two new challenges:

Non-policy tokens mixed into the rollout sequence → need retrieved token masking
Higher rollout cost → higher bar for sample efficiency

9.3 Current Limitations and Open Problems

Reward design for open-ended tasks: rule verifiers only work for tasks with clear right and wrong. Writing, dialogue, and similar tasks still need neural RMs or human feedback.
Long-horizon reasoning: when the reasoning chain runs to tens of thousands of tokens, both Critic and Group Norm face challenges.
Multi-tool coordination: how to attribute reward across tools, how to manage the rollout of complex tool-call graphs — both still open.
Theoretical explanation of emergent behaviors under outcome reward: why do behaviors like self-verification emerge spontaneously? Under what conditions? No systematic answer yet.

10. Closing and References

Quick Index

Cross Entropy                  -log π(y_t)              per-token base loss
SFT Loss                       Σ CE                     unweighted
REINFORCE                      Σ CE · A_t               advantage-weighted
PPO                            + clip + min + KL        stabilized policy gradient
GRPO                           + group norm             drops the Critic
DPO                            logistic on pref pairs   offline path skipping RL
Search-R1                      + retrieved token mask   Agentic RL instance

Policy (π_θ)                   LLM being trained
Reference (π_ref)              LLM at training start, frozen
Old (π_old)                    snapshot at this rollout
Critic (V_φ)                   estimates future return
Reward Model                   scalar on (prompt, response)
Verifier                       general concept: function/model judging correctness

前言

过去一年 reasoning LLM 的主要进展——o1、DeepSeek-R1、Search-R1——共享一套 RL 技术基础，却在几个关键设计点上做了不同选择。把这些选择并列对比，比单独读任一篇论文都更能看清设计空间。

这篇文章做的就是这件事。主线是 Search-R1¹：它是 Agentic RL 的一个清晰实例，又足够简单，能把 PPO、GRPO、verifier、token masking 这些概念自然串起来。

假设读者熟悉 transformer、SFT 和基础 DL。不假设熟悉 RL 细节。

不覆盖：RLHF 偏好数据构造、inference-time scaling（best-of-N、tree search）、multi-agent、硬件工程优化。

1. 问题设定：post-training 的技术栈演进

从 pretrained LLM 走到能自主调用工具的 agent，中间经过几次技术跳跃。每一跳都在解决上一跳的局限：

Pretrain
   │ 知道很多，但不听话
   ↓
SFT (Supervised Fine-Tuning)
   │ 只会模仿，不会主动抑制错误
   ↓
RLHF (PPO + Reward Model)
   │ RM 易被 hack；偏好数据贵；reasoning 信号弱
   ↓
Reasoning RL (GRPO + Verifier)
   │ reward 只从模型自己的输出来，无法与外界交互
   ↓
Agentic RL (+ Tool / Environment)

每一跳的技术内核：

SFT：cross entropy + 监督学习
RLHF：PPO + 神经网络 reward model
Reasoning RL：GRPO + rule-based verifier，砍掉 reward model
Agentic RL：在 Reasoning RL 基础上，把环境交互嵌入 rollout

“Agentic RL” 并不是一个独立的新算法，而是”在 rollout 中引入环境交互”的训练范式。算法层仍然是 PPO 或 GRPO。关键变化在 环境的扩展：rollout 中间会插入工具调用，执行或检索的结果被注入到 sequence，成为后续 generation 的 context。

Search-R1 是这个范式最干净的实例：把搜索引擎作为 environment，让 LLM 在 reasoning 过程中自主决定何时搜、搜什么。

在这条线性主线之外还有一条支线：DPO 把 RLHF 直接化成监督学习，不跑 RL、不训 reward model。§7 会单独讨论。

下面的章节依次展开每个组件。

2. 从 Cross Entropy 到 Policy Gradient

本节建立 RL loss 和 SFT loss 的等价关系，作为后面理解 PPO 的地基。

2.1 Cross Entropy 在 next-token prediction 下的化简

一般定义：

\[H(p, q) = -\sum_i p_i \log q_i\]

next-token 场景下真实标签 \(p\) 是 one-hot，\(p_{y_t} = 1\)、其他为 0。对 vocab 求和只剩一项：

\[H(p, q) = -\log \pi_\theta(y_t \mid y_{<t})\]

SFT 对整个序列求和：

\[\mathcal{L}^{SFT} = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid y_{<t})\]

2.2 CE 和 KL 的细微之处

两者的恒等式：

\[H(p, q) = H(p) + D_{KL}(p \| q)\]

监督学习里 \(p\) 是 ground truth、固定不变，\(H(p)\) 对 \(\theta\) 是常数——所以 优化 CE 和优化 KL 在梯度上等价。我们默认用 CE 仅仅是因为它更便宜。

在 RL 里这个等价会失效：KL 的两边都是模型（\(\pi_\theta\) vs \(\pi_{ref}\)），都不固定。这时 CE 和 KL 的差别才真正重要。后面 PPO 的 KL penalty 会用到这一点。

2.3 从 SFT 到 REINFORCE

SFT loss 对每个 token 都在”无条件推高它的概率”——demo data 里的 token 被默认是好的。在 imitation learning 里这个假设合理，但 RL 想做的更多：既要推高好行为，也要主动压低坏行为。

REINFORCE 在 SFT loss 上乘一个权重 \(A_t\)：

\[\mathcal{L}^{REINFORCE} = -\mathbb{E}\Big[\sum_t \log \pi_\theta(y_t \mid y_{<t}) \cdot A_t\Big]\]

\(A_t > 0\)：好 token，推高概率
\(A_t < 0\)：差 token，压低概率
\(\lvert A_t \rvert\) 越大，调整幅度越大

三个 loss 的统一形式：

Loss	单 token 形式	权重含义
SFT	\(-\log \pi_\theta(y_t)\)	1（所有 token 都推高）
REINFORCE	\(-\log \pi_\theta(y_t) \cdot A_t\)	advantage 加权
PPO	\(-\min(r_t A_t, \text{clip}(r_t) A_t)\)	ratio 替代 log（下节解释）

核心洞察：RL loss 在形式上就是”带符号、加权的 cross entropy”。SFT 在 \(A_t \equiv 1\) 时和 RL 在 梯度形式 上重合。需注意这只是形式上的类比——SFT 采样自 demo data，RL 采样自 \(\pi_\theta\)，两者的样本分布并不相同。

2.4 代码对比

# SFT loss
def sft_loss(logits, actions, mask):
    log_probs = F.log_softmax(logits, dim=-1)
    token_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(token_logprobs * mask).sum() / mask.sum()

# REINFORCE loss: only differs by * advantages
def reinforce_loss(logits, actions, advantages, mask):
    log_probs = F.log_softmax(logits, dim=-1)
    token_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(token_logprobs * advantages * mask).sum() / mask.sum()

两者差别就一个 * advantages。正是这个乘法打开了 SFT 和 RL 的分野：RL 能用负权重抑制错误行为，SFT 不能。

3. PPO 的设计动机与每个组件的作用

PPO² 是当前 LLM RL 训练的事实标准。本节按”组件 → 解决什么问题”的顺序拆开。

下面写的是 LLM RLHF 版 PPO，比原 PPO 论文多了一项 \(\pi_{ref}\) 的 KL 惩罚（这是 InstructGPT³ 起带进来的工程惯例，不是原始 PPO 的一部分）：

\[\mathcal{L}^{PPO}(\theta) = -\mathbb{E}_t\Big[\min\big(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\Big] + \beta \cdot D_{KL}\big[\pi_\theta \| \pi_{ref}\big]\]

3.1 起点：REINFORCE 的两个病

病一：方差大，单步更新可能崩。 \(A_t\) 从单条 trajectory 估，方差大。偶然的高 reward 样本让某些 token 的 \(A_t\) 异常大，一次 gradient step 就能把 \(\pi_\theta\) 推到奇怪位置。

病二：样本效率太低。 REINFORCE 严格 on-policy：每次更新后旧 rollout 作废。对 LLM 这种 rollout 极贵的场景不可接受。

PPO 的每个组件都在治这两个病。

3.2 Importance Ratio：让一批 rollout 多用几次

\[r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{old}(y_t \mid y_{<t})}\]

\(\pi_{old}\) 是采样这批 rollout 时的 policy 快照。训练循环变成：

for iteration in range(N):
    # snapshot current policy
    pi_old = copy(pi_theta).detach()

    # collect rollouts using pi_old
    rollouts = sample_rollouts(pi_old, batch_size=B)
    advantages = compute_gae(rollouts)

    # multiple gradient updates on the same rollouts
    for epoch in range(K):  # typically K = 2~4
        for minibatch in split(rollouts):
            loss = ppo_loss(pi_theta, pi_old, minibatch, advantages)
            loss.backward()
            optimizer.step()

同批 rollout 被用 \(K\) 次。更新到第 \(k\) 个 epoch 时 \(\pi_\theta \neq \pi_{old}\)，这就是 off-policy 的。Importance ratio 在数学上修正这个 mismatch（importance sampling 的标准应用）。

符号陷阱：这里 \(r_t\) 是 importance ratio，不是 reward。Reward 在 PPO 中通过 GAE 化成 \(A_t\)，loss 里根本没有 reward 出现。

3.3 Clip：单步幅度约束

只有 ratio 还不够。如果 \(\pi_\theta\) 偏离 \(\pi_{old}\) 太远（比如 \(r_t = 5\)），importance sampling 的方差会爆炸。Clip 硬压回来：

\[\text{clip}(r_t, 1-\epsilon, 1+\epsilon)\]

典型 \(\epsilon = 0.2\)，policy 在单个 token 上的概率比例变化不超过 ±20%。

但单独用 clip 有坑。设 \(A_t = +1\)（好 token），\(r_t = 0.5\)（policy 反而压低了它——这是错误状态）。如果直接用 \(\text{clip}(r_t) \cdot A_t = 0.8\)，梯度被 人为放大，反而鼓励 policy 继续走错方向。

3.4 Min：让 clip 只刹车、不阻止纠错

PPO 的完整 policy 项：

\[L_t^{CLIP} = \min\big(r_t A_t,\ \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \cdot A_t\big)\]

设 \(\epsilon = 0.2\)，四种情况：

情况	\(A_t\)	\(r_t\)	\(r_t A_t\)	\(\text{clip}(r_t) A_t\)	min 选谁	含义
B	\(+1\)	\(1.5\)（超上界）	\(1.5\)	\(1.2\)	clip 项	好 token 推过头，饱和
C	\(+1\)	\(0.5\)（低于下界）	\(0.5\)	\(0.8\)	原项	好 token 反被压低，允许纠正
D	\(-1\)	\(0.5\)（低于下界）	\(-0.5\)	\(-0.8\)	clip 项	坏 token 压够了，饱和
E	\(-1\)	\(1.5\)（超上界）	\(-1.5\)	\(-1.2\)	原项	坏 token 反被推高，允许纠正

规律：

朝 正确方向 走过头（B, D）→ min 选 clip 项，梯度饱和
朝 错误方向 走（C, E）→ min 选原项，梯度保留，继续纠错

Schulman 称此为 pessimistic bound：取较小者做保守估计。一举两得：限制单步幅度（B, D），但不阻断错误修复（C, E）。

3.5 KL Penalty：长期防漂移

Clip 管单步稳定性。但即使每步都在 \([1-\epsilon, 1+\epsilon]\)，多 iteration 累积仍可能让 \(\pi_\theta\) 漂到完全不同的分布。典型失败模式是 reward hacking：模型发现某种生成模式能稳定拿高 reward，但语言质量崩坏；每步 clip 都不触发，但累积后 policy 已面目全非。

PPO 加 KL 项：

\[\beta \cdot D_{KL}\big[\pi_\theta(\cdot \mid s_t) \| \pi_{ref}(\cdot \mid s_t)\big]\]

\(\pi_{ref}\) 是 训练起点 的 policy（通常是 SFT 后的模型），整个 RL 过程冻结。它和 \(\pi_{old}\) 不是一回事：

维度	Clip on \(r_t = \pi_\theta / \pi_{old}\)	KL on \(\pi_\theta \| \pi_{ref}\)
比较对象	上一轮 rollout 时的 policy	训练起点的 policy
时间尺度	单 iteration	整个训练过程
形式	硬约束（饱和截断）	软约束（连续惩罚）
解决问题	单步更新过激	累积漂移

两者 互补，不可替代。只有 clip、没有 KL，长期会 reward hacking；只有 KL、没有 clip，单次大更新可能直接崩。

k3 estimator。Token 级 KL 按定义要遍历整个 vocab，开销大。实际用 Schulman 的 k3⁴：

\[\text{KL}_{k3} = \exp(\log\pi_{ref} - \log\pi_\theta) - (\log\pi_{ref} - \log\pi_\theta) - 1\]

严格讲 k3 是 \(D_{KL}[p \| q]\) 在 “从 \(p\) 采样” 时的低方差无偏估计。PPO 中我们实际从 \(\pi_{old}\) 采样（因为 rollout 是用 \(\pi_{old}\) 跑的），严格来说是一个近似。但 \(\pi_{old} \approx \pi_\theta\)（在 clip 的约束下），实践中这个近似足够好。

代码一行：

log_ratio = ref_logprobs - new_logprobs
kl = torch.exp(log_ratio) - log_ratio - 1  # k3 estimator

3.6 完整代码

def compute_ppo_loss(
    logits,          # [B, T, V] current policy outputs
    old_logprobs,    # [B, T] log prob from pi_old (detached)
    ref_logprobs,    # [B, T] log prob from pi_ref (frozen)
    actions,         # [B, T] sampled tokens
    advantages,      # [B, T] from GAE (Section 4)
    action_mask,     # [B, T] 1 for policy-generated positions
    clip_eps=0.2,
    kl_coef=0.001,
):
    log_probs = F.log_softmax(logits, dim=-1)
    new_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # === importance ratio (3.2) ===
    ratio = torch.exp(new_logprobs - old_logprobs)

    # === clip + min (3.3, 3.4) ===
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2)

    # === KL penalty (3.5, k3 estimator) ===
    log_ratio_ref = ref_logprobs - new_logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    total = (policy_loss + kl_coef * kl) * action_mask
    return total.sum() / action_mask.sum()

3.7 五个正则机制一览

机制	作用对象	解决问题
Importance ratio	data / policy mismatch	让 rollout 多次复用
Clip on ratio	单 token 单步更新	单次偏离 \(\pi_{old}\) 过远
Min on clip	clip 的副作用	错误方向保留纠错梯度
KL on \(\pi_{ref}\)	累积漂移	reward hacking
Value baseline	advantage 估计	降方差、实现 credit assignment

PPO 的工程稳定性是这五件事叠加的结果。去掉其中任何一个，某种 setting 下训练会变脆。

4. 稀疏 Reward 与 Credit Assignment

4.1 问题定义

LLM RL 中 reward 通常只在 episode 末尾有值。Search-R1 典型 episode：

<think>...</think><search>...</search><information>...</information>
<think>...</think><search>...</search><information>...</information>
<think>...</think><answer>McComb, Mississippi</answer>   ← reward=1 只在这里

500 个 token，只有最后一位有非零 reward。直接把 terminal reward 当作所有 token 的 \(A_t\) 会有两个问题：所有 token 等权（无法区分关键 token），梯度方差大（每条 episode 只贡献一个标量 signal）。

这就是 credit assignment 问题。

4.2 Value Function

标准解法是引入 value function \(V_\phi\)：

\[V_\phi(s_t) = \mathbb{E}\Big[\sum_{k=0}^{T-t} \gamma^k r_{t+k}\Big]\]

advantage 定义为：

\[A_t = R_t - V_\phi(s_t)\]

\(A_t > 0\)：实际回报超预期 → 好选择
\(A_t < 0\)：低于预期 → 差选择
\(A_t \approx 0\)：不关键

value function 把稀疏的 terminal reward 摊平到每个 token 位置上。

4.3 Monte Carlo vs TD

\(R_t\) 的两种估计方式：

方法	公式	性质
Monte Carlo	\(R_t^{MC} = \sum_{k=0}^{T-t} \gamma^k r_{t+k}\)	无偏、方差高
TD	\(R_t^{TD} = r_t + \gamma V_\phi(s_{t+1})\)	有偏、方差低

4.4 GAE：两者的指数加权混合

GAE⁵ 用 \(\lambda \in [0, 1]\) 插值：

\[A_t^{GAE} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\]

递推形式：

\[A_t^{GAE} = \delta_t + \gamma \lambda A_{t+1}^{GAE}\]

def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """
    rewards: [T] per-step rewards (mostly zero in LLM setting)
    values:  [T+1] value estimates; values[T] is terminal bootstrap
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0

    # iterate in reverse: GAE is an exponentially weighted sum of future deltas
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae

    returns = advantages + values[:T]
    return advantages, returns

Search-R1 用 \(\lambda = \gamma = 1\)。此时 GAE 递推退化成 \(A_t = \sum_l \delta_{t+l}\)，telescope 后等于 \(R_t - V(s_t)\)——就是 MC advantage。

4.5 为什么稀疏 reward 在 LLM 上能 work

预训练提供 strong prior：RL 不是从零学，只是 fine-grained 调整
Token 的因果结构：早期好选择和最终成功高度相关
Value function 的平滑：把 terminal reward 摊到每个位置
Batch size 的统计力量：Search-R1 batch = 512，梯度相对稳定

5. Reward 来源：Reward Model vs Verifier

5.1 两个正交维度

LLM RL 的设计空间有两个正交轴：

轴一：Reward 从哪来（optimize what）

Neural Reward Model（RM）
Rule-based Verifier
Generative Verifier

轴二：Credit Assignment 怎么做（propagate how）

Critic + GAE（PPO）
Group Normalization（GRPO）
Offline data（DPO⁶）

Reward 来源	Credit Assignment	代表工作
Neural RM	Critic + GAE	经典 RLHF（InstructGPT）
Rule Verifier	Critic + GAE	Search-R1
Rule Verifier	Group Norm	DeepSeek-R1⁷

5.2 Reward Model vs Critic：常被搞混的两个概念

维度	Reward Model	Critic (Value Model)
评什么	整条 response 的绝对质量	当前状态的未来期望回报
输入	(prompt, response)	当前 state \(s_t\)
输出	单个 scalar \(r\)	每个位置一个 \(V_\phi(s_t)\)
训练信号	人类偏好数据（或规则）	从 reward bootstrap
训练时机	RL 之前训好，RL 中冻结	和 policy 一起训

常见误解：”有 RM 就不需要 Critic”。事实上两者职责完全不同：RM 把 (prompt, response) 压成一个稀疏 scalar，Critic 把这个 scalar 摊平到每个 token。一个管”整体是好是坏”，一个管”好在哪里坏在哪里”。

5.3 为什么 Reasoning 爱用 Rule-based Verifier

四个原因：

1. 抗 reward hacking。 Neural RM 的经典问题：policy 学会 “骗过” RM。Rule-based verifier 是确定性的，无法被 hack。

2. 可扩展。 训 RM 要大量标注。写 verifier 函数只要几行：

def math_verifier(pred, gt):
    return sympy.simplify(parse(pred) - parse(gt)) == 0

def code_verifier(pred, tests):
    return all(run_test(pred, t) for t in tests)

3. 信号更直接。 RM 是 proxy indicator，verifier 是真实目标。训练目标越直接，RL 越好收敛。

4. 匹配 outcome-only reward 的新范式。 DeepSeek-R1 证明了只用 outcome reward 就能涌现复杂 reasoning。Rule-based verifier 天然适合。

5.4 Verifier 的局限

不是所有任务都有好的 verifier：

任务	Verifier 容易写吗
数学题	✅ 容易
代码	✅ 容易
短答 QA	✅ 中等
长文写作	❌ 难
对话质量	❌ 难

这解释了为什么 reasoning LLM 的突破集中在 math / code / QA——它们有天然 verifier。

6. GRPO：去掉 Critic 的 Trade-off

GRPO⁸ 由 DeepSeek 提出，是 DeepSeek-R1 用的 RL 算法。核心创新：去掉 Critic。

6.1 动机

LLM 场景下 Critic 不便宜：通常和 policy 同等规模，需要独立的 forward / backward，冷启动时 value 估计也不准。如果 reward 本来就是 outcome-level（一条 response 一个分），token-level Critic 的细粒度信号未必值这个计算开销。

6.2 Group Normalization

对同一个 prompt 采样 \(G\) 个 response，用 组内归一化 得 advantage：

\[\hat{A}_i = \frac{r_i - \text{mean}(r_1, ..., r_G)}{\text{std}(r_1, ..., r_G)}\]

def compute_grpo_advantages(rewards):
    """
    rewards: [G] rewards for G responses to the same prompt
    """
    mean = rewards.mean()
    std = rewards.std() + 1e-8
    return (rewards - mean) / std

直觉：同 prompt 下的 \(G\) 个 response，好的相对于组内均值就是好。不需要 Critic 给绝对评分，组内相对比较就够了。

6.3 Trade-off

好处：

✅ 省掉 Critic，计算量约减半
✅ 训练流程更简单
✅ Reasoning 任务 reward 本就 outcome-level，不需要 token-level advantage

代价：

❌ Token-level 信息丢失：无法区分”同一 response 中哪些 token 关键”
❌ 依赖 \(G\) 够大：\(G=1\) 时 GRPO 无法工作
❌ Reward collapse 风险：训练后期 \(G\) 个 responses 全对或全错时 std 趋零

Search-R1 的 PPO vs GRPO 对比（Qwen2.5-7B）：

方法	收敛速度	稳定性	最终 avg EM
PPO	慢（Critic 需 warmup）	稳定	0.431
GRPO	快	后期易崩	0.350–0.396

6.4 怎么选

Reasoning + 短训练：GRPO。reward 本就 outcome-level
长训练 + 追求最终性能：PPO。Critic 开销值得
计算资源紧张：GRPO。省一个模型

7. DPO：绕开 reward model 和 RL 的捷径

DPO⁶（Direct Preference Optimization）走了完全不同的路：不训 reward model，也不做 RL rollout。直接把偏好对 \((x, y_w, y_l)\)（\(y_w\) 是 chosen、\(y_l\) 是 rejected）当监督学习数据。

7.1 动机

经典 RLHF 训练链路很长：preference data → 训 RM → PPO rollout → policy update。每一步都有各自的失败模式：RM 被 hack、PPO 不稳、rollout 贵。DPO 的问题是：能不能把这条链短路？

7.2 核心推导：RLHF 的最优解有闭式

RLHF 的目标是：

\[\max_\pi \mathbb{E}_{x, y \sim \pi}[r(x, y)] - \beta D_{KL}[\pi \| \pi_{ref}]\]

这个问题有闭式最优解：

\[\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{ref}(y \mid x) \exp\Big(\frac{1}{\beta} r(x, y)\Big)\]

其中 \(Z(x) = \sum_y \pi_{ref}(y \mid x) \exp(r(x, y)/\beta)\) 是归一化常数。反解出 reward：

\[r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)\]

关键观察：reward 可以写成 “log 概率比” 的形式。这是 DPO 整个推导的支点。

7.3 从偏好概率到 DPO loss

代入 Bradley-Terry 偏好模型：

\[P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)\]

两个 reward 相减，难算的 \(\log Z(x)\) 项消掉。剩下只包含 policy 和 reference 的 log 概率比：

这就是一个普通的二分类 loss（logistic regression on preference pairs），不需要 reward model、不需要 rollout、不需要 PPO。

7.4 代码

def compute_dpo_loss(
    policy_chosen_logps,    # [B] log prob of y_w under pi_theta (sequence-level)
    policy_rejected_logps,  # [B] log prob of y_l under pi_theta
    ref_chosen_logps,       # [B] log prob of y_w under pi_ref
    ref_rejected_logps,     # [B] log prob of y_l under pi_ref
    beta=0.1,
):
    # log ratio of policy to reference for both chosen and rejected
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # how much more the policy prefers y_w over y_l, scaled by beta
    logits = beta * (chosen_ratio - rejected_ratio)

    # binary classification: push chosen above rejected
    return -F.logsigmoid(logits).mean()

实现细节：logps 是整个 response 的 log 概率之和（sequence-level），不是单 token。这和 PPO token-level advantage 是本质区别。

7.5 DPO vs PPO：Trade-off

维度	PPO	DPO
数据来源	online rollout	offline preference pairs
Reward model	需要	不需要
训练稳定性	需调 clip、KL、GAE 等超参	几乎像 SFT 一样好训
数据效率	rollout 昂贵	偏好数据一次性收集
分布外泛化	好（rollout 中能探索）	依赖偏好数据覆盖
粒度	token-level	sequence-level
适用场景	有 verifier 或在线交互	有静态偏好数据

核心洞察：DPO 把 RL 问题化成了监督学习。代价是失去了 “online exploration”——policy 只能从静态偏好对学，无法从新的 rollout 自我改进。这在分布外泛化和 reasoning 任务上是实打实的 disadvantage。

7.6 什么时候选 DPO

偏好数据已有、rollout 贵：经典 RLHF 的工程替代品
对话风格、写作偏好等 “难写 verifier” 的任务
工程简单：不想维护 RM + PPO 基础设施

不适合：

有 rule verifier 的 reasoning 任务：PPO / GRPO 的在线探索更强
需要多轮环境交互的 Agentic 任务：DPO 没有 rollout 的概念，无法处理

DPO 和 PPO / GRPO 互为补充，不是替代。理解清楚这点，后面看 Search-R1 时就不会问 “为什么不用 DPO”——因为 Search-R1 本质需要 rollout 和环境交互。

8. Search-R1：Agentic RL 的完整实例

前面所有组件在这一节组装起来。

7.1 任务形式化

Search-R1 的 rollout 是 LLM 和搜索引擎的交替交互：

<think>reasoning about what to search</think>
<search>query string</search>
<information>[retrieved passages injected by environment]</information>
<think>reasoning about retrieved info</think>
<search>follow-up query</search>
<information>[more retrieved passages]</information>
...
<answer>final answer</answer>

Template 只约束结构，不约束策略——让 RL 自己发现好的搜索 policy。

7.2 Rollout 过程

def rollout_with_search(model, tokenizer, search_engine, prompt, max_turns=4):
    """
    Search-R1 rollout: interleave LLM generation with search engine calls.
    """
    tokens = tokenizer.encode(prompt)
    action_mask = [0] * len(tokens)  # prompt tokens are not LLM-generated

    for turn in range(max_turns):
        new_tokens, stop = model.generate(
            tokens,
            stop_sequences=["</search>", "</answer>"],
            max_new_tokens=500,
        )
        tokens.extend(new_tokens)
        action_mask.extend([1] * len(new_tokens))       # LLM-generated

        if stop == "</answer>":
            break
        elif stop == "</search>":
            query = extract_search_query(tokens)
            retrieved = search_engine.retrieve(query, top_k=3)
            retrieved_text = f"<information>{format_passages(retrieved)}</information>"
            retrieved_tokens = tokenizer.encode(retrieved_text)
            tokens.extend(retrieved_tokens)
            action_mask.extend([0] * len(retrieved_tokens))  # NOT LLM-generated

    return tokens, action_mask

7.3 核心技术贡献：Retrieved Token Masking

问题：rollout sequence 里混入了非 LLM 生成的 token（retrieved passages）。如果不处理，policy gradient 会对这些 token 也算梯度——这是错的：它们不是 policy 的 action。

更糟的是，不 mask 会让 loss 把 “搜索引擎返回的文本风格” 当成 policy 要学的东西，policy 学着模仿 retrieval 文风，而不是学 “如何搜索”。

解法：loss 只对 LLM 生成的 token 计算：

实现上就是把 action_mask 传进 PPO loss：

def search_r1_loss(logits, old_logprobs, ref_logprobs, actions,
                   advantages, action_mask, clip_eps=0.2, kl_coef=0.001):
    """
    Identical to standard PPO loss, except action_mask now also
    excludes retrieved tokens in addition to padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    new_logprobs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    ratio = torch.exp(new_logprobs - old_logprobs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2)

    log_ratio_ref = ref_logprobs - new_logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # action_mask = 0 for padding OR retrieved tokens
    total = (policy_loss + kl_coef * kl) * action_mask
    return total.sum() / action_mask.sum()

效果（Qwen2.5-7B, avg EM over 7 QA datasets）：

Setting	Avg EM
不 mask retrieved tokens	0.343
Mask retrieved tokens	0.431

单这一个改动带来约 25% 相对提升。正确定义 action space 对 RL 训练的影响，可能比算法本身的改动更大。

更抽象地说：RL 应该只优化 agent 能控制的变量。Retrieved content 是环境的 response，agent 不控制它，不该被 policy gradient 塑形。这是 RL 的基础原则，在 “纯 LLM generation” setting 下容易被忽略；Agentic RL 让它重新变得重要。

7.4 实验观察

主结果（avg EM over 7 QA datasets, Qwen2.5-7B）：

方法	Avg EM
Direct Inference	0.181
RAG	0.304
R1 (RL without search)	0.276
Rejection Sampling + SFT	0.348
Search-R1 (PPO)	0.431

涌现行为：训出的模型展现了论文没有显式教授的 pattern：

Self-verification：拿到答案后再发一次 query 确认
Query refinement：首次搜索失败后改写 query 重试
Problem decomposition：复杂问题自动拆成多个 sub-query

论文 Table 9 的案例：问 “Curious 香水的歌手出生在哪个城市和州”。Search-R1 会：

<search>Curious fragrance information</search> → 发现是 Britney Spears 的
<search>Britney Spears birthplace</search> → McComb, Mississippi
<search>McComb, Mississippi location</search> → 验证确认
<answer>McComb, Mississippi</answer>

对比没有搜索能力的 R1：直接猜成 “Houston”（错把 Curious 归给 Beyoncé）。

9. 设计空间的全景总结

8.1 两个正交维度

                    │ Critic + GAE      │ Group Norm         │
                    │ (token-level adv) │ (response-level)   │
────────────────────┼───────────────────┼────────────────────┤
  Neural RM         │ 经典 RLHF          │ 少见                │
────────────────────┼───────────────────┼────────────────────┤
  Rule Verifier     │ Search-R1          │ DeepSeek-R1        │
────────────────────┼───────────────────┼────────────────────┤
  Gen Verifier      │ 新兴方向           │ 新兴方向            │

横轴决定成本和稳定性。Critic + GAE 多一个模型但更稳；Group Norm 省一个模型但粒度更粗。

纵轴决定适用范围。Neural RM 覆盖广但易 hack；Rule Verifier 抗 hack 但覆盖窄；Generative Verifier 介于两者之间。

8.2 Agentic RL 的本质

Agentic RL 不是新算法，是 环境的扩展。传统 RL for LLM 的环境只有 prompt 和 terminal reward；Agentic RL 的环境还包括工具、中间反馈、状态变化。

算法层仍是 PPO 或 GRPO。关键变化在 rollout——不再是一次性 generation，而是 policy 与环境的交替交互。这带来两个新挑战：

Rollout sequence 里混入非 policy token → 需要 retrieved token masking
Rollout 成本更高 → 对 sample efficiency 要求更高

8.3 当前的局限与开放问题

开放式任务的 reward 设计：Rule verifier 只适用有明确对错的任务。写作、对话等仍需 Neural RM 或人类反馈。
长 horizon reasoning：推理链达几万 token 时，Critic 和 Group Norm 都面临挑战。
Multi-tool 协同：reward 如何跨工具分配、rollout 如何管理复杂工具调用图，都还 open。
Outcome reward 下涌现行为的理论解释：为什么 self-verification 等行为会自发出现？在什么条件下？缺乏系统回答。

10. 结语与参考资料

快速索引

Cross Entropy                  -log π(y_t)              单 token 基础 loss
SFT Loss                       Σ CE                     无权重
REINFORCE                      Σ CE · A_t               advantage 加权
PPO                            + clip + min + KL        稳定化的 policy gradient
GRPO                           + group norm             去掉 Critic
DPO                            logistic on pref pairs   跳过 RL 的离线路线
Search-R1                      + retrieved token mask   Agentic RL 实例

Policy (π_θ)                   被训练的 LLM
Reference (π_ref)              训练起点的 LLM，冻结
Old (π_old)                    本轮 rollout 时的快照
Critic (V_φ)                   估计 future return
Reward Model                   scalar on (prompt, response)
Verifier                       广义概念：判断正确性的函数/模型

References

Jin et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv:2503.09516 ↩ ↩²
Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 ↩ ↩²
Ouyang et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 ↩ ↩²
Schulman, J. (2020). Approximating KL Divergence. http://joschu.net/blog/kl-approx.html ↩ ↩²
Schulman et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438 ↩ ↩²
Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 ↩ ↩² ↩³ ↩⁴
Guo et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948 ↩ ↩²
Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 ↩ ↩²