Policy Optimization for LLMs:
From One Loss to Scalable Recipes
How RL trains language models, derived from a single scalar loss
through REINFORCE, PPO, GRPO, and beyond.
Part 1
The Established Three:
REINFORCE → PPO → GRPO
One Scalar Loss, One Chain Rule
Training = minimize a single scalar $\mathcal{L}(\theta)$, connected to all parameters through a differentiable computational graph.
$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$$
Key requirement
$\theta$ must appear as a differentiable variable inside $\mathcal{L}(\theta)$. No differentiable path from $\theta$ to $\mathcal{L}$ → no gradient → no learning.
Problem
Sampling a token from a vocabulary is discrete, non-differentiable. This single fact forces everything that follows.
SFT — Where Differentiability Holds
Gold sequence $y^*$ exists in the training data:
$$\mathcal{L}_\text{SFT}(\theta) = -\sum_t \log \pi_\theta(a_t^* \mid s_t)$$
No sampling
Each $a_t^*$ is a fixed constant index from the dataset. $\pi_\theta(a_t^*)$ is a softmax output — fully differentiable w.r.t. $\theta$.
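A minimal PyTorch sketch of this loss (tensor names `logits` and `gold_ids` are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, gold_ids: torch.Tensor) -> torch.Tensor:
    # logits:   (T, V) scores from pi_theta -- differentiable w.r.t. theta
    # gold_ids: (T,)   gold token indices a_t^* from the dataset -- fixed constants
    log_probs = F.log_softmax(logits, dim=-1)                       # log pi_theta(. | s_t)
    token_ll = log_probs.gather(-1, gold_ids[:, None]).squeeze(-1)  # log pi_theta(a_t^* | s_t)
    return -token_ll.sum()                                          # L_SFT
```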
The RL Objective
No gold sequence. The model must sample tokens, then receive a reward. We need to optimize over all possible outputs, weighted by their probability:
$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta}[r(x,y)] = \sum_y \pi_\theta(y \mid x) \cdot r(x,y)$$
- We want $\nabla_\theta J(\theta) = \sum_y r(y) \cdot \nabla_\theta \pi_\theta(y)$ — sum over $|V|^T$ possible sequences. Completely intractable.
- This is why REINFORCE is needed.
REINFORCE — The Log-Derivative Trick
Start from the intractable gradient $\nabla_\theta J = \sum_y r(y) \cdot \nabla_\theta \pi_\theta(y)$:
- Step 1. Multiply-divide by $\pi_\theta(y)$: $= \sum_y r(y) \cdot \pi_\theta(y) \cdot \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)}$
- Step 2. Apply $\nabla f / f = \nabla \log f$: $= \sum_y \pi_\theta(y) \cdot r(y) \cdot \nabla_\theta \log \pi_\theta(y)$
- Step 3. Recognize as expectation: $\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}[r(y) \cdot \nabla_\theta \log \pi_\theta(y)]$
Now estimate the true gradient by sampling $N$ sequences and averaging:
$$\nabla_\theta J \approx \frac{1}{N}\sum_{i=1}^N r_i \cdot \nabla_\theta \log \pi_\theta(y_i)$$
- Surrogate loss (what the optimizer minimizes): $\mathcal{L}(\theta) = -\frac{1}{N}\sum_i r_i \cdot \log \pi_\theta(y_i)$
Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," Machine Learning, 1992.
REINFORCE — Intuition and Credit Assignment
$\nabla_\theta \log \pi_\theta(a_t)$ = "which direction" (make $a_t$ more likely). $r$ = "how much" to push.
Baseline
Reduces variance without bias: $\mathbb{E}[\nabla_\theta \log \pi_\theta(y) \cdot b(x)] = 0$. Advantage $A = r - b$: "how much better than expected."
- REINFORCE baseline: batch mean $\bar{r}$ across all prompts.
Credit assignment problem
Every token in a sequence gets the same scalar reward. Token 3 might be brilliant; token 200 might be the mistake. Same signal. High variance.
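As a concrete reference, here is a minimal PyTorch sketch of the REINFORCE surrogate loss with the batch-mean baseline described above; `seq_logps` and `rewards` are illustrative names, and the advantage is detached so the gradient flows only through $\log \pi_\theta$:

```python
import torch

def reinforce_loss(seq_logps: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # seq_logps: (N,) log pi_theta(y_i) = sum_t log pi_theta(a_t | s_t) -- differentiable
    # rewards:   (N,) scalar rewards r_i -- constants, no gradient
    baseline = rewards.mean()                    # batch-mean baseline b
    advantages = (rewards - baseline).detach()   # A_i = r_i - b, treated as constant
    return -(advantages * seq_logps).mean()      # minimizing this follows the REINFORCE gradient
```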
PPO — Reuse Rollouts via Importance Sampling
REINFORCE problem: each rollout used once, then discarded. PPO reuses rollouts for multiple optimizer steps. But after the first step, the policy $\pi_\theta$ has changed while the data was generated under the old $\pi_{\theta_\text{old}}$ — the rollouts are now off-policy.
Pattern A — Multi-epoch
Same batch trained 2-4 times.
Pattern B — Multi-batch sequential
Generate many rollouts, split into mini-batches, one optimizer step per mini-batch.
- After the first update, data was generated under $\pi_{\theta_\text{old}}$, not $\pi_\theta$. IS (Importance Sampling) ratio: $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$
$$J^\text{PPO}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t A_t,\; \text{clip}(\rho_t, 1{-}\epsilon, 1{+}\epsilon)A_t\right)\right]$$
- Gradient (inside clip): $\nabla_\theta J^\text{PPO} = \mathbb{E}_t[A_t \cdot \rho_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$. At $\theta = \theta_\text{old}$: $\rho_t = 1$, reduces to REINFORCE.
- $\nabla_\theta \rho_t = \frac{\nabla_\theta \pi_\theta}{\pi_{\theta_\text{old}}} = \frac{\pi_\theta}{\pi_{\theta_\text{old}}} \cdot \frac{\nabla_\theta \pi_\theta}{\pi_\theta} = \rho_t \cdot \nabla_\theta \log \pi_\theta$ ($\pi_{\theta_\text{old}}$ is a frozen constant.)
- When clip active: the clipped $\rho_t$ is replaced by a constant $(1 \pm \epsilon)$ — no longer a function of $\theta$. Differentiating a constant w.r.t. $\theta$ gives zero. Gradient killed entirely.
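A minimal PyTorch sketch of the clipped surrogate (negated, since the optimizer minimizes); all tensor names are illustrative, and the per-token advantages are assumed precomputed:

```python
import torch

def ppo_clip_loss(logp: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # logp:       (T,) log pi_theta(a_t | s_t)       -- differentiable
    # logp_old:   (T,) log pi_theta_old(a_t | s_t)   -- frozen constants
    # advantages: (T,) per-token A_t (e.g. from GAE) -- constants
    ratio = torch.exp(logp - logp_old.detach())                      # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # When the clipped branch is selected by min(.), it is constant in theta,
    # so that token contributes zero gradient.
    return -torch.min(unclipped, clipped).mean()
```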
Schulman et al., "Proximal Policy Optimization Algorithms," arXiv, OpenAI, 2017.
PPO — GAE and the Value Model
Value model $V_\phi(s_t)$: a separate neural network that predicts expected future return from position $t$. "How much total reward do I expect from here to the end?" Trained by regression: $\mathcal{L}_V = (V_\phi(s_t) - R)^2$.
TD (Temporal Difference) residual: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Compares the value model's prediction before and after one token. If $V$ jumped up at token $t$, that token was better than expected ($\delta_t > 0$). If $V$ dropped, that token hurt ($\delta_t < 0$). Per-token credit assignment.
Portfolio analogy
- $R - V(s_t)$ = "portfolio gained 300 total from day $t$ onward" (can't tell which day mattered).
- $\delta_t$ = "on day $t$ specifically, portfolio moved 50" (pinpoints the trade that helped or hurt).
GAE (Generalized Advantage Estimation): $A_t^\text{GAE} = \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}$. Exponentially weighted sum of TD residuals from token $t$ onward. Nearby $\delta$'s are weighted heavily, distant $\delta$'s decay exponentially. $\lambda$ controls how far ahead to look:
- $\lambda = 0$: use only $\delta_t$. Sharp per-token credit, but fully trusts $V$.
- $\lambda = 1$: sum all future $\delta$'s equally, which reduces to Monte Carlo $R - V(s_t)$.
- $\lambda = 0.95$: practical sweet spot. Mostly per-token, with some smoothing.
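A minimal sketch of the backward GAE recursion under these definitions (names illustrative; `values` is assumed to include a bootstrap entry $V(s_T)$, zero at a terminal state):

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    # rewards: (T,)   per-token rewards r_t
    # values:  (T+1,) value predictions V(s_0 .. s_T); last entry is the bootstrap
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals delta_t
    advantages = torch.zeros_like(rewards)
    running = torch.zeros(())                              # accumulator for the recursion
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running        # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages
```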
Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation," ICLR, 2016.
PPO — Memory and Context
PPO was the standard for RLHF (InstructGPT-style). With learned RM, 5 models on GPU:
| # | Model | Role |
|---|---|---|
| 1 | Policy $\pi_\theta$ | Trainable |
| 2 | Reference $\pi_\text{ref}$ | Frozen, KL penalty |
| 3 | Old policy $\pi_{\theta_\text{old}}$ | Frozen, for $\rho_t$ |
| 4 | Value $V_\phi$ | Trainable, GAE |
| 5 | Reward model (learned RM) | Frozen, provides $r(x,y)$ |
RLVR (verifiable rewards, math/code) changes the picture:
- Reward is rule-based and exact → no RM (drop #5). Low reward hacking risk.
- RLVR recipes omit the critic entirely → no $V_\phi$ (drop #4).
Orthogonality
Reward source (verifier vs learned RM) and gradient estimation (REINFORCE / PPO / GRPO) are independent axes. PPO+RM→RLHF, GRPO+verifier→RLVR are conventions, not requirements.
KL in PPO — Reward shaping
Per-token KL penalty $-\beta \log \frac{\pi_\theta(a_t)}{\pi_\text{ref}(a_t)}$ added to the reward at each token. Flows through GAE → per-token credit assignment for KL divergence.
GRPO — Drop the Critic
GRPO removes the value model. Sample $G$ completions per prompt, use group statistics as baseline.
Advantage
$A_i = (r_i - \mu_G) / \sigma_G$ — per-sequence, broadcast to all tokens. Per-prompt grouping: "better than average for this specific problem" is more informative than REINFORCE's batch-level baseline.
$$J^\text{GRPO}(\theta) = \frac{1}{G}\sum_{i}\frac{1}{T_i}\sum_t \min(\rho_{i,t}A_i,\; \text{clip}(\rho_{i,t})A_i) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref})$$
- Gradient (inside clip): $\nabla_\theta J^\text{GRPO} = \frac{1}{G}\sum_i A_i \cdot \frac{1}{T_i}\sum_t \rho_{i,t} \cdot \nabla_\theta \log \pi_\theta(a_{i,t})$
- Same structure as PPO. Gradient dies when clip active.
KL in GRPO — Objective regularizer
$D_\text{KL}(\pi_\theta \| \pi_\text{ref})$ enters as a regularizer on the objective, not as PPO-style per-token reward shaping through GAE.
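A minimal sketch of the group-normalized advantage; names and the small epsilon are illustrative. Each $A_i$ is then broadcast to all $T_i$ tokens of rollout $i$ and fed to the same clipped-ratio surrogate as PPO, with the KL regularizer added as a separate loss term:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) scalar rewards for G sampled completions of the same prompt
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)   # A_i, broadcast to every token of rollout i
```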
Shao et al., "DeepSeekMath," arXiv, DeepSeek, 2024. | Guo et al., "DeepSeek-R1," Nature, DeepSeek, 2025.
Part 2
Problems and Fixes
GRPO established the critic-free recipe. What's left to fix?
What's Left to Fix?
- Problem 1: Zero-gradient prompts (all correct or all wrong) → DAPO, ScaleRL
- Problem 2: Trust region issues
- 2a. Symmetric clip too tight for low-prob tokens → DAPO
- 2b. Gradient death at clip boundary → CISPO, SAPO
- 2c. Ratio measures the wrong thing → DPPO
- Problem 3: Per-token gradient magnitude bias
- 3a. $\sigma$ normalization inflates easy prompts → Dr. GRPO, ScaleRL
- 3b. Loss aggregation bias (length) → ScaleRL
- Problem 4: MoE token-level ratio instability → GSPO
ScaleRL (Khatri et al., Meta, 2025): 400k+ GPU-hours of systematic ablations. Fits sigmoidal performance-vs-compute curves separating asymptotic ceiling ($A$) from compute efficiency ($B$). A study-backed recipe, not a single named algorithm.
Khatri et al., "The Art of Scaling Reinforcement Learning Compute for LLMs," arXiv, Meta / UT Austin, 2025.
Problem 1 — Zero-Gradient Prompts
If all $G$ rollouts for a prompt are correct (or all wrong): $r_i - \mu_G = 0$. Zero advantage. Zero gradient. Prompt contributes nothing.
Dynamic Sampling (DAPO)
Discard zero-variance prompts, fill with new prompts. Effective batch size maintained. Extra generation required, but prompt diversity preserved.
Filtering (ScaleRL)
Discard zero-variance prompts, don't fill. Effective batch size shrinks. No extra generation. Also permanently retire "mastered" prompts (pass rate ≥ 0.9). Simpler.
ScaleRL verdict: filtering is preferred. Permanent retirement of easy prompts (≥ 0.9) progressively shifts the training distribution toward harder problems as the model improves — an adaptive curriculum that dynamic sampling lacks.
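A minimal sketch of ScaleRL-style filtering, assuming binary per-rollout rewards; the function name and the `(P, G)` layout are illustrative:

```python
import torch

def filter_prompts(rewards: torch.Tensor, retire_at: float = 0.9):
    # rewards: (P, G) binary rewards, P prompts x G rollouts each
    r = rewards.float()
    pass_rate = r.mean(dim=1)           # fraction of correct rollouts per prompt
    zero_var = r.var(dim=1) == 0        # all-correct or all-wrong: zero advantage, zero gradient
    keep = ~zero_var                    # drop these prompts from the current batch
    retire = pass_rate >= retire_at     # drop "mastered" prompts from the pool permanently
    return keep, retire
```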
Yu et al., "DAPO: An Open-Source LLM Reinforcement Learning System at Scale," arXiv, ByteDance Seed, 2025.
Problem 2a — Symmetric Clip Too Restrictive
Token at probability 0.01 can only rise to 0.012 before clipping ($\epsilon = 0.2$). Barely changes sampling likelihood. Critical low-probability tokens ("Wait," "However") get suppressed.
- Asymmetric clipping (DAPO): $\epsilon_\text{low} = 0.2$, $\epsilon_\text{high} = 0.28$. Positive-advantage tokens get more room.
$$J^\text{DAPO}(\theta) = \frac{1}{\sum_i T_i}\sum_i\sum_t \min\!\left(\rho_{i,t}A_i,\; \text{clip}(\rho_{i,t}, 1{-}\epsilon_l, 1{+}\epsilon_h)A_i\right)$$
- Gradient (inside clip): $\nabla_\theta J^\text{DAPO} = \frac{1}{\sum_i T_i}\sum_i A_i \sum_t \rho_{i,t} \cdot \nabla_\theta \log \pi_\theta(a_{i,t})$
- Wider asymmetric bounds mean fewer tokens hit the boundary. But when they do, same problem: $\nabla_\theta \rho_{i,t} = 0$.
Problem 2b — Gradient Death at Trust Region Boundary
PPO/GRPO/DAPO: gradient flows through $\rho_t$. When clip active, $\rho_t$ becomes constant → $\nabla_\theta \rho_t = 0$. The most informative tokens (largest probability shift) are the first to be clipped.
CISPO (MiniMax-M1)
Detach the ratio. Gradient flows through $\log \pi_\theta$ only:
$J^\text{CISPO} = \mathbb{E}[\text{sg}(\hat{\rho}_t) \cdot A_t \cdot \log \pi_\theta(a_t)]$
$\nabla_\theta J^\text{CISPO} = \mathbb{E}[\text{sg}(\hat{\rho}_t) \cdot A_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
$\text{sg}$ = stop-gradient. Ratio acts as magnitude weight only. Attenuated but never killed.
SAPO (Qwen3-VL)
Same family, sigmoid gate instead of clamp:
$w_t = \sigma(-\tau|\log \rho_t|)$
$\nabla_\theta J^\text{SAPO} = \mathbb{E}[\text{sg}(w_t) \cdot A_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
On-policy: $w_t$ high. Off-policy: $w_t \to 0$ smoothly. No discontinuity.
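Minimal sketches of both losses; tensor names and the clip/gate constants are illustrative, and `.detach()` plays the role of $\text{sg}$:

```python
import torch

def cispo_loss(logp, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    # Detached, clamped ratio used as a pure magnitude weight; gradient flows through log pi only.
    ratio = torch.exp(logp - logp_old.detach())
    weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()   # sg(rho_hat_t)
    return -(weight * advantages * logp).mean()

def sapo_loss(logp, logp_old, advantages, tau=1.0):
    # Smooth sigmoid gate instead of a hard clamp: off-policy tokens are attenuated, never zeroed.
    log_ratio = logp - logp_old.detach()
    weight = torch.sigmoid(-tau * log_ratio.abs()).detach()              # sg(w_t)
    return -(weight * advantages * logp).mean()
```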
MiniMax, "MiniMax-M1," arXiv, 2025. | Gao et al., "Soft Adaptive Policy Optimization," arXiv, Alibaba Qwen, 2025.
Problem 2c — The Ratio Measures the Wrong Thing
Over-penalized
Rare token at $10^{-4}$ doubles to $2 \times 10^{-4}$: $\rho = 2$, exceeds any clip. But distribution barely changed.
Under-penalized
Dominant token at 0.8 drops to 0.6: $\rho = 0.75$, within bounds. But 20 percentage points of mass moved.
The ratio measures relative change at one token, not actual distributional shift.
DPPO: replace the ratio-based clip with a divergence-based mask. The paper estimates policy divergence cheaply via Binary and Top-K approximations; a sampled-action proxy for the per-token divergence is:
- $D_\text{TV}(t) \approx |\pi_\theta(a_t) - \pi_{\theta_\text{old}}(a_t)|$ — tokens exceeding threshold $\tau$ are masked out.
TV (Total Variation)
$D_\text{TV} = \frac{1}{2}\sum_j |P(j) - Q(j)|$, range $[0,1]$. Measures absolute probability shift. Insensitive to large ratio changes at low probability.
KL (Kullback-Leibler)
$D_\text{KL} = \sum_j P(j)\log\frac{P(j)}{Q(j)}$, range $[0,\infty)$. Reacts to ratio changes via $\log$.
- Gradient: $\nabla_\theta J^\text{DPPO} = \mathbb{E}[M_\text{div}(D_\text{TV}, \tau) \cdot A_t \cdot \rho_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
- Masking criterion based on actual distributional shift, not ratio value. < 0.5% of updates cause instability.
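A minimal sketch of the masked surrogate using the sampled-action TV proxy; the threshold value and tensor names are illustrative, and the paper's Binary/Top-K divergence estimators are not reproduced here:

```python
import torch

def dppo_loss(logp, logp_old, advantages, tau: float = 0.1):
    # logp, logp_old: (T,) log-probs of the sampled tokens under pi_theta / pi_theta_old
    tv_proxy = (logp.exp() - logp_old.detach().exp()).abs()  # |pi_theta(a_t) - pi_theta_old(a_t)|
    mask = (tv_proxy <= tau).float().detach()                # drop tokens whose mass shifted too much
    ratio = torch.exp(logp - logp_old.detach())              # rho_t, gradient still flows through it
    return -(mask * ratio * advantages).mean()
```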
Qi et al., "Rethinking the Trust Region in LLM Reinforcement Learning," arXiv, Sea AI Lab, 2026.
Problem 3a — Normalization Bias
GRPO normalizes advantage by group standard deviation: $A_i = (r_i - \mu_G)/\sigma_G$.
Problem
Easy prompt (15/16 correct): $\sigma \approx 0.24$, one failure gets $A \approx -3.9$. Hard prompt (8/16): $\sigma = 0.5$, failure gets $A = -1$. Easy-prompt outliers produce ~4× larger advantages, dominating the batch gradient despite being least informative.
Dr. GRPO fix
Drop $\sigma$. Use $A_i = r_i - \mu_G$.
ScaleRL finding: prompt-level, batch-level, and no $\sigma$ normalization all yield similar performance. Batch-level adopted.
- Prompt-level $\sigma$: each prompt divided by its own $\sigma_G$. Different prompts get different divisors, distorting relative gradient magnitudes across prompts.
- Batch-level $\sigma$: first compute $A_i = r_i - \mu_G$ per prompt (no $\sigma_G$ division), then divide all advantages across the entire batch by one shared std. Same divisor for everyone — just a global scale factor that does not distort relative magnitudes.
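A minimal sketch of the batch-level variant, assuming a `(P, G)` reward tensor (P prompts, G rollouts each); names are illustrative:

```python
import torch

def batch_level_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (P, G) rewards for P prompts x G rollouts each
    centered = rewards - rewards.mean(dim=1, keepdim=True)   # r_i - mu_G per prompt, no sigma_G
    return centered / (centered.std() + eps)                 # one shared std over the whole batch
```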
Liu et al., "Understanding R1-Zero-Like Training: A Critical Perspective," COLM, Sea AI Lab, 2025.
Problem 3b — Loss Aggregation Bias
How tokens are weighted across sequences determines which outputs dominate the gradient.
| Method | Weighting | Bias | Used by |
|---|---|---|---|
| Sample-level ($1/T_i$ per rollout) | Each rollout equal | Short outputs favored | GRPO |
| Fixed-length ($1/T_\text{max}$) | Length-proportional | Short bias removed | Dr. GRPO |
| Token-level ($1/\sum T_i$) | Each token equal | Long outputs favored | DAPO, CISPO |
| Prompt-level (hierarchical avg) | Each problem equal | No length bias | ScaleRL |
ScaleRL verdict: prompt-level aggregation achieves highest asymptotic performance.
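A minimal sketch of the four aggregation modes from the table, assuming per-token losses indexed as `losses[prompt][rollout]`; names are illustrative, and `t_max` here merely stands in for the fixed generation-length constant used by Dr. GRPO:

```python
import torch

def aggregate(losses, mode: str = "prompt") -> torch.Tensor:
    # losses[p][i]: (T_i,) per-token losses for rollout i of prompt p
    flat = [l for prompt in losses for l in prompt]
    if mode == "sample":   # GRPO: 1/T_i per rollout, every rollout counts equally
        return torch.stack([l.mean() for l in flat]).mean()
    if mode == "fixed":    # Dr. GRPO: divide every rollout by the same constant length
        t_max = max(l.numel() for l in flat)   # stand-in for the generation length limit
        return torch.stack([l.sum() / t_max for l in flat]).mean()
    if mode == "token":    # DAPO / CISPO: 1/sum T_i, every token counts equally
        return torch.cat(flat).mean()
    if mode == "prompt":   # ScaleRL: average within each prompt, then prompts equally
        return torch.stack([torch.cat(prompt).mean() for prompt in losses]).mean()
    raise ValueError(mode)
```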
Problem 4 — MoE Token-Level Ratio Instability
MoE models: expert routing changes between $\pi_{\theta_\text{old}}$ and $\pi_\theta$. ~10% of activated experts differ after one gradient update (Qwen3-30B-A3B). Individual token ratios $\rho_t$ fluctuate wildly.
- GSPO (Qwen3): average log ratios across the sequence → one $\rho_i$ per sequence:
$$\rho_i = \exp\!\left(\frac{1}{T_i}\sum_t \log \rho_t\right) \qquad J^\text{GSPO} = \frac{1}{G}\sum_i \min\!\left(\rho_i A_i,\; \text{clip}(\rho_i)A_i\right)$$
- Gradient: $\nabla_\theta J^\text{GSPO} = \frac{1}{G}\sum_i A_i \cdot \rho_i \cdot \frac{1}{T_i}\sum_t \nabla_\theta \log \pi_\theta(a_{i,t})$
- Token-level noise averages out (law of large numbers). One clip decision per sequence.
- Substantially stabilized MoE RL training without Routing Replay.
- Tradeoff: an outlier token can drag the sequence-level ratio out of bounds and suppress the entire sequence's gradient, or conversely be averaged away by the normal tokens.
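A minimal sketch of the sequence-level ratio and clip; names are illustrative, and the clip width shown is nominal rather than the paper's value:

```python
import torch

def gspo_loss(logps, logps_old, advantages, eps: float = 0.2) -> torch.Tensor:
    # logps[i], logps_old[i]: (T_i,) token log-probs of rollout i under pi_theta / pi_theta_old
    # advantages:             (G,)  one group-normalized advantage per rollout
    terms = []
    for lp, lp_old, a in zip(logps, logps_old, advantages):
        seq_ratio = torch.exp((lp - lp_old.detach()).mean())         # exp((1/T_i) sum_t log rho_t)
        unclipped = seq_ratio * a
        clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps) * a   # one clip decision per sequence
        terms.append(torch.min(unclipped, clipped))
    return -torch.stack(terms).mean()
```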
Zheng et al., "Group Sequence Policy Optimization," arXiv, Alibaba Qwen, 2025.
Methods ↔ Problems Summary
| Problem | Methods | Core idea |
|---|---|---|
| Zero-gradient prompts | DAPO, ScaleRL | Ensure every prompt contributes signal |
| Symmetric clip too tight | DAPO | Asymmetric bounds for positive-advantage tokens |
| Gradient death at boundary | CISPO, SAPO | Detach ratio, gradient through $\log \pi_\theta$ |
| Ratio measures wrong thing | DPPO | Divergence-based mask instead of ratio clip |
| $\sigma$ normalization bias | Dr. GRPO | Drop $\sigma$, use $r - \mu$ |
| Length normalization bias | Dr. GRPO | $1/T_\text{max}$ instead of $1/T_i$ |
| Loss aggregation bias | ScaleRL | Prompt-level averaging |
| MoE ratio instability | GSPO | Sequence-level ratio |
Three Structural Axes
Axis 1 — Gradient source
- Raw $\nabla_\theta \log \pi_\theta$: REINFORCE
- Ratio $\rho_t$ (gradient = $\rho_t \cdot \nabla_\theta \log \pi_\theta$): PPO, GRPO, DAPO, GSPO, Dr.GRPO, DPPO
- Hybrid (detached $\rho$ + $\nabla_\theta \log \pi_\theta$): CISPO, SAPO
Axis 2 — Trust region
- None: REINFORCE | Symmetric clip: PPO, GRPO, Dr.GRPO | Asymmetric clip: DAPO
- Clamped detached weight: CISPO | Sigmoid gate: SAPO | Seq-level clip: GSPO | Divergence mask: DPPO
Axis 3 — Advantage / Objective
- GAE per-token (critic): PPO | $(r-\mu)/\sigma$ group-norm: GRPO, DAPO, CISPO, GSPO, SAPO
- $r - \mu$ (no $\sigma$): Dr.GRPO | Batch-level norm: ScaleRL
Part 3
Synthesis:
ScaleRL, Recipes, and Adoption
ScaleRL — The Scaling Framework
- 400k+ GPU-hours of systematic ablations across RL design choices (Meta / UT Austin, 2025).
- Fits sigmoidal curves to predict the asymptotic performance ceiling of an RL recipe from early-stage runs, without training every candidate to convergence.
- Validated with a 100k GPU-hour run to convergence: a fit on the first 50k GPU-hours accurately predicts the remaining trajectory.
Khatri et al., "The Art of Scaling Reinforcement Learning Compute for LLMs," arXiv, Meta / UT Austin, 2025.
ScaleRL Key Findings
Shifts $A$ (the ceiling)
- Loss type: CISPO/GSPO achieve substantially higher $A$ than DAPO.
- FP32 logits in LM head: single largest asymptotic gain.
Shifts $B$ (efficiency) only, not $A$
- Advantage normalization: prompt-level, batch-level, no normalization all yield similar $A$.
- Loss aggregation: prompt-level and token-level yield similar $A$, both ahead of sample-level. Prompt-level more compute-efficient (higher $B$).
- Async RL (PipelineRL): improves $B$ substantially over PPO-off-policy. Same $A$.
The ScaleRL Recipe
One concrete instantiation combining the study's best practices:
| Component | Choice | Note |
|---|---|---|
| Loss | CISPO | Detached clamped ratio × $\nabla_\theta \log \pi_\theta$ |
| Aggregation | Prompt-level | No length bias |
| Advantage | Batch-level norm | No $\sigma$ normalization |
| Precision | FP32 logits | At LM head |
| Data | Zero-variance filter | Retire easy prompts (pass rate ≥ 0.9) |
| Infra | Async RL | PipelineRL, $k=8$ |
Which Models Used What?
| Model | PO Method | Post-training pipeline |
|---|---|---|
| DeepSeek-R1 (671B MoE, 37B active; Jan 2025) | GRPO | (1) Cold-start SFT → (2) reasoning RL → (3) rejection sampling SFT → (4) alignment RL. |
| DeepSeek-R1-Distill (1.5B–70B; Jan 2025) | — | No direct RL. Distillation of R1's reasoning outputs into Qwen2.5 and Llama3 base models via SFT. "We believe that applying RL to the distilled models would yield significant further improvements, which we leave for future work." |
| Qwen3 flagship (235B-A22B; Apr 2025) | GSPO | (1) Long-CoT cold-start SFT → (2) reasoning RL → (3) thinking mode fusion SFT (learn to toggle thinking/non-thinking) → (4) general RL (20+ task domains). |
| Qwen3 small (0.6B–14B; Apr 2025) | — | No direct RL. Strong-to-weak distillation from flagship: (1) off-policy distillation → (2) on-policy distillation. "Distillation from advanced teacher models significantly outperforms reinforcement learning in performance and training efficiency." |
| MiniMax-M1 (456B MoE, 45.9B active; Jun 2025) | CISPO | (1) Continual pretraining (reasoning-intensive) → (2) cold-start SFT → (3) CISPO RL. |
| MiniMax-M2.5 (230B MoE, 10B active; Feb 2026) | CISPO | CISPO RL at scale across 200k+ real-world environments (code, office, web). Process reward for long-horizon credit assignment. Detailed pipeline not disclosed. |
| GLM-5 (744B MoE, 40B active; Feb 2026) | Undisclosed | (1) SFT → (2) reasoning RL → (3) agentic RL → (4) general RL. |
Training Pipeline Context
Policy Optimization is one stage in a multi-stage pipeline. Common pattern:
SFT → Reasoning RL → (optional stages) → General RL
- Trend: later models expand RL scope. R1 and Qwen3 focus on reasoning RL. GLM-5 adds a dedicated agentic RL stage. MiniMax-M2.5 scales RL across 200k+ real-world environments (code, office, web).
- Small models skip RL entirely. Both DeepSeek-R1-Distill and Qwen3 small use distillation from flagship instead. Qwen3 explicitly claims distillation outperforms RL for smaller models; DeepSeek-R1 reports it for the Qwen2.5-32B case.
- NVIDIA's AceReason-Nemotron shows distillation → RL works in small models: GRPO on top of DeepSeek-R1-Distill-Qwen-7B yields +14.6pp on AIME 2025 and +6.8pp on LiveCodeBench (7B, math-only RL).
Takeaways
- One equation underlies everything: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta \cdot A]$. All methods are strategies for making this gradient more accurate, lower variance, or more stable.
- Critic-free training won for RLVR. Per-token credit (PPO/GAE) lost to simplicity + scale.
- Softer trust regions beat hard clipping. CISPO's "never kill gradient" outperforms PPO-style masking and is most robust to hyperparameters (ScaleRL).
- Not all recipes scale equally. Loss type and FP32 precision shift the ceiling $A$. Most other choices only modulate efficiency $B$.
- $\mathcal{L}(\theta)$: one scalar, one chain rule. Everything since slide 2 is about choosing the right scalar.