Policy Optimization for LLMs:
From One Loss to Scalable Recipes
How RL trains language models, derived from a single scalar loss
through REINFORCE, PPO, GRPO, and beyond.
Part 1
The Established Three:
REINFORCE → PPO → GRPO
One Scalar Loss, One Chain Rule
Training = minimize a single scalar $\mathcal{L}(\theta)$, connected to all parameters through a differentiable computational graph.
$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$$
Key requirement
$\theta$ must appear as a differentiable variable inside $\mathcal{L}(\theta)$. No differentiable path from $\theta$ to $\mathcal{L}$ → no gradient → no learning.
Problem
Sampling a token from a vocabulary is discrete, non-differentiable. This single fact forces everything that follows.
SFT — Where Differentiability Holds
Gold sequence $y^*$ exists in the training data:
$$\mathcal{L}_\text{SFT}(\theta) = -\sum_t \log \pi_\theta(a_t^* \mid s_t)$$
No sampling
Each $a_t^*$ is a fixed constant index from the dataset. $\pi_\theta(a_t^*)$ is a softmax output — fully differentiable w.r.t. $\theta$.
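A minimal PyTorch sketch of this loss (tensor names `logits` and `gold_ids` are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, gold_ids: torch.Tensor) -> torch.Tensor:
    # logits:   (T, V) scores from pi_theta -- differentiable w.r.t. theta
    # gold_ids: (T,)   gold token indices a_t^* from the dataset -- fixed constants
    log_probs = F.log_softmax(logits, dim=-1)                       # log pi_theta(. | s_t)
    token_ll = log_probs.gather(-1, gold_ids[:, None]).squeeze(-1)  # log pi_theta(a_t^* | s_t)
    return -token_ll.sum()                                          # L_SFT
```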
The RL Objective
No gold sequence. The model must sample tokens, then receive a reward. We need to optimize over all possible outputs, weighted by their probability:
$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta}[r(x,y)] = \sum_y \pi_\theta(y \mid x) \cdot r(x,y)$$
- We want $\nabla_\theta J(\theta) = \sum_y r(y) \cdot \nabla_\theta \pi_\theta(y)$ — sum over $|V|^T$ possible sequences. Completely intractable.
- This is why REINFORCE is needed.
REINFORCE — The Log-Derivative Trick
Start from the intractable gradient $\nabla_\theta J = \sum_y r(y) \cdot \nabla_\theta \pi_\theta(y)$:
- Step 1. Multiply-divide by $\pi_\theta(y)$: $= \sum_y r(y) \cdot \pi_\theta(y) \cdot \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)}$
- Step 2. Apply $\nabla f / f = \nabla \log f$: $= \sum_y \pi_\theta(y) \cdot r(y) \cdot \nabla_\theta \log \pi_\theta(y)$
- Step 3. Recognize as expectation: $\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}[r(y) \cdot \nabla_\theta \log \pi_\theta(y)]$
Now estimate the true gradient by sampling $N$ sequences and averaging:
$$\nabla_\theta J \approx \frac{1}{N}\sum_{i=1}^N r_i \cdot \nabla_\theta \log \pi_\theta(y_i)$$
- Surrogate loss (what the optimizer minimizes): $\mathcal{L}(\theta) = -\frac{1}{N}\sum_i r_i \cdot \log \pi_\theta(y_i)$
Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," Machine Learning, 1992.
REINFORCE — Intuition and Credit Assignment
$\nabla_\theta \log \pi_\theta(a_t)$ = "which direction" (make $a_t$ more likely). $r$ = "how much" to push.
Baseline
Reduces variance without bias: $\mathbb{E}[\nabla_\theta \log \pi_\theta(y) \cdot b(x)] = 0$. Advantage $A = r - b$: "how much better than expected."
- REINFORCE baseline: batch mean $\bar{r}$ across all prompts.
Credit assignment problem
Every token in a sequence gets the same scalar reward. Token 3 might be brilliant; token 200 might be the mistake. Same signal. High variance.
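As a concrete reference, here is a minimal PyTorch sketch of the REINFORCE surrogate loss with the batch-mean baseline described above; `seq_logps` and `rewards` are illustrative names, and the advantage is detached so the gradient flows only through $\log \pi_\theta$:

```python
import torch

def reinforce_loss(seq_logps: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # seq_logps: (N,) log pi_theta(y_i) = sum_t log pi_theta(a_t | s_t) -- differentiable
    # rewards:   (N,) scalar rewards r_i -- constants, no gradient
    baseline = rewards.mean()                    # batch-mean baseline b
    advantages = (rewards - baseline).detach()   # A_i = r_i - b, treated as constant
    return -(advantages * seq_logps).mean()      # minimizing this follows the REINFORCE gradient
```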
PPO — Reuse Rollouts via Importance Sampling
REINFORCE problem: each rollout used once, then discarded. PPO reuses rollouts for multiple optimizer steps. But after the first step, the policy $\pi_\theta$ has changed while the data was generated under the old $\pi_{\theta_\text{old}}$ — the rollouts are now off-policy.
Pattern A — Multi-epoch
Same batch trained 2-4 times.
Pattern B — Multi-batch sequential
Generate many rollouts, split into mini-batches, one optimizer step per mini-batch.
- After the first update, data was generated under $\pi_{\theta_\text{old}}$, not $\pi_\theta$. IS (Importance Sampling) ratio: $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$
$$J^\text{PPO}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t A_t,\; \text{clip}(\rho_t, 1{-}\epsilon, 1{+}\epsilon)A_t\right)\right]$$
- Gradient (inside clip): $\nabla_\theta J^\text{PPO} = \mathbb{E}_t[A_t \cdot \rho_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$. At $\theta = \theta_\text{old}$: $\rho_t = 1$, reduces to REINFORCE.
- $\nabla_\theta \rho_t = \frac{\nabla_\theta \pi_\theta}{\pi_{\theta_\text{old}}} = \frac{\pi_\theta}{\pi_{\theta_\text{old}}} \cdot \frac{\nabla_\theta \pi_\theta}{\pi_\theta} = \rho_t \cdot \nabla_\theta \log \pi_\theta$ ($\pi_{\theta_\text{old}}$ is a frozen constant.)
- When clip active: the clipped $\rho_t$ is replaced by a constant $(1 \pm \epsilon)$ — no longer a function of $\theta$. Differentiating a constant w.r.t. $\theta$ gives zero. Gradient killed entirely.
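A minimal PyTorch sketch of the clipped surrogate (negated, since the optimizer minimizes); all tensor names are illustrative, and the per-token advantages are assumed precomputed:

```python
import torch

def ppo_clip_loss(logp: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # logp:       (T,) log pi_theta(a_t | s_t)       -- differentiable
    # logp_old:   (T,) log pi_theta_old(a_t | s_t)   -- frozen constants
    # advantages: (T,) per-token A_t (e.g. from GAE) -- constants
    ratio = torch.exp(logp - logp_old.detach())                      # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # When the clipped branch is selected by min(.), it is constant in theta,
    # so that token contributes zero gradient.
    return -torch.min(unclipped, clipped).mean()
```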
Schulman et al., "Proximal Policy Optimization Algorithms," arXiv, OpenAI, 2017.
PPO — GAE and the Value Model
Value model $V_\phi(s_t)$: a separate neural network that predicts expected future return from position $t$. "How much total reward do I expect from here to the end?" Trained by regression: $\mathcal{L}_V = (V_\phi(s_t) - R)^2$.
TD (Temporal Difference) residual: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Compares the value model's prediction before and after one token. If $V$ jumped up at token $t$, that token was better than expected ($\delta_t > 0$). If $V$ dropped, that token hurt ($\delta_t < 0$). Per-token credit assignment.
Portfolio analogy
- $R - V(s_t)$ = "portfolio gained 300 total from day $t$ onward" (can't tell which day mattered).
- $\delta_t$ = "on day $t$ specifically, portfolio moved 50" (pinpoints the trade that helped or hurt).
GAE (Generalized Advantage Estimation): $A_t^\text{GAE} = \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}$. Exponentially weighted sum of TD residuals from token $t$ onward. Nearby $\delta$'s are weighted heavily, distant $\delta$'s decay exponentially. $\lambda$ controls how far ahead to look:
- $\lambda = 0$: use only $\delta_t$. Sharp per-token credit, but fully trusts $V$.
- $\lambda = 1$: sum all future $\delta$'s equally, which reduces to Monte Carlo $R - V(s_t)$.
- $\lambda = 0.95$: practical sweet spot. Mostly per-token, with some smoothing.
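A minimal sketch of the backward GAE recursion under these definitions (names illustrative; `values` is assumed to include a bootstrap entry $V(s_T)$, zero at a terminal state):

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    # rewards: (T,)   per-token rewards r_t
    # values:  (T+1,) value predictions V(s_0 .. s_T); last entry is the bootstrap
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals delta_t
    advantages = torch.zeros_like(rewards)
    running = torch.zeros(())                              # accumulator for the recursion
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running        # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages
```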
Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation," ICLR, 2016.
PPO — Memory and Context
PPO was the standard for RLHF (InstructGPT-style). With learned RM, 5 models on GPU:
| # | Model | Role |
|---|---|---|
| 1 | Policy $\pi_\theta$ | Trainable |
| 2 | Reference $\pi_\text{ref}$ | Frozen, KL penalty |
| 3 | Old policy $\pi_{\theta_\text{old}}$ | Frozen, for $\rho_t$ |
| 4 | Value $V_\phi$ | Trainable, GAE |
| 5 | Reward model (learned RM) | Frozen, provides $r(x,y)$ |
RLVR (verifiable rewards, math/code) changes the picture:
- Reward is rule-based and exact → no RM (drop #5). Low reward hacking risk.
- RLVR recipes omit the critic entirely → no $V_\phi$ (drop #4).
Orthogonality
Reward source (verifier vs learned RM) and gradient estimation (REINFORCE / PPO / GRPO) are independent axes. PPO+RM→RLHF, GRPO+verifier→RLVR are conventions, not requirements.
KL in PPO — Reward shaping
Per-token KL penalty $-\beta \log \frac{\pi_\theta(a_t)}{\pi_\text{ref}(a_t)}$ added to the reward at each token. Flows through GAE → per-token credit assignment for KL divergence.
GRPO — Drop the Critic
GRPO removes the value model. Sample $G$ completions per prompt, use group statistics as baseline.
Advantage
$A_i = (r_i - \mu_G) / \sigma_G$ — per-sequence, broadcast to all tokens. Per-prompt grouping: "better than average for this specific problem" is more informative than REINFORCE's batch-level baseline.
$$J^\text{GRPO}(\theta) = \frac{1}{G}\sum_{i}\frac{1}{T_i}\sum_t \min(\rho_{i,t}A_i,\; \text{clip}(\rho_{i,t})A_i) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref})$$
- Gradient (inside clip): $\nabla_\theta J^\text{GRPO} = \frac{1}{G}\sum_i A_i \cdot \frac{1}{T_i}\sum_t \rho_{i,t} \cdot \nabla_\theta \log \pi_\theta(a_{i,t})$
- Same structure as PPO. Gradient dies when clip active.
KL in GRPO — Objective regularizer
$D_\text{KL}(\pi_\theta \| \pi_\text{ref})$ enters as a regularizer on the objective, not as PPO-style per-token reward shaping through GAE.
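A minimal sketch of the group-normalized advantage; names and the small epsilon are illustrative. Each $A_i$ is then broadcast to all $T_i$ tokens of rollout $i$ and fed to the same clipped-ratio surrogate as PPO, with the KL regularizer added as a separate loss term:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) scalar rewards for G sampled completions of the same prompt
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)   # A_i, broadcast to every token of rollout i
```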
Shao et al., "DeepSeekMath," arXiv, DeepSeek, 2024. | Guo et al., "DeepSeek-R1," Nature, DeepSeek, 2025.
Part 2
Problems and Fixes
GRPO established the critic-free recipe. What's left to fix?
What's Left to Fix?
- Problem 1: Zero-gradient prompts (all correct or all wrong) → DAPO, ScaleRL
- Problem 2: Trust region issues
- 2a. Symmetric clip too tight for low-prob tokens → DAPO
- 2b. Gradient death at clip boundary → CISPO, SAPO
- 2c. Ratio measures the wrong thing → DPPO
- Problem 3: Per-token gradient magnitude bias
- 3a. $\sigma$ normalization inflates easy prompts → Dr. GRPO, ScaleRL
- 3b. Loss aggregation bias (length) → ScaleRL
- Problem 4: MoE token-level ratio instability → GSPO
ScaleRL (Khatri et al., Meta, 2025): 400k+ GPU-hours of systematic ablations. Fits sigmoidal performance-vs-compute curves separating asymptotic ceiling ($A$) from compute efficiency ($B$). A study-backed recipe, not a single named algorithm.
Khatri et al., "The Art of Scaling Reinforcement Learning Compute for LLMs," arXiv, Meta / UT Austin, 2025.
Problem 1 — Zero-Gradient Prompts
If all $G$ rollouts for a prompt are correct (or all wrong): $r_i - \mu_G = 0$. Zero advantage. Zero gradient. Prompt contributes nothing.
Dynamic Sampling (DAPO)
Discard zero-variance prompts, fill with new prompts. Effective batch size maintained. Extra generation required, but prompt diversity preserved.
Filtering (ScaleRL)
Discard zero-variance prompts, don't fill. Effective batch size shrinks. No extra generation. Also permanently retire "mastered" prompts (pass rate ≥ 0.9). Simpler.
ScaleRL verdict: filtering is preferred. Permanent retirement of easy prompts (≥ 0.9) progressively shifts the training distribution toward harder problems as the model improves — an adaptive curriculum that dynamic sampling lacks.
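A minimal sketch of ScaleRL-style filtering, assuming binary per-rollout rewards; the function name and the `(P, G)` layout are illustrative:

```python
import torch

def filter_prompts(rewards: torch.Tensor, retire_at: float = 0.9):
    # rewards: (P, G) binary rewards, P prompts x G rollouts each
    r = rewards.float()
    pass_rate = r.mean(dim=1)           # fraction of correct rollouts per prompt
    zero_var = r.var(dim=1) == 0        # all-correct or all-wrong: zero advantage, zero gradient
    keep = ~zero_var                    # drop these prompts from the current batch
    retire = pass_rate >= retire_at     # drop "mastered" prompts from the pool permanently
    return keep, retire
```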
Yu et al., "DAPO: An Open-Source LLM Reinforcement Learning System at Scale," arXiv, ByteDance Seed, 2025.
Problem 2a — Symmetric Clip Too Restrictive
Token at probability 0.01 can only rise to 0.012 before clipping ($\epsilon = 0.2$). Barely changes sampling likelihood. Critical low-probability tokens ("Wait," "However") get suppressed.
- Asymmetric clipping (DAPO): $\epsilon_\text{low} = 0.2$, $\epsilon_\text{high} = 0.28$. Positive-advantage tokens get more room.
$$J^\text{DAPO}(\theta) = \frac{1}{\sum_i T_i}\sum_i\sum_t \min\!\left(\rho_{i,t}A_i,\; \text{clip}(\rho_{i,t}, 1{-}\epsilon_l, 1{+}\epsilon_h)A_i\right)$$
- Gradient (inside clip): $\nabla_\theta J^\text{DAPO} = \frac{1}{\sum_i T_i}\sum_i A_i \sum_t \rho_{i,t} \cdot \nabla_\theta \log \pi_\theta(a_{i,t})$
- Wider asymmetric bounds mean fewer tokens hit the boundary. But when they do, same problem: $\nabla_\theta \rho_{i,t} = 0$.
Problem 2b — Gradient Death at Trust Region Boundary
PPO/GRPO/DAPO: gradient flows through $\rho_t$. When clip active, $\rho_t$ becomes constant → $\nabla_\theta \rho_t = 0$. The most informative tokens (largest probability shift) are the first to be clipped.
CISPO (MiniMax-M1)
Detach the ratio. Gradient flows through $\log \pi_\theta$ only:
$J^\text{CISPO} = \mathbb{E}[\text{sg}(\hat{\rho}_t) \cdot A_t \cdot \log \pi_\theta(a_t)]$
$\nabla_\theta J^\text{CISPO} = \mathbb{E}[\text{sg}(\hat{\rho}_t) \cdot A_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
$\text{sg}$ = stop-gradient. Ratio acts as magnitude weight only. Attenuated but never killed.
SAPO (Qwen3-VL)
Same family, sigmoid gate instead of clamp:
$w_t = \sigma(-\tau|\log \rho_t|)$
$\nabla_\theta J^\text{SAPO} = \mathbb{E}[\text{sg}(w_t) \cdot A_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
On-policy: $w_t$ high. Off-policy: $w_t \to 0$ smoothly. No discontinuity.
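Minimal sketches of both losses; tensor names and the clip/gate constants are illustrative, and `.detach()` plays the role of $\text{sg}$:

```python
import torch

def cispo_loss(logp, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    # Detached, clamped ratio used as a pure magnitude weight; gradient flows through log pi only.
    ratio = torch.exp(logp - logp_old.detach())
    weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()   # sg(rho_hat_t)
    return -(weight * advantages * logp).mean()

def sapo_loss(logp, logp_old, advantages, tau=1.0):
    # Smooth sigmoid gate instead of a hard clamp: off-policy tokens are attenuated, never zeroed.
    log_ratio = logp - logp_old.detach()
    weight = torch.sigmoid(-tau * log_ratio.abs()).detach()              # sg(w_t)
    return -(weight * advantages * logp).mean()
```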
MiniMax, "MiniMax-M1," arXiv, 2025. | Gao et al., "Soft Adaptive Policy Optimization," arXiv, Alibaba Qwen, 2025.
Problem 2c — The Ratio Measures the Wrong Thing
Over-penalized
Rare token at $10^{-4}$ doubles to $2 \times 10^{-4}$: $\rho = 2$, exceeds any clip. But distribution barely changed.
Under-penalized
Dominant token at 0.8 drops to 0.6: $\rho = 0.75$, within bounds. But 20 percentage points of mass moved.
The ratio measures relative change at one token, not actual distributional shift.
DPPO: replace the ratio-based clip with a divergence-based mask. The paper estimates policy divergence cheaply via Binary and Top-K approximations; a sampled-action proxy for the per-token divergence is:
- $D_\text{TV}(t) \approx |\pi_\theta(a_t) - \pi_{\theta_\text{old}}(a_t)|$ — tokens exceeding threshold $\tau$ are masked out.
TV (Total Variation)
$D_\text{TV} = \frac{1}{2}\sum_j |P(j) - Q(j)|$, range $[0,1]$. Measures absolute probability shift. Insensitive to large ratio changes at low probability.
KL (Kullback-Leibler)
$D_\text{KL} = \sum_j P(j)\log\frac{P(j)}{Q(j)}$, range $[0,\infty)$. Reacts to ratio changes via $\log$.
- Gradient: $\nabla_\theta J^\text{DPPO} = \mathbb{E}[M_\text{div}(D_\text{TV}, \tau) \cdot A_t \cdot \rho_t \cdot \nabla_\theta \log \pi_\theta(a_t)]$
- Masking criterion based on actual distributional shift, not ratio value. < 0.5% of updates cause instability.
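A minimal sketch of the masked surrogate using the sampled-action TV proxy; the threshold value and tensor names are illustrative, and the paper's Binary/Top-K divergence estimators are not reproduced here:

```python
import torch

def dppo_loss(logp, logp_old, advantages, tau: float = 0.1):
    # logp, logp_old: (T,) log-probs of the sampled tokens under pi_theta / pi_theta_old
    tv_proxy = (logp.exp() - logp_old.detach().exp()).abs()  # |pi_theta(a_t) - pi_theta_old(a_t)|
    mask = (tv_proxy <= tau).float().detach()                # drop tokens whose mass shifted too much
    ratio = torch.exp(logp - logp_old.detach())              # rho_t, gradient still flows through it
    return -(mask * ratio * advantages).mean()
```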
Qi et al., "Rethinking the Trust Region in LLM Reinforcement Learning," arXiv, Sea AI Lab, 2026.
Problem 3a — Normalization Bias
GRPO normalizes advantage by group standard deviation: $A_i = (r_i - \mu_G)/\sigma_G$.
Problem
Easy prompt (15/16 correct): $\sigma \approx 0.24$, one failure gets $A \approx -3.9$. Hard prompt (8/16): $\sigma = 0.5$, failure gets $A = -1$. Easy-prompt outliers produce ~4× larger advantages, dominating the batch gradient despite being least informative.
Dr. GRPO fix
Drop $\sigma$. Use $A_i = r_i - \mu_G$.
ScaleRL finding: prompt-level, batch-level, and no $\sigma$ normalization all yield similar performance. Batch-level adopted.
- Prompt-level $\sigma$: each prompt divided by its own $\sigma_G$. Different prompts get different divisors, distorting relative gradient magnitudes across prompts.
- Batch-level $\sigma$: first compute $A_i = r_i - \mu_G$ per prompt (no $\sigma_G$ division), then divide all advantages across the entire batch by one shared std. Same divisor for everyone — just a global scale factor that does not distort relative magnitudes.
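A minimal sketch of the batch-level variant, assuming a `(P, G)` reward tensor (P prompts, G rollouts each); names are illustrative:

```python
import torch

def batch_level_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (P, G) rewards for P prompts x G rollouts each
    centered = rewards - rewards.mean(dim=1, keepdim=True)   # r_i - mu_G per prompt, no sigma_G
    return centered / (centered.std() + eps)                 # one shared std over the whole batch
```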
Liu et al., "Understanding R1-Zero-Like Training: A Critical Perspective," COLM, Sea AI Lab, 2025.
Problem 3b — Loss Aggregation Bias
How tokens are weighted across sequences determines which outputs dominate the gradient.
| Method | Weighting | Bias | Used by |
|---|---|---|---|
| Sample-level ($1/T_i$ per rollout) | Each rollout equal | Short outputs favored | GRPO |
| Fixed-length ($1/T_\text{max}$) | Length-proportional | Short bias removed | Dr. GRPO |
| Token-level ($1/\sum T_i$) | Each token equal | Long outputs favored | DAPO, CISPO |
| Prompt-level (hierarchical avg) | Each problem equal | No length bias | ScaleRL |
ScaleRL verdict: prompt-level aggregation achieves highest asymptotic performance.
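A minimal sketch of the four aggregation modes from the table, assuming per-token losses indexed as `losses[prompt][rollout]`; names are illustrative, and `t_max` here merely stands in for the fixed generation-length constant used by Dr. GRPO:

```python
import torch

def aggregate(losses, mode: str = "prompt") -> torch.Tensor:
    # losses[p][i]: (T_i,) per-token losses for rollout i of prompt p
    flat = [l for prompt in losses for l in prompt]
    if mode == "sample":   # GRPO: 1/T_i per rollout, every rollout counts equally
        return torch.stack([l.mean() for l in flat]).mean()
    if mode == "fixed":    # Dr. GRPO: divide every rollout by the same constant length
        t_max = max(l.numel() for l in flat)   # stand-in for the generation length limit
        return torch.stack([l.sum() / t_max for l in flat]).mean()
    if mode == "token":    # DAPO / CISPO: 1/sum T_i, every token counts equally
        return torch.cat(flat).mean()
    if mode == "prompt":   # ScaleRL: average within each prompt, then prompts equally
        return torch.stack([torch.cat(prompt).mean() for prompt in losses]).mean()
    raise ValueError(mode)
```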
Problem 4 — MoE Token-Level Ratio Instability
MoE models: expert routing changes between $\pi_{\theta_\text{old}}$ and $\pi_\theta$. ~10% of activated experts differ after one gradient update (Qwen3-30B-A3B). Individual token ratios $\rho_t$ fluctuate wildly.
- GSPO (Qwen3): average log ratios across the sequence → one $\rho_i$ per sequence:
$$\rho_i = \exp\!\left(\frac{1}{T_i}\sum_t \log \rho_t\right) \qquad J^\text{GSPO} = \frac{1}{G}\sum_i \min\!\left(\rho_i A_i,\; \text{clip}(\rho_i)A_i\right)$$
- Gradient: $\nabla_\theta J^\text{GSPO} = \frac{1}{G}\sum_i A_i \cdot \rho_i \cdot \frac{1}{T_i}\sum_t \nabla_\theta \log \pi_\theta(a_{i,t})$
- Token-level noise averages out (law of large numbers). One clip decision per sequence.
- Substantially stabilized MoE RL training without Routing Replay.
- Tradeoff: an outlier token can drag the sequence-level ratio out of bounds and suppress the entire sequence's gradient, or conversely be averaged away by the normal tokens.
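A minimal sketch of the sequence-level ratio and clip; names are illustrative, and the clip width shown is nominal rather than the paper's value:

```python
import torch

def gspo_loss(logps, logps_old, advantages, eps: float = 0.2) -> torch.Tensor:
    # logps[i], logps_old[i]: (T_i,) token log-probs of rollout i under pi_theta / pi_theta_old
    # advantages:             (G,)  one group-normalized advantage per rollout
    terms = []
    for lp, lp_old, a in zip(logps, logps_old, advantages):
        seq_ratio = torch.exp((lp - lp_old.detach()).mean())         # exp((1/T_i) sum_t log rho_t)
        unclipped = seq_ratio * a
        clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps) * a   # one clip decision per sequence
        terms.append(torch.min(unclipped, clipped))
    return -torch.stack(terms).mean()
```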
Zheng et al., "Group Sequence Policy Optimization," arXiv, Alibaba Qwen, 2025.
Methods ↔ Problems Summary
| Problem | Methods | Core idea |
|---|---|---|
| Zero-gradient prompts | DAPO, ScaleRL | Ensure every prompt contributes signal |
| Symmetric clip too tight | DAPO | Asymmetric bounds for positive-advantage tokens |
| Gradient death at boundary | CISPO, SAPO | Detach ratio, gradient through $\log \pi_\theta$ |
| Ratio measures wrong thing | DPPO | Divergence-based mask instead of ratio clip |
| $\sigma$ normalization bias | Dr. GRPO | Drop $\sigma$, use $r - \mu$ |
| Length normalization bias | Dr. GRPO | $1/T_\text{max}$ instead of $1/T_i$ |
| Loss aggregation bias | ScaleRL | Prompt-level averaging |
| MoE ratio instability | GSPO | Sequence-level ratio |
Three Structural Axes
Axis 1 — Gradient source
- Raw $\nabla_\theta \log \pi_\theta$: REINFORCE
- Ratio $\rho_t$ (gradient = $\rho_t \cdot \nabla_\theta \log \pi_\theta$): PPO, GRPO, DAPO, GSPO, Dr.GRPO, DPPO
- Hybrid (detached $\rho$ + $\nabla_\theta \log \pi_\theta$): CISPO, SAPO
Axis 2 — Trust region
- None: REINFORCE | Symmetric clip: PPO, GRPO, Dr.GRPO | Asymmetric clip: DAPO
- Clamped detached weight: CISPO | Sigmoid gate: SAPO | Seq-level clip: GSPO | Divergence mask: DPPO
Axis 3 — Advantage / Objective
- GAE per-token (critic): PPO | $(r-\mu)/\sigma$ group-norm: GRPO, DAPO, CISPO, GSPO, SAPO
- $r - \mu$ (no $\sigma$): Dr.GRPO | Batch-level norm: ScaleRL
Part 3
Synthesis:
ScaleRL, Recipes, and Adoption
ScaleRL — The Scaling Framework
- 400k+ GPU-hours of systematic ablations across RL design choices (Meta / UT Austin, 2025).
- Fits sigmoidal curves to predict the asymptotic performance ceiling of an RL recipe from early-stage runs, without training every candidate to convergence.
- Validated with a 100k GPU-hour run to convergence: a fit on the first 50k GPU-hours accurately predicts the remaining trajectory.
Khatri et al., "The Art of Scaling Reinforcement Learning Compute for LLMs," arXiv, Meta / UT Austin, 2025.
ScaleRL Key Findings
Shifts $A$ (the ceiling)
- Loss type: CISPO/GSPO achieve substantially higher $A$ than DAPO.
- FP32 logits in LM head: single largest asymptotic gain.
Shifts $B$ (efficiency) only, not $A$
- Advantage normalization: prompt-level, batch-level, no normalization all yield similar $A$.
- Loss aggregation: prompt-level and token-level yield similar $A$, both ahead of sample-level. Prompt-level more compute-efficient (higher $B$).
- Async RL (PipelineRL): improves $B$ substantially over PPO-off-policy. Same $A$.
The ScaleRL Recipe
One concrete instantiation combining the study's best practices:
| Component | Choice | Note |
|---|---|---|
| Loss | CISPO | Detached clamped ratio × $\nabla_\theta \log \pi_\theta$ |
| Aggregation | Prompt-level | No length bias |
| Advantage | Batch-level norm | No $\sigma$ normalization |
| Precision | FP32 logits | At LM head |
| Data | Zero-variance filter | Retire easy prompts (pass rate ≥ 0.9) |
| Infra | Async RL | PipelineRL, $k=8$ |
Which Models Used What?
| Model | PO Method | Post-training pipeline |
|---|---|---|
| DeepSeek-R1 (671B MoE, 37B active; Jan 2025) | GRPO | (1) Cold-start SFT → (2) reasoning RL → (3) rejection sampling SFT → (4) alignment RL. |
| DeepSeek-R1-Distill (1.5B–70B; Jan 2025) | — | No direct RL. Distillation of R1's reasoning outputs into Qwen2.5 and Llama3 base models via SFT. "We believe that applying RL to the distilled models would yield significant further improvements, which we leave for future work." |
| Qwen3 flagship (235B-A22B; Apr 2025) | GSPO | (1) Long-CoT cold-start SFT → (2) reasoning RL → (3) thinking mode fusion SFT (learn to toggle thinking/non-thinking) → (4) general RL (20+ task domains). |
| Qwen3 small (0.6B–14B; Apr 2025) | — | No direct RL. Strong-to-weak distillation from flagship: (1) off-policy distillation → (2) on-policy distillation. "Distillation from advanced teacher models significantly outperforms reinforcement learning in performance and training efficiency." |
| MiniMax-M1 (456B MoE, 45.9B active; Jun 2025) | CISPO | (1) Continual pretraining (reasoning-intensive) → (2) cold-start SFT → (3) CISPO RL. |
| MiniMax-M2.5 (230B MoE, 10B active; Feb 2026) | CISPO | CISPO RL at scale across 200k+ real-world environments (code, office, web). Process reward for long-horizon credit assignment. Detailed pipeline not disclosed. |
| GLM-5 (744B MoE, 40B active; Feb 2026) | Undisclosed | (1) SFT → (2) reasoning RL → (3) agentic RL → (4) general RL. |
Training Pipeline Context
Policy Optimization is one stage in a multi-stage pipeline. Common pattern:
SFT → Reasoning RL → (optional stages) → General RL
- Trend: later models expand RL scope. R1 and Qwen3 focus on reasoning RL. GLM-5 adds a dedicated agentic RL stage. MiniMax-M2.5 scales RL across 200k+ real-world environments (code, office, web).
- Small models skip RL entirely. Both DeepSeek-R1-Distill and Qwen3 small use distillation from flagship instead. Qwen3 explicitly claims distillation outperforms RL for smaller models; DeepSeek-R1 reports it for the Qwen2.5-32B case.
- NVIDIA's AceReason-Nemotron shows distillation → RL works in small models: GRPO on top of DeepSeek-R1-Distill-Qwen-7B yields +14.6pp on AIME 2025 and +6.8pp on LiveCodeBench (7B, math-only RL).
Takeaways
- One equation underlies everything: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta \cdot A]$. All methods are strategies for making this gradient more accurate, lower variance, or more stable.
- Critic-free training won for RLVR. Per-token credit (PPO/GAE) lost to simplicity + scale.
- Softer trust regions beat hard clipping. CISPO's "never kill gradient" outperforms PPO-style masking and is most robust to hyperparameters (ScaleRL).
- Not all recipes scale equally. Loss type and FP32 precision shift the ceiling $A$. Most other choices only modulate efficiency $B$.
- $\mathcal{L}(\theta)$: one scalar, one chain rule. Everything since slide 2 is about choosing the right scalar.