SRPO: A Game-Changer for Efficient Reinforcement Learning in LLMs
The recent breakthroughs in reinforcement learning (RL) for large language models (LLMs), exemplified by OpenAI's o1 series and DeepSeek-R1, have shown immense potential for enhancing reasoning abilities. However, the standard GRPO (Group Relative Policy Optimization) method faces significant hurdles: performance bottlenecks, inefficient sample use, and difficulty cultivating specialized reasoning when mixing different data types like math and code. Enter SRPO (Two-Staged history-Resampling Policy Optimization) from Kuaishou's Kwaipilot team. This novel framework tackles these challenges head-on, achieving DeepSeek-R1-Zero-level performance in both math and code domains with just one-tenth of the training steps. Here's how it works.
What are the main challenges with standard GRPO training for reasoning models?
Standard GRPO training encounters two primary obstacles when applied to mixed-domain reasoning tasks. First, cross-domain optimization conflicts emerge between math and code problems. Mathematical data naturally elicits longer, more detailed reasoning trajectories (Long CoT), while code data does not, so directly mixing the two leads to suboptimal performance in both domains. Second, there is the problem of similar group rewards. GRPO computes each rollout's advantage by normalizing its reward against the mean and standard deviation of the rewards in its sampled group. When all rollouts in a group yield nearly identical rewards, the advantage collapses toward zero and the group contributes almost no gradient. This drastically reduces training efficiency, especially when a large portion of the batch suffers from the issue, and the model struggles to reach R1-Zero-level performance despite high computational cost.
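To make the similar-reward failure concrete, here is a minimal sketch of a GRPO-style group advantage computation (an illustrative Python snippet based on the definitions above, not the Kwaipilot implementation; the function name and epsilon threshold are assumptions):

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        # GRPO-style advantage: normalize each rollout's reward by the
        # mean and standard deviation of its sampled group.
        rewards = np.asarray(rewards, dtype=np.float64)
        std = rewards.std()
        if std < eps:  # all rollouts scored (nearly) the same
            return np.zeros_like(rewards)  # zero advantage, no gradient signal
        return (rewards - rewards.mean()) / std

    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes: useful signal
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # identical rewards: all zeros

When the second case dominates a batch, most of the compute spent on rollouts produces no policy update, which is exactly the inefficiency described above.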

How does SRPO address cross-domain optimization conflicts between math and code?
SRPO employs a two-staged training approach to resolve cross-domain conflicts. The first stage focuses on training the model on mathematical reasoning data alone, allowing it to develop strong long-chain-of-thought (Long CoT) capabilities. Once this foundation is solid, the second stage introduces code data through a specialized resampling technique. This staged methodology prevents the conflicting tendencies of math (which favors lengthy reasoning) and code (which is less inclined toward it) from interfering with each other. By isolating each domain's influence at different phases, SRPO ensures that the model can specialize in both areas without compromise. The resampling mechanism further enhances sample efficiency, enabling the model to learn robust reasoning patterns from mixed-domain datasets without the performance degradation seen in vanilla GRPO.
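As a rough illustration of the staged data schedule (a minimal sketch with placeholder prompt pools and step counts, not the actual Kwaipilot training code), the switch from math-only to mixed data might look like this:

    import random

    MATH_PROMPTS = [f"math-{i}" for i in range(100)]
    CODE_PROMPTS = [f"code-{i}" for i in range(100)]

    def sample_batch(step, stage1_steps, batch_size=4):
        # Stage 1 (step < stage1_steps): math only, to build Long CoT.
        # Stage 2: math and code mixed, once the reasoning foundation is stable.
        pool = MATH_PROMPTS if step < stage1_steps else MATH_PROMPTS + CODE_PROMPTS
        return random.sample(pool, k=batch_size)

    print(sample_batch(step=50, stage1_steps=200))   # stage 1: math prompts only
    print(sample_batch(step=350, stage1_steps=200))  # stage 2: math and code mixed

In the real system, the second stage is also where the history-resampling filter described below comes into play.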
What is the "similar group rewards" issue and how does SRPO tackle it?
The similar group rewards problem arises when rollouts within a sampled group produce nearly identical reward values. In standard GRPO, each rollout's advantage is its reward normalized by the group's mean and standard deviation, so if all rewards in a group are nearly the same, the advantage shrinks to zero. This leads to vanishing gradient contributions for a large portion of the training batch, severely hindering learning progress. SRPO counters this with a history-resampling technique that dynamically adjusts which samples form groups. Instead of using fixed groups, SRPO periodically re-samples from a history buffer, ensuring greater diversity in reward values within each group. This maintains sufficient variance for meaningful advantage calculations, keeping gradients effective. The two-staged training also helps: by first stabilizing math reasoning, the model produces more varied reward signals when code data is introduced later, further mitigating the issue.
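Below is a minimal sketch of one way such a history-based filter could work, keeping only prompts whose recent rollout rewards still vary (the data structure, threshold, and retention rule are illustrative assumptions; SRPO's exact rule may differ, for example by deliberately retaining hard prompts that are currently all incorrect):

    def resample_pool(reward_history, min_variance=1e-6):
        # reward_history maps prompt -> rewards from its recent rollouts.
        # Prompts whose rollouts all score the same would yield near-zero
        # advantage, so they are dropped from the next sampling pool.
        kept = []
        for prompt, rewards in reward_history.items():
            mean = sum(rewards) / len(rewards)
            variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
            if variance > min_variance:
                kept.append(prompt)
        return kept

    history = {
        "solved every time": [1.0, 1.0, 1.0, 1.0],   # no signal
        "failed every time": [0.0, 0.0, 0.0, 0.0],   # no signal
        "mixed outcomes":    [1.0, 0.0, 1.0, 0.0],   # informative
    }
    print(resample_pool(history))  # ['mixed outcomes']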
What benchmark results did SRPO achieve compared to DeepSeek-R1-Zero?
SRPO delivered outstanding performance on key benchmarks. Using the same base model as DeepSeek (Qwen2.5-32B) and pure reinforcement learning training, SRPO scored 50 on AIME24 (a challenging math benchmark) and 41.6 on LiveCodeBench (a coding benchmark). These numbers surpass the DeepSeek-R1-Zero-32B model's performance on the same tests. Remarkably, SRPO achieved these results while requiring only one-tenth of the training steps used by R1-Zero. This marks the first instance where a model has reached R1-Zero-level proficiency in both mathematical and coding domains simultaneously, and it did so at a significantly lower computational cost. The efficiency gain is attributed to SRPO's innovative two-staged history-resampling strategy, which eliminates wasteful training iterations and focuses learning on high-value samples.

How does SRPO achieve 10x efficiency with fewer training steps?
SRPO's efficiency stems from two core innovations: staged training and history resampling. By splitting training into distinct phases for math and code, the model avoids the destructive interference that plagues mixed-domain training. This means each training step is more productive. Additionally, the history-resampling mechanism ensures that every group of rollouts maintains sufficient reward variance, keeping the RL signal strong. Vanilla GRPO often wastes steps on groups with near-identical rewards, where learning stalls. SRPO's resampling adaptively discards such low-information groups and prioritizes diverse experiences. Together, these techniques reduce the number of training steps needed by 90% compared to R1-Zero, yet yield superior or equal performance. The result is a reinforcement learning framework that scales economically without sacrificing model capability.
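To see why this translates into fewer steps, consider the fraction of groups in a batch that actually produce a non-zero advantage (a toy calculation with hypothetical reward data, not measurements from the paper):

    import numpy as np

    def effective_fraction(group_rewards_per_batch, eps=1e-6):
        # A group only moves the policy if its rewards have enough variance
        # to yield non-zero GRPO-style advantages.
        useful = [np.std(g) > eps for g in group_rewards_per_batch]
        return sum(useful) / len(useful)

    # Without resampling, many groups are all-correct or all-incorrect.
    vanilla_batch   = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0]]
    # After filtering stale prompts, most groups have mixed outcomes.
    resampled_batch = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 0, 1], [0, 1, 1, 0]]

    print(effective_fraction(vanilla_batch))    # 0.25: most of the batch is wasted
    print(effective_fraction(resampled_batch))  # 1.0: every group contributes

The higher the fraction of informative groups per batch, the more policy improvement each training step delivers, which is how the same performance can be reached in far fewer steps.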
What makes SRPO the first to achieve R1-Zero-level performance in both math and code?
Prior to SRPO, no pure reinforcement learning approach had simultaneously reached DeepSeek-R1-Zero-level performance on both mathematical reasoning (AIME24) and code generation (LiveCodeBench). Models typically excelled in one domain at the expense of the other due to optimization conflicts. SRPO's breakthrough lies in its two-staged design: the initial math-only phase builds a strong foundation for long-chain reasoning, which is then leveraged and adapted for code tasks in the second stage without catastrophic forgetting. The history-resampling technique further ensures that the RL training remains stable and efficient across domains. Published as a technical report by Kuaishou's Kwaipilot team, SRPO is also open-sourced (SRPO-Qwen-32B), allowing the community to verify and build upon this achievement. This marks a significant step toward practical, domain-agnostic reinforcement learning for LLMs.