A component-by-component breakdown of PPO, followed by a walk through GRPO, verifiers, and retrieved token masking, ending with a complete Agentic RL example
23 min read · April 17, 2026
2026 · reinforcement-learning llm reasoning agentic-rl ppo grpo · technical-notes