-
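Post-Training Interview Quick Notes: From Tokenization to Alignment
From MinHash deduplication to BPE/WordPiece/Unigram, hand-derived Softmax, FlashAttention, and RM/PPO/GRPO/DPO, each section provides a Key Insight, an implementation-layer diagram, a code skeleton, LaTeX derivations, and frequently asked follow-up questions.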
-
From PPO to Search-R1: The Design Space of Reasoning and Agentic RL
A component-by-component decomposition that moves from PPO through GRPO, verifiers, and retrieved-token masking, ending with a complete instance of Agentic RL.