Notes

My research notes on Machine Learning, Systems, Efficiency, and Agents

Post-Training Interview Quick Notes: From Tokenization to Alignment

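From MinHash deduplication to BPE/WordPiece/Unigram, a hand derivation of softmax, FlashAttention, and RM/PPO/GRPO/DPO. Each section gives a Key Insight, a layered implementation diagram, a code skeleton, a LaTeX derivation, and frequently asked follow-up questions.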

67 min read · May 17, 2026

2026 · post-training rlhf tokenizer ppo grpo dpo flashattention · notes

From PPO to Search-R1: The Design Space of Reasoning and Agentic RL

A component-by-component decomposition that runs from PPO through GRPO, verifiers, and retrieved-token masking to a complete instance of agentic RL.

57 min read · April 17, 2026

2026 · reasoning agentic-rl