Note: this article was produced with AI assistance
Overview
- In LLM RL research in 2025, using rubrics (structured grading criteria) to build finer-grained, interpretable reward functions has become one of the mainstream trends
- This line of work mainly addresses the problem that a traditional scalar reward provides no fine-grained guidance
- This article summarizes representative papers in the area; detailed readings of each paper are covered elsewhere
RaR
- Original paper: (RaR) Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains, Scale AI, 20251003
- The paper extends the RLVR (reinforcement learning with verifiable rewards) paradigm beyond verifiable domains by using explicit rubrics as the reward signal, showing that on complex reasoning tasks this is more effective than plain human-preference rewards and significantly improves the correctness of model reasoning (see the sketch below)
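To make the mechanism concrete, below is a minimal sketch (not the paper's implementation) of turning a rubric into a scalar reward: each criterion is judged independently, in practice by an LLM judge, and the weighted pass rate becomes the reward. `Criterion`, `rubric_reward`, and the keyword-matching judge are illustrative names only.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One rubric item with an importance weight."""
    description: str
    weight: float

def rubric_reward(response: str,
                  rubric: List[Criterion],
                  judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubric criteria that the response satisfies.

    `judge(response, criterion_description)` would be an LLM-judge call
    in practice; any boolean function works for the sketch.
    """
    total = sum(c.weight for c in rubric)
    passed = sum(c.weight for c in rubric if judge(response, c.description))
    return passed / total if total > 0 else 0.0

# Toy usage with a keyword check standing in for the LLM judge.
rubric = [
    Criterion("states the final numeric answer", weight=2.0),
    Criterion("shows intermediate reasoning steps", weight=1.0),
]
toy_judge = lambda resp, crit: (
    "answer" in resp.lower() if "numeric" in crit else "step" in resp.lower()
)
print(rubric_reward("Steps: 6*7. Final answer: 42", rubric, toy_judge))  # 1.0
```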
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
- Original blog: DR Tulu: An open, end-to-end training recipe for long-form deep research, 20251118, AI2
- Original paper: DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research, 20251124 & 20251126, AI2
- Targets the deep research setting and proposes an "evolving rubrics" mechanism
- As the model's capabilities improve, the grading criteria are adjusted dynamically, guiding the model to use tools adaptively and solve more complex scientific QA tasks (see the sketch below)
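A minimal sketch of the evolving-rubric loop described above (the names and the `propose` hook are assumptions, not DR Tulu's code): every few steps, low-reward rollouts are fed back to a rubric generator that may add criteria, so the reward keeps pace with the policy.

```python
from typing import Callable, List

def evolve_rubric(rubric: List[str],
                  failures: List[str],
                  propose: Callable[[List[str], List[str]], List[str]],
                  max_size: int = 12) -> List[str]:
    """Refresh the rubric from recent failure cases.

    `propose` stands in for an LLM call that reads the current rubric plus
    failing rollouts and suggests new criteria (an assumption of this sketch).
    """
    new = [c for c in propose(rubric, failures) if c not in rubric]
    return (rubric + new)[-max_size:]  # keep the rubric bounded

# Schematic training loop: the rubric is re-derived as the policy improves.
rubric = ["cites at least one retrieved source", "answers the question directly"]
for step in range(3):  # e.g. every K RL steps in a real run
    failures = [f"low-reward rollout at step {step}"]  # placeholder rollouts
    rubric = evolve_rubric(
        rubric, failures,
        propose=lambda r, f, s=step: [f"avoids the failure mode seen at step {s}"])
print(rubric)
```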
RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
- Original paper: RubricRL: Simple Generalizable Rewards for Text-to-Image Generation, 20251125, Microsoft CoreAI
- A rubric approach for text-to-image generation: proposes a general framework named RubricRL for designing simple, generalizable rubric-based rewards
- The method emphasizes the interpretability and composability of reward design, letting users customize model behavior more flexibly
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
- Original paper: AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following, 20251113 & 20251126, Meta Superintelligence Labs & CMU
- This work releases the AdvancedIF framework, which uses a rubric-based pipeline to improve LLM instruction following
- Rubrics are used not only for evaluation (benchmarking) but also directly in the RL training loop
(Rubicon) Reinforcement Learning with Rubric Anchors
- Original paper: (Rubicon) Reinforcement Learning with Rubric Anchors, 20250818, Inclusion AI & Ant Group & Zhejiang University
- Explores how "rubric anchors" can strengthen LLMs under the RLVR (RL with verifiable rewards) paradigm
- Through the anchor mechanism, the model aligns more stably to the intended fine-grained standards
(RuscaRL) Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
- Original paper: (RuscaRL) Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning, 20250823-20251022, ZJU
- Proposes the RuscaRL framework, which treats rubrics as instructional scaffolding
- The method aims to help the model break through the "exploration bottleneck" on complex tasks, using structured rubrics to guide it step by step toward correct strategies (see the sketch below)
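A minimal sketch of the scaffolding idea (the decay schedule and prompt format are illustrative assumptions): early in training the rubric is shown inside the rollout prompt to guide exploration, and the hint is annealed away so the final policy no longer depends on it.

```python
import random

def scaffolded_prompt(question: str, rubric: list,
                      step: int, total_steps: int,
                      rng: random.Random) -> str:
    """Show the rubric as in-context guidance with a probability that
    decays linearly over training (an illustrative schedule)."""
    p_show = max(0.0, 1.0 - step / total_steps)
    if rng.random() < p_show:
        hints = "\n".join(f"- {c}" for c in rubric)
        return f"{question}\n\nA strong answer should:\n{hints}"
    return question  # late in training: no scaffold, the model explores alone

rng = random.Random(0)
rubric = ["state assumptions explicitly", "verify the result at the end"]
for step in (0, 500, 999):
    print(f"--- step {step} ---")
    print(scaffolded_prompt("Prove that 1+2+...+n = n(n+1)/2.",
                            rubric, step, total_steps=1000, rng=rng))
```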
Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
- Original paper: Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, 20250919, Ant Group
- Proposes a self-rewarding mechanism for open-ended reasoning tasks
- The model scores and critiques its own outputs against preset rubrics, enabling self-iteration and improvement without large-scale external annotation (see the sketch below)
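A minimal sketch of the self-rewarding loop (the grading prompt, the `generate` stub, and the score-parsing convention are assumptions for illustration): the same model that produced the answer is prompted to grade it against the rubric, and the parsed score becomes the RL reward.

```python
import re
from typing import Callable

GRADER_TEMPLATE = """You are grading your own answer.
Rubric:
{rubric}

Answer:
{answer}

For each rubric item reply PASS or FAIL, then end with
"SCORE: <fraction of items passed>" on the last line."""

def self_reward(answer: str, rubric: list,
                generate: Callable[[str], str]) -> float:
    """Ask the policy itself to grade `answer`; parse the trailing score."""
    prompt = GRADER_TEMPLATE.format(
        rubric="\n".join(f"- {c}" for c in rubric), answer=answer)
    reply = generate(prompt)
    m = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    return float(m.group(1)) if m else 0.0  # unparseable grade -> zero reward

# Stub generator standing in for the policy model grading itself.
fake_policy = lambda prompt: "- PASS\n- FAIL\nSCORE: 0.5"
print(self_reward("some open-ended answer",
                  ["is coherent", "cites evidence"], fake_policy))  # 0.5
```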
RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
- Original paper: RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks, 20251103, SJTU & UC Berkeley
- Proposes RLAC, an RL method that pairs the policy with an adversarial critic and uses dynamically generated rubric checks to cope with challenges arising during training; it is a post-training optimization strategy (see the sketch below)
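A minimal sketch of the adversarial dynamic (the critic/verifier stubs and the zero-sum bookkeeping are illustrative assumptions): instead of scoring every rubric item, a critic proposes the single check the current response is most likely to fail, a verifier evaluates only that check, and the outcome rewards the policy and critic in opposite directions.

```python
from typing import Callable, List, Tuple

def rlac_style_reward(response: str,
                      candidate_checks: List[str],
                      critic: Callable[[str, List[str]], str],
                      verify: Callable[[str, str], bool]) -> Tuple[float, float]:
    """Adversarial reward: the critic picks its best attack, an external
    verifier decides it, and policy/critic receive opposing rewards."""
    check = critic(response, candidate_checks)   # the critic's chosen attack
    ok = verify(response, check)                 # external verification of one check
    policy_reward = 1.0 if ok else 0.0
    critic_reward = 1.0 - policy_reward          # critic wins if the check fails
    return policy_reward, critic_reward

# Toy stubs: the critic always attacks the last check; the verifier does a
# keyword test standing in for a real fact/spec checker.
critic = lambda resp, checks: checks[-1]
verify = lambda resp, check: ("km" in resp) if "units" in check else True
print(rlac_style_reward("The distance is 42 km.",
                        ["is fluent", "mentions units"], critic, verify))  # (1.0, 0.0)
```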
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
- Original paper: Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling, 20251020
- Constructs static rubrics
(Self-Rewarding Rubrics) Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
- Original paper: Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, 20250919 (same paper as the Ant Group entry above)
- Uses the policy model itself as the rubric generator
QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning
- Original paper: AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning, 20251016, University of Notre Dame
- Rubrics for multimodal reasoning
(DeepSeek-GRM) Inference-Time Scaling for Generalist Reward Modeling
- Original paper: (DeepSeek-GRM) Inference-Time Scaling for Generalist Reward Modeling, DeepSeek & THU, 20250403-20250925
- Introduces the DeepSeek-GRM model, a pointwise generative reward model (GRM); model weights: huggingface.co/collections/BBQGOD/deepseek-grm (see the sketch below)
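The inference-time scaling idea can be sketched as sampling the generative RM several times and aggregating the extracted scores; in the sketch below the generation call is stubbed out, and the "SCORE: x/10" output format is an assumed convention, not DeepSeek-GRM's actual output schema.

```python
import re
import statistics
from typing import Callable

def scaled_pointwise_score(prompt: str, response: str,
                           generate: Callable[[str], str],
                           k: int = 4) -> float:
    """Sample k critiques from a generative RM and average the parsed scores."""
    scores = []
    for _ in range(k):
        critique = generate(
            f"Critique the response to:\n{prompt}\n\n"
            f"Response:\n{response}\n\nEnd with SCORE: <0-10>/10")
        m = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*10", critique)
        if m:
            scores.append(float(m.group(1)) / 10.0)
    return statistics.mean(scores) if scores else 0.0

# Stub GRM returning a fixed critique; a real run would sample with
# temperature > 0 so that the k critiques differ.
stub_grm = lambda p: "The response is correct but terse. SCORE: 7/10"
print(scaled_pointwise_score("What is 6*7?", "42", stub_grm))  # 0.7
```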