2026.02.23 | VESPO Steadies Off-Policy RL; Reasoning Models Learn When to Stop

10 minutes

【Sponsor】

Listen to AI每周谈 (AI Weekly Talk) on your commute. Each week, it recaps the previous week's major AI news.

Link 🔗www.xiaoyuzhoufm.com

【Contents】

This episode covers the following 10 papers:

00:40 ⚖ VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

01:45 💭 Does Your Reasoning Model Implicitly Know When to Stop Thinking?

02:44 🎮 Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

03:24 🤖 EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

04:11 🤖 SARAH: Spatially Aware Real-time Agentic Humans

05:05 🎬 VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

05:51 ✂ Sink-Aware Pruning for Diffusion Language Models

06:36 🎯 Selective Training for Large Vision Language Models via Visual Information Gain

07:18 🧮 DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

08:16 🤖 Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

【Follow Us】

You can also find us on the following platforms for more beyond the podcast:

Xiaohongshu: AI速递