2026.01.09 | GDPO Decouples Rewards for Multi-Task Optimization; Learnable Multipliers Unlock Matrix Scale

10 minutes

This episode covers the following 15 papers:

00:21 📈 GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

01:05 ⚖ Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

01:33 🌙 RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

02:07 🤖 RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

02:56 🤝 RelayLLM: Efficient Reasoning via Collaborative Decoding

03:31 🌲 AT²PO: Agentic Turn-based Policy Optimization via Tree Search

04:24 🤔 VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

04:57 🎬 VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

05:34 🔍 The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

06:09 🎯 Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

06:40 🎥 Plenoptic Video Generation

07:12 ⚖ Agent-as-a-Judge

07:43 📄 DocDancer: Towards Agentic Document-Grounded Information Seeking

08:20 🧠 Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

09:05 🧠 DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

[Follow Us]

You can also find us on the following platforms for more beyond the podcast:

Xiaohongshu (小红书): AI速递