2026.05.20 | Anti-Self-Distillation Optimizes Reasoning; Verifiable Environments Evaluate Agents

14 minutes

[Contents]
The 15 papers in this episode are:
[00:24] 🧠 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
[01:08] 🖥 OpenComputer: Verifiable Software Worlds for Computer-Use Agents
[01:53] 🧠 GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
[02:49] 🔬 Process Rewards with Learned Reliability
[03:44] 🤖 AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
[04:48] 🎭 When Vision Speaks for Sound
[05:50] 🏭 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
[06:45] 🎬 CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
[07:40] 🎯 Active Learners as Efficient PRP Rerankers (PRP: pairwise ranking prompting)
[08:24] 🎥 Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
[09:14] 🎬 Aurora: Unified Video Editing with a Tool-Using Agent
[10:12] 🎯 CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization (RLVR: reinforcement learning with verifiable rewards)
[11:01] 📱 OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
[11:51] 🎬 MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
[12:44] 🎥 Video Models Can Reason with Verifiable Rewards

[Follow Us]
You can also find us on the following platforms for more information beyond the podcast:
Xiaohongshu (小红书): AI速递