2026.04.08 | Video-MME-v2 grills models with a hell-level question bank; Claw-Eval guards trustworthy agents with full-process auditing

12 minutes

【Sponsor】

Listen to AI每周谈 on your commute. Every week, AI每周谈 recaps the previous week's major AI news.

Link 🔗www.xiaoyuzhoufm.com

【Contents】

The 15 papers covered in this episode:

00:34 🎯 Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

01:19 🔬 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

02:06 🤖 Learning to Retrieve from Agent Trajectories

02:53 🧪 ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

03:42 👗 Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

04:31 ⏱ Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

05:23 🧠 ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

06:03 🔍 Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

06:52 🔍 How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

07:33 🚀 MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

08:11 🛠 DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

08:54 🧠 In-Place Test-Time Training

09:39 🎬 Watch Before You Answer: Learning from Visually Grounded Post-Training

10:13 🔍 Demystifying When Pruning Works via Representation Hierarchies

10:59 🤖 Action Images: End-to-End Policy Learning via Multiview Video Generation

【Follow Us】

You can also find us on the following platform for more beyond the podcast:

Xiaohongshu: AI速递