2026.04.08 | Video-MME-v2 grills models with a hellishly hard question bank; Claw-Eval audits every step to safeguard trustworthy agents - HuggingFace Daily AI Paper Digest

【Sponsor】

Listen to AI每周谈 (AI Weekly Talk) on your commute. Every week it recaps the previous week's major AI news.

【Contents】

The 15 papers in this episode:

00:34 🎯 Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

01:19 🔬 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

02:06 🤖 Learning to Retrieve from Agent Trajectories

02:53 🧪 ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

03:42 👗 Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

04:31 ⏱ Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

05:23 🧠 ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

06:03 🔍 Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

06:52 🔍 How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

07:33 🚀 MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

08:11 🛠 DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

08:54 🧠 In-Place Test-Time Training

09:39 🎬 Watch Before You Answer: Learning from Visually Grounded Post-Training

10:13 🔍 Demystifying When Pruning Works via Representation Hierarchies

10:59 🤖 Action Images: End-to-End Policy Learning via Multiview Video Generation

【Follow Us】

You can also find us on the following platforms for more information beyond the podcast:

Xiaohongshu (小红书): AI速递