2026.01.23 | BayesianVLA pushes models to "read minds"; diffusion models get smarter by going "in order" - HuggingFace Daily AI Paper Digest

【Sponsor】

Listen to AI Weekly Talk (AI每周谈) on your commute. Each week, AI Weekly Talk recaps the previous week's major AI news.

【Contents】

The 15 papers in this episode are:

00:32 🤖 BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

01:22 ⚠ The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

02:26 🎥 HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

03:14 🚀 EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

04:02 🧪 LLM-in-Sandbox Elicits General Agentic Intelligence

04:54 🚀 Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

05:34 🎭 SAMTok: Representing Any Mask with Two Words

06:30 🚀 Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

07:23 🔬 Learning to Discover at Test Time

08:08 🔍 Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

09:06 ⚙ Towards Automated Kernel Generation in the Era of LLMs

09:48 🔄 OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

10:45 💻 Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

11:29 🗣 Qwen3-TTS Technical Report

12:13 🤖 Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

【Follow Us】

You can also find us on the following platforms for more information beyond the podcast:

Xiaohongshu (小红书): AI速递