2025.11.21 | V-ReasonBench tests video model reasoning; Step-Audio-R1 makes speech models stronger the more they "think" - HuggingFace Daily AI Paper Digest

The 15 papers in this episode are:

00:22 📊 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

01:06 🧠 Step-Audio-R1 Technical Report

01:48 🧭 Scaling Spatial Intelligence with Multimodal Foundation Models

02:18 🎬 First Frame Is the Place to Go for Video Content Customization

02:49 🎬 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

03:29 🔮 SAM 3D: 3Dfy Anything in Images

04:03 🚀 MiMo-Embodied: X-Embodied Foundation Model Technical Report

04:38 🧠 Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

05:10 🏆 TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

05:53 🌀 Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

06:26 🚀 SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

07:09 🎬 TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

07:46 🔬 SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

08:23 🎨 NaTex: Seamless Texture Generation as Latent Color Diffusion

08:58 📐 PartUV: Part-Based UV Unwrapping of 3D Meshes

[Follow Us]

You can also find us on the following platform for more information beyond the podcast:

Xiaohongshu: AI速递