2026.01.07 | Arbitrary-resolution depth sampling; end-to-end speech transcription and diarization

12 minutes · 112 plays · 1 comment

The 15 papers in this episode are:

00:25 🔍 InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

01:07 🎙 MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

01:46 🔬 SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

02:32 🎬 LTX-2: Efficient Joint Audio-Visual Foundation Model

03:26 🦄 UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

04:06 🎨 DreamStyle: A Unified Framework for Video Stylization

04:38 🧠 CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

05:25 ⚡ MiMo-V2-Flash Technical Report

06:15 🎮 NitroGen: An Open Foundation Model for Generalist Gaming Agents

06:58 🤖 SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

07:43 🛡 OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

08:31 📍 The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization

09:14 🔍 X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

09:57 🧠 Parallel Latent Reasoning for Sequential Recommendation

10:27 🤖 WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

[Follow us]

You can also find us on the following platform for more beyond the podcast:

Xiaohongshu: AI速递

Note to self: read the eighth and fourteenth papers.