

2026.05.21 | Mega-ASR cuts noise and hallucinations; Video2GUI data makes pretraining more efficient
Contents: the 15 papers in this episode
[00:23] 🎤 Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
[01:22] 🎬 Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
[02:11] 🎬 Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
[03:04] 🚀 You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
[03:50] 🗜 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
[04:39] 🔧 IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
[05:36] 🔊 A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
[06:35] 🤝 It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
[07:26] 📈 Toto 2.0: Time Series Forecasting Enters the Scaling Era
[08:20] ⚡ Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
[09:25] 🧠 Generative Recursive Reasoning
[10:29] 🎬 CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
[11:22] 🖼 Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
[12:08] 🧠 LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
[13:07] ⚡ HRM-Text: Efficient Pretraining Beyond Scaling
Follow us: you can also find us on the following platform for more information beyond the podcast. Xiaohongshu: AI速递

2026.05.20 | Anti-self-distillation improves reasoning; verifiable environments evaluate agents
Contents: the 15 papers in this episode
[00:24] 🧠 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
[01:08] 🖥 OpenComputer: Verifiable Software Worlds for Computer-Use Agents
[01:53] 🧠 GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
[02:49] 🔬 Process Rewards with Learned Reliability
[03:44] 🤖 AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
[04:48] 🎭 When Vision Speaks for Sound
[05:50] 🏭 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
[06:45] 🎬 CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
[07:40] 🎯 Active Learners as Efficient PRP Rerankers
[08:24] 🎥 Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
[09:14] 🎬 Aurora: Unified Video Editing with a Tool-Using Agent
[10:12] 🎯 CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
[11:01] 📱 OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
[11:51] 🎬 MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
[12:44] 🎥 Video Models Can Reason with Verifiable Rewards

2026.05.19 | Long video generation gets faster with less GPU memory; a lightweight multimodal model outperforms larger-parameter models
Contents: the 15 papers in this episode
[00:23] 🎬 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
[01:17] 🎨 Lance: Unified Multimodal Modeling by Multi-Task Synergy
[02:24] 🤖 AI for Auto-Research: Roadmap & User Guide
[03:26] 🛠 SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
[04:20] 🎬 KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
[05:18] 🏠 Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis
[06:15] 🤖 OProver: A Unified Framework for Agentic Formal Theorem Proving
[07:14] ⚡ Post-Trained MoE Can Skip Half Experts via Self-Distillation
[07:57] 🎥 LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
[08:47] 🛑 Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
[09:42] 🔀 Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
[10:39] 🧠 Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
[11:40] 🛡 StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
[12:43] ⚡ CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
[13:41] 🧪 From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

2026.05.18 | Distilling physical common sense from human videos; document QA must cite the source text
Contents: the 15 papers in this episode
[00:23] 🧠 PhysBrain 1.0 Technical Report
[00:56] 🔍 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
[01:45] 🤖 MMSkills: Towards Multimodal Skills for General Visual Agents
[02:35] 👗 FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
[03:20] 🦾 DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
[04:19] 🔮 Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
[04:54] 🖼 InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
[05:48] 🧠 Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
[06:44] ⚡ Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
[07:29] 🧭 Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
[08:10] 🎮 ReactiveGWM: Steering NPC in Reactive Game World Models
[08:46] ⚖ Hölder Policy Optimisation
[09:36] 🧠 Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
[10:22] 🌐 CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
[11:05] 🎯 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

[Weekend Special] Hottest AI papers of the 3rd week of May | MinT attaches millions of LoRAs to a base model in seconds; MV-Split breaks through mean mode screaming in 1000-layer DiTs
Contents: the 5 papers in this episode
[00:44] TOP1 (🔥211) | 🏗 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
[02:56] TOP2 (🔥183) | 🧠 Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
[04:44] TOP3 (🔥171) | 🧠 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
[06:36] TOP4 (🔥145) | 🥇 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
[08:20] TOP5 (🔥141) | 🔒 MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

2026.05.15 | A 30B model reaches olympiad gold; self-distillation supercharges a small 3B model with no external add-ons
Contents: the 15 papers in this episode
[00:23] 🥇 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
[01:00] 🤖 Self-Distilled Agentic Reinforcement Learning
[01:46] 🧠 MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
[02:57] 👁 MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
[04:00] 🎬 SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
[04:43] 🎬 Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
[05:21] 🧬 Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
[06:19] 🐾 WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
[07:11] 🧠 STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
[08:03] 🧠 Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
[08:44] 🎥 Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
[09:24] 🧠 PREPING: Building Agent Memory without Tasks
[10:04] 🧭 RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
[10:49] 🧠 EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
[11:28] 🧠 ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

2026.05.14 | MinT tackles the LLM scaling problem with LoRA patches; on MulTaBench's aligned text-and-image tasks, small models beat large ones
Contents: the 15 papers in this episode
[00:25] 🏗 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
[01:08] 📊 MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
[02:14] 🎬 AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
[03:02] 📚 Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
[03:48] 🤖 Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
[04:27] 🖼 Qwen-Image-VAE-2.0 Technical Report
[05:05] 🎨 Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
[06:01] 🎯 TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
[06:57] 🧠 Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
[07:58] 🎯 FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
[08:52] 🌅 The DAWN of World-Action Interactive Models
[09:43] 🌊 Asymmetric Flow Models
[10:24] 🤖 Learning Agentic Policy from Action Guidance
[11:23] 💻 Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
[12:13] 🎬 PresentAgent-2: Towards Generalist Multimodal Presentation Agents

2026.05.13 | Natively unified understanding and generation of images; privacy-preserving memory management at the edge
Contents: the 15 papers in this episode
[00:23] 🧠 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
[01:10] 🔒 MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
[01:59] 🧠 δ-mem: Efficient Online Memory for Large Language Models
[02:43] 🤖 RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
[03:33] 🤖 World Action Models: The Next Frontier in Embodied AI
[04:22] 🤖 AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
[05:09] 🧩 Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
[06:12] 🛠 ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
[06:51] 🏭 Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
[07:52] 🎨 L2P: Unlocking Latent Potential for Pixel Generation
[08:33] 🎬 CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
[09:18] 🔍 Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
[10:18] 💻 Teaching Language Models to Think in Code
[10:58] 🛡 On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
[11:52] 🌌 MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

2026.05.12 | Mathematicians' privately written problems stump large models; an image generation model renders thousand-word prompts accurately
Contents: the 15 papers in this episode
[00:25] 🧮 Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
[01:30] 🎨 Qwen-Image-2.0 Technical Report
[02:23] 🎥 CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
[03:08] 🧠 TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
[03:52] 📄 PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
[04:34] 📈 Model Merging Scaling Laws in Large Language Models
[05:19] 🧩 Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
[06:20] 🌍 WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
[07:12] 📊 Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
[08:03] 🤖 X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
[08:51] 🧠 Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
[09:35] 🔄 SEIF: Self-Evolving Reinforcement Learning for Instruction Following
[10:19] 🔄 Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
[11:10] 🎨 Pixal3D: Pixel-Aligned 3D Generation from Images
[11:54] 🔄 Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

2026.05.11 | Splitting experts for music-driven dance; flow-matching distillation yields an all-round top performer
Contents: the 15 papers in this episode
[00:29] 💃 MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
[01:07] 🎯 Flow-OPD: On-Policy Distillation for Flow Matching Models
[01:58] 🎯 Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
[02:52] 🔍 HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
[03:37] 🤖 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
[04:20] 🎥 HumanNet: Scaling Human-centric Video Learning to One Million Hours
[05:09] 🧠 Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
[06:06] 🔍 Beyond Retrieval: A Multitask Benchmark and Model for Code Search
[07:06] 🧩 Anisotropic Modality Align
[07:58] 🤖 AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
[08:49] 📜 TextLDM: Language Modeling with Continuous Latent Diffusion
[09:41] 🧠 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
[10:25] 🎬 A²RD: Agentic Autoregressive Diffusion for Long Video Consistency
[11:04] 🛡 DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
[11:52] 🔍 MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

[Weekend Special] Hottest AI papers of the 2nd week of May | MolmoAct2 open-sources a robot brain; learning hidden rules from long context through Werewolf-style self-play
Contents: the 5 papers in this episode
[00:33] TOP1 (🔥266) | 🤖 MolmoAct2: Action Reasoning Models for Real-world Deployment
[03:10] TOP2 (🔥145) | 🧠 From Context to Skills: Can Language Models Learn from Context Skillfully?
[05:03] TOP3 (🔥117) | 🎥 Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
[07:22] TOP4 (🔥101) | 🤖 RLDX-1 Technical Report
[09:45] TOP5 (🔥99) | 🤖 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

2026.05.08 | A global sketch helps long-context understanding; a skill library lets agents evolve
Contents: the 15 papers in this episode
[00:23] 🧠 MiA-Signature: Approximating Global Activation for Long-Context Understanding
[01:32] 🧬 Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
[02:14] 🎯 MARBLE: Multi-Aspect Reward Balance for Diffusion RL
[03:08] 🤖 When to Trust Imagination: Adaptive Action Execution for World Action Models
[04:06] 🧠 Continuous Latent Diffusion Language Model
[04:50] 🏆 RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
[05:36] 🧠 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
[06:13] ⚡ Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
[06:48] 🎬 Audio-Visual Intelligence in Large Foundation Models
[07:24] 🤖 Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
[08:12] 🤖 A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
[09:12] 🧩 UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
[09:58] 🧠 SkillOS: Learning Skill Curation for Self-Evolving Agents
[10:49] 🚗 ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
[11:46] 📊 TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

2026.05.07 | Reward distillation teaches pixels to "pick out what matters"; test-time scaling stabilizes long videos chunk by chunk
Contents: the 15 papers in this episode
[00:24] 🎥 Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
[01:27] 🎥 Stream-T1: Test-Time Scaling for Streaming Video Generation
[02:06] 🔍 OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
[03:07] 🤖 RLDX-1 Technical Report
[04:06] 🚗 HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
[04:50] ⚙ PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
[05:40] 🎨 D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
[06:38] 🔍 Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
[07:46] ⚡ Lightning Unified Video Editing via In-Context Sparse Attention
[08:38] 🧠 Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
[09:27] 🎯 Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
[10:11] 🎵 APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
[10:54] 🧠 ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
[11:47] 🧩 Diffusion Model as a Generalist Segmentation Learner
[12:26] 🔬 MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

2026.05.06 | ARIS writes papers by arguing with itself; PRISM cleans data in three stages before RL
Contents: the 15 papers in this episode
[00:25] 🤖 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
[00:59] 🎯 Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
[01:54] 🔍 OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
[02:42] 🎯 X2SAM: Any Segmentation in Images and Videos
[03:23] 🧠 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
[04:23] 🎬 Video Generation with Predictive Latents
[05:05] 📜 PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination
[05:45] 🎨 SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
[06:31] 📂 Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
[07:28] 🤒 SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
[08:11] 🤖 Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
[08:44] 🧩 SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
[09:39] 🌍 A Benchmark for Interactive World Models with a Unified Action Generation Framework
[10:25] 🔄 The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
[11:12] 💬 TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis

2026.05.05 | Open-source MolmoAct2 hits an 87% success rate in real-world tests; skills distilled from context give GPT a further upgrade
Contents: the 14 papers in this episode
[00:21] 🤖 MolmoAct2: Action Reasoning Models for Real-world Deployment
[01:02] 🧠 From Context to Skills: Can Language Models Learn from Context Skillfully?
[01:44] 🔁 Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
[02:35] 👁 Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
[03:18] 🌊 OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
[03:56] 🧩 ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
[04:44] 🎓 AcademiClaw: When Students Set Challenges for AI Agents
[05:25] 🏥 PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
[06:06] 🤖 T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
[07:04] 🌳 Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
[07:54] 🌌 Generative Modeling with Orbit-Space Particle Flow Matching
[08:30] 🧠 Perceptual Flow Network for Visually Grounded Reasoning
[09:06] 🎬 Motion-Aware Caching for Efficient Autoregressive Video Generation
[09:54] 🤖 Code World Model Preparedness Report