

2025.12.11 | StereoWorld turns monocular footage into stereo blockbusters; BiCo collages new concepts across domains. This episode covers the following 15 papers:
[00:22] 🎥 StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
[00:59] 🎨 Composing Concepts from Images and Videos via Concept-prompt Binding
[01:43] 🧠 BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
[02:20] 🎨 OmniPSD: Layered PSD Generation with Diffusion Transformer
[03:05] 🚀 InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
[03:47] ⚡ Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
[04:31] 🚗 UniUGP: Unifying Understanding, Generation, and Planning For End-to-end Autonomous Driving
[05:06] 🧠 EtCon: Edit-then-Consolidate for Reliable Knowledge Editing
[05:56] 🤖 HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
[06:46] 🔍 WonderZoom: Multi-Scale 3D World Generation
[07:23] 🤖 Learning Unmasking Policies for Diffusion Language Models
[07:53] 🔭 IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
[08:51] ⚡ Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS
[09:31] 🎬 VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
[10:16] 🛡 Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
【Follow Us】 You can also find us on the following platforms for more content beyond the podcast. Xiaohongshu: AI速递
2025.12.10 | Latent trajectories control motion; real-time Gaussian splatting on WebGPU. This episode covers the following 15 papers:
[00:24] 🎬 Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
[00:55] 🚀 Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
[01:32] 🎬 Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
[02:13] 🎬 OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
[02:49] ⚡ ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
[03:45] 🤖 MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
[04:47] 🚀 Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
[05:18] 🌲 TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
[05:55] 🚀 From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
[06:30] 📊 EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
[07:02] 🧩 Modular Neural Image Signal Processing
[07:33] 🧭 Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
[08:16] 🤖 DeepCode: Open Agentic Coding
[08:48] 🎯 TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
[09:30] 🎬 Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
2025.12.09 | Parallel self-distillation delivers a 4.6x speedup; an imaginary-part extension of RoPE doubles up on long-context gains. This episode covers the following 15 papers:
[00:20] ⚡ Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
[01:04] 🧠 Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
[01:54] 🎬 Unified Video Editing with Temporal Reasoner
[02:33] 🔍 DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
[03:24] 🎮 Voxify3D: Pixel Art Meets Volumetric Rendering
[04:07] 🎬 Scaling Zero-Shot Reference-to-Video Generation
[04:39] 🧬 Distribution Matching Variational AutoEncoder
[05:12] 🔭 Multi-view Pyramid Transformer: Look Coarser to See Broader
[05:47] 🎬 EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
[06:25] 🖼 LongCat-Image Technical Report
[06:50] 🎬 UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
[07:36] 🔗 Relational Visual Similarity
[08:13] 🔬 On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
[08:57] 🎥 ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
[09:30] 🚀 Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning
2025.12.08 | Self-adversarial flows enable one-step generation; a plug-in critic drives iterative editing. This episode covers the following 15 papers:
[00:19] ⚡ TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
[00:49] 🤔 EditThinker: Unlocking Iterative Reasoning for Any Image Editor
[01:26] 🎨 PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
[02:05] 📈 From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
[02:55] ⚖ Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
[03:38] 🎬 Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
[04:15] 🧠 COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
[04:45] 🎨 RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards
[05:16] 🔍 ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
[05:49] 🎥 World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty
[06:24] 🎮 SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
[07:14] 🤖 Self-Improving VLM Judges Without Human Annotations
[07:54] 🎬 SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
[08:30] 🤝 AI & Human Co-Improvement for Safer Co-Superintelligence
[09:08] 🎬 ProPhy: Progressive Physical Alignment for Dynamic World Simulation
【Weekend Special】 The hottest AI papers of December, week 2 | Code intelligence dissected end to end; open-source DeepSeek-V3.2 takes the top spot. This episode covers the following 5 papers:
[00:32] TOP1 (🔥239) | 🧠 From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
[03:05] TOP2 (🔥169) | 🚀 DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
[04:58] TOP3 (🔥148) | 🎬 LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[07:00] TOP4 (🔥147) | 🚀 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
[08:58] TOP5 (🔥137) | 🤖 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
2025.12.05 | DAComp sets a new target for data agents; streaming avatars run in real time at unlimited length. This episode covers the following 15 papers:
[00:22] 📊 DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
[01:07] 🤖 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
[01:42] 🤖 Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
[02:24] 🤖 ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
[02:54] 🎬 Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
[03:42] 🚀 Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
[04:25] 🔧 PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
[04:56] 🌍 DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
[05:47] 🌀 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
[06:15] 🔍 UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers
[07:02] 🎨 DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
[07:51] ❄ Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
[08:34] 🤖 SIMA 2: A Generalist Embodied Agent for Virtual Worlds
[09:05] 🧮 Model-Based and Sample-Efficient AI-Assisted Math Discovery in Sphere Packing
[09:47] 🧭 SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
2025.12.04 | Qwen3-VL brings multimodal ultra-long context; PretrainZero turns pretraining into reinforcement-driven active learning. This episode covers the following 15 papers:
[00:24] 🧠 Qwen3-VL Technical Report
[00:57] 🧠 PretrainZero: Reinforcement Active Pretraining
[01:36] 🎬 ViDiC: Video Difference Captioning
[02:24] 🧠 OneThinker: All-in-one Reasoning Model for Image and Video
[03:07] 🔄 Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
[03:59] ⚙ Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
[04:46] 🤖 SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
[05:22] 🔧 Thinking with Programming Vision: Towards a Unified View for Thinking with Images
[06:01] 🔄 Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
[06:51] 🎮 RELIC: Interactive Video World Model with Long-Horizon Memory
[07:34] 🍳 CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
[08:26] 🧠 SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
[09:01] 📊 AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
[09:38] 🧠 SkillFactory: Self-Distillation For Learning Cognitive Behaviors
[10:20] 📱 UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
【Month-End Special】 The hottest AI papers of November | The full Kandinsky 5.0 family goes open source; video generation lets models think while they render. This episode covers the following 10 papers:
[00:35] TOP1 (🔥219) | 🎨 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
[02:45] TOP2 (🔥207) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[04:58] TOP3 (🔥191) | 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[07:26] TOP4 (🔥166) | ⚡ ROOT: Robust Orthogonalized Optimizer for Neural Network Training
[09:37] TOP5 (🔥156) | 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[11:54] TOP6 (🔥151) | 🧠 General Agentic Memory Via Deep Research
[13:55] TOP7 (🔥131) | 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[16:01] TOP8 (🔥131) | 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[18:03] TOP9 (🔥126) | 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[20:14] TOP10 (🔥121) | 🚀 Diffusion Language Models are Super Data Learners
2025.12.02 | Code intelligence lands in four steps; LongVT pinpoints understanding in long videos. This episode covers the following 15 papers:
[00:20] 🧠 From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
[01:05] 🎬 LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[01:43] 🔍 Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
[02:21] 🧠 Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
[02:59] 🔍 How Far Are We from Genuinely Useful Deep Research Agents?
[03:47] ⚖ What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
[04:24] 🔍 The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
[05:14] 🎬 Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
[05:58] 🔗 TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
[06:41] 🧠 Rectifying LLM Thought from Lens of Optimization
[07:16] ⚡ Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
[07:54] 🚀 LFM2 Technical Report
[08:31] 🤖 GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
[09:09] 🎬 InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
[09:44] ⚡ VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
2025.12.01 | Z-Image takes the crown with a small, efficient model; REASONEDIT tops the charts by thinking before it draws. This episode covers the following 15 papers:
[00:26] 🚀 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
[01:00] 🤔 REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
[01:25] 🎬 AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
[01:59] 🌉 Vision Bridge Transformer at Scale
[02:35] 🔍 Architecture Decoupling Is Not All You Need For Unified Multimodal Model
[03:23] ⚡ DiP: Taming Diffusion Models in Pixel Space
[03:49] 🧠 Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
[04:19] 🤖 DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
[05:02] ⚡ Adversarial Flow Models
[05:29] 🔬 Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
[06:10] 🎥 Captain Safari: A World Engine
[06:43] 🌍 World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
[07:20] 🔍 The Collapse of Patches
[07:50] 🔍 RefineBench: Evaluating Refinement Capability of Language Models via Checklists
[08:23] 🦷 OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
【Weekend Special】 The hottest AI papers of November, week 5 | Adaptive orthogonalization stabilizes training; GAM turns agent memory into on-demand deep research. This episode covers the following 5 papers:
[00:51] TOP1 (🔥161) | ⚡ ROOT: Robust Orthogonalized Optimizer for Neural Network Training
[02:35] TOP2 (🔥141) | 🧠 General Agentic Memory Via Deep Research
[04:33] TOP3 (🔥110) | 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[06:56] TOP4 (🔥88) | 🎯 SAM 3: Segment Anything with Concepts
[09:12] TOP5 (🔥88) | 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
2025.11.28 | Latent reward models cut latency and GPU memory; canvas-driven multimodal generation crushes SOTA. This episode covers the following 6 papers:
[00:19] 🎬 Video Generation Models Are Good Latent Reward Models
[01:07] 🎨 Canvas-to-Image: Compositional Image Generation with Multimodal Controls
[01:49] 🎨 MIRA: Multimodal Iterative Reasoning Agent for Image Editing
[02:30] 📊 Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
[03:12] 🧠 What does it mean to understand language?
[03:47] 🧠 Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
2025.11.27 | A Russian-language multimodal benchmark fills a gap; latent collaboration brings a 14% speedup. This episode covers the following 15 papers:
[00:22] 🔍 Multimodal Evaluation of Russian-language Architectures
[01:15] 🧠 Latent Collaboration in Multi-Agent Systems
[01:47] 🌍 Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
[02:18] 🎭 Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
[03:10] 📄 NVIDIA Nemotron Parse 1.1
[03:46] 🧠 Monet: Reasoning in Latent Visual Space Beyond Images and Language
[04:25] ⚡ Terminal Velocity Matching
[05:03] 📊 Revisiting Generalization Across Difficulty Levels: It's Not So Easy
[05:42] 🤖 MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
[06:25] ⚡ Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
[06:59] 🎮 UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
[07:47] 🧩 SPHINX: A Synthetic Environment for Visual Perception and Reasoning
[08:33] ⚡ Block Cascading: Training Free Acceleration of Block-Causal Video Models
[09:12] 🏙 RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
[09:58] 📊 I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation
2025.11.26 | An evolutionary framework that breeds solutions with LLMs goes open source; MedSAM3 understands clinical concepts for precise segmentation. This episode covers the following 15 papers:
[00:17] 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[00:57] 🔬 MedSAM3: Delving into Segment Anything with Medical Concepts
[01:34] 🔍 Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
[02:03] 🎨 iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
[02:38] 🕺 SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
[03:18] 🔍 Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
[04:04] 🤖 GigaWorld-0: World Models as Data Engine to Empower Embodied AI
[04:44] 🎯 Soft Adaptive Policy Optimization
[05:14] 🎬 UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
[05:55] 🎯 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
[06:51] 🎨 OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
[07:41] 🎬 ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
[08:13] 🖼 VQ-VA World: Towards High-Quality Visual Question-Visual Answering
[09:06] 🔍 HunyuanOCR Technical Report
[09:48] 🏙 MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
2025.11.25 | Just-in-time compilation keeps agent memory lossless; AutoEnv auto-selects environments for a 20% gain. This episode covers the following 15 papers:
[00:25] 🧠 General Agentic Memory Via Deep Research
[00:52] 🧪 AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
[01:24] 🤖 Computer-Use Agents as Judges for Generative User Interface
[01:55] 🎨 DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
[02:24] 🎨 UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
[03:10] 🔍 DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
[03:46] 🎬 In-Video Instructions: Visual Signals as Generative Control
[04:24] 📊 Budget-Aware Tool-Use Enables Effective Agent Scaling
[05:12] 🎬 Plan-X: Instruct Video Generation via Semantic Planning
[05:54] 🧪 M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
[06:25] 🤖 Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
[07:24] 🎬 HunyuanVideo 1.5 Technical Report
[07:56] 🧠 Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
[08:36] 🧠 MIST: Mutual Information Via Supervised Training
[09:07] 🎨 Controllable Layer Decomposition for Reversible Multi-Layer Image Generation