EP75 · Video Agents Are Next, Voice Agent · 06-02 - BestBlogs Podcast

Deep Dive 1: Why Video Agent models are next - Ethan He, xAI Grok Imagine Lead

From Latent.Space: Ethan He (xAI Grok Imagine lead) argues that video models get their intelligence from LLMs, not from video data, so their quality ceiling tracks LLM progress. The next Sora won't be a better video model but a video agent that plans, generates, edits, critiques, and iterates, mirroring how coding shifted from one-shot output to agentic workflows. He presents Grok Imagine Agent Mode as the first real proof of this thesis.
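The plan, generate, edit, critique, iterate cycle described above can be sketched as a plain control loop. This is a minimal illustration, not how Grok Imagine Agent Mode works: `run_video_agent` and the four callables it takes (`plan`, `generate`, `critique`, `edit`) are hypothetical names, and the convention that `critique` returns `None` when the draft is acceptable is an assumption made for the example.

```python
from typing import Callable, Optional


def run_video_agent(
    prompt: str,
    plan: Callable[[str], list],
    generate: Callable[[list], str],
    critique: Callable[[str, str], Optional[str]],
    edit: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Hypothetical agent loop: plan once, generate a draft, then
    alternate critique and edit until the critic is satisfied or
    the round budget runs out."""
    shots = plan(prompt)        # break the request into steps
    draft = generate(shots)     # first one-shot attempt
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if feedback is None:    # assumed convention: None means "good enough"
            break
        draft = edit(draft, feedback)
    return draft


# Toy stand-ins so the loop can be exercised without any model.
def toy_plan(prompt):
    return prompt.split(" then ")

def toy_generate(shots):
    return " | ".join(shots)

def toy_critique(prompt, draft):
    return None if draft.endswith("[graded]") else "add color grade"

def toy_edit(draft, feedback):
    return draft + " [graded]"
```

In a real system each callable would wrap a model call; the point of the sketch is only that the intelligence sits in the loop around the generator rather than in a single generation pass.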

Deep Dive 2: Engineering voice agents: Latency, quality, and scale - Rishabh Bhargava, Together AI [Video]

From AI Engineer: Rishabh Bhargava (Together AI) maps the constraints on production voice agents: humans expect responses within about 300ms, and anything over 500ms kills engagement. The optimal pipeline chains streaming speech-to-text (STT) → an 8B–30B LLM with 200–300ms time to first token (TTFT) → text-to-speech (TTS) with a real-time factor (RTF) below 1.0. Co-locating the infrastructure alone cuts latency by 30%. The Thinker-Talker pattern sends filler audio immediately while a heavier, guarded model processes the actual logic asynchronously. This is the trick that makes safety checks affordable at conversational speed.
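The Thinker-Talker pattern described above can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not the implementation from the talk: `talker`, `thinker`, `respond`, and the `play` callback are hypothetical names, and the short `asyncio.sleep` stands in for a slower guarded model call.

```python
import asyncio


async def talker(play):
    # Fast path: speak a short filler right away so the user hears
    # something inside the conversational latency window.
    await play("One moment, let me check that.")


async def thinker(user_text):
    # Slow path: placeholder for a heavier model plus safety checks.
    await asyncio.sleep(0.05)  # assumed stand-in for model latency
    return f"Answer to: {user_text}"


async def respond(user_text, play):
    # Start the slow work first, then cover the wait with filler audio.
    pending = asyncio.create_task(thinker(user_text))
    await talker(play)
    answer = await pending
    await play(answer)
    return answer


def run_demo():
    spoken = []

    async def play(text):
        # Hypothetical audio sink; here it only records what was "spoken".
        spoken.append(text)

    asyncio.run(respond("What is my balance?", play))
    return spoken
```

Calling `run_demo()` returns the filler first and the real answer second, which is the ordering the pattern relies on: the user never waits in silence for the guarded model.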

Deep Dive 3: RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem

From Towards Data Science: A team ran five Optuna sweeps and fine-tuned embeddings for six months, yet production accuracy never moved. The bug was in the parser. RAG is a search and engineering problem, not an ML problem: each wrong answer is an individually traceable failure, not statistical noise. Chunk size is a config choice, not a hyperparameter, so you need to read your documents rather than run grid searches. Fix RAG by engineering the structure better, not by training the model more.
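To make "individually traceable" concrete, here is a minimal sketch that walks one failing question through the pipeline stages and reports the first stage where the known answer text disappears. The function name `trace_failure`, its arguments, and the use of exact substring matching are all assumptions for illustration; a real pipeline would need a more tolerant match.

```python
def trace_failure(answer_span, parsed_text, chunks, retrieved):
    """Return the first stage at which the known answer text is lost.

    answer_span: text that must be present for a correct answer
    parsed_text: full text the parser extracted from the source document
    chunks:      all chunks produced from parsed_text
    retrieved:   the chunks actually returned for the question
    """
    if answer_span not in parsed_text:
        return "parser"      # the text never made it out of the document
    if not any(answer_span in chunk for chunk in chunks):
        return "chunker"     # the span was split across chunk boundaries
    if not any(answer_span in chunk for chunk in retrieved):
        return "retriever"   # the right chunk exists but was not returned
    return "generator"       # the evidence was retrieved; the LLM step failed
```

For example, if a table cell holding the answer is dropped during PDF extraction, the function returns `"parser"`, and no amount of embedding fine-tuning or hyperparameter search downstream can recover it.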
