VENTURA turns language instructions into safe, precise robot paths by adapting pre-trained image diffusion models for visual planning, enabling adaptive navigation behaviors in open-world environments.
General-purpose robots must follow diverse human instructions and navigate safely in unstructured environments. While Vision-Language Models (VLMs) provide strong priors, they remain hard to steer for navigation. We present VENTURA, a vision-language navigation system that fine-tunes internet-pretrained image diffusion models for path planning. Rather than directly predicting low-level actions, VENTURA first produces a visual path mask that captures fine-grained, context-aware behaviors, which a lightweight policy translates into executable trajectories. To scale training, we automatically generate supervision from self-supervised tracking and VLM-augmented captions, avoiding costly manual labels. In real-world evaluations, VENTURA improves success rates by 33% and reduces collisions by 54% over foundation model baselines, while generalizing to unseen task combinations with emergent compositional skills.
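To make the two-stage design concrete, below is a minimal Python sketch of the inference flow. All names (`diffusion_path_mask`, `mask_to_waypoints`) are hypothetical and the mask generator is a stub standing in for the fine-tuned diffusion model; the sketch illustrates the idea of predicting a visual path mask and converting it to waypoints, not the released implementation.

```python
# Minimal sketch of a two-stage "mask then trajectory" pipeline.
# Hypothetical names and a stubbed mask generator; not VENTURA's actual code.
import numpy as np

H, W = 96, 128  # illustrative image resolution


def diffusion_path_mask(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stage 1 (stub): a fine-tuned image diffusion model would denoise a
    path mask conditioned on the camera image and the language instruction.
    Here we fabricate a mask: a band of traversable pixels bending right."""
    mask = np.zeros((H, W), dtype=np.float32)
    for row in range(H // 2, H):
        center = W // 2 + (row - H // 2) // 4  # pretend the path curves right
        mask[row, max(0, center - 5):center + 5] = 1.0
    return mask


def mask_to_waypoints(mask: np.ndarray, num_points: int = 5) -> np.ndarray:
    """Stage 2 (stub): a lightweight policy converts the visual path mask
    into an executable trajectory. This toy version takes the mask centroid
    of a few image rows as (row, col) waypoints in pixel space."""
    rows = np.linspace(H - 1, H // 2, num_points).astype(int)
    waypoints = []
    for r in rows:
        cols = np.nonzero(mask[r] > 0.5)[0]
        if len(cols) > 0:
            waypoints.append((r, float(cols.mean())))
    return np.array(waypoints)


image = np.zeros((H, W, 3), dtype=np.uint8)  # placeholder camera frame
mask = diffusion_path_mask(
    image, "Continue on the paved path, avoiding metal fences.")
trajectory = mask_to_waypoints(mask)
print(trajectory)  # waypoints a downstream controller would track
```

The split mirrors the system description above: the expensive, instruction-conditioned reasoning lives in the image-space mask prediction, so the policy that turns the mask into a trajectory can stay lightweight.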
“Continue on the trail. Avoid foliage and loose debris.”
“Continue on the paved path, avoiding metal fences.”
“Follow the crosswalk markings. Keep a safe distance from pedestrians.”
VENTURA has limitations. We view the challenges below as opportunities to further improve the expressiveness and deployability of our approach: