VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation

VENTURA turns language instructions into safe, precise robot paths by adapting pre-trained image diffusion models for visual planning, enabling adaptive navigation behaviors in open-world environments.

Abstract

General-purpose robots must follow diverse human instructions and navigate safely in unstructured environments. While Vision-Language Models (VLMs) provide strong priors, they remain hard to steer for navigation. We present VENTURA, a vision-language navigation system that finetunes internet-pretrained image diffusion models for path planning. Rather than directly predicting low-level actions, VENTURA first produces a visual path mask that captures fine-grained, context-aware behaviors, which a lightweight policy translates into executable trajectories. To scale training, we automatically generate supervision from self-supervised tracking and VLM-augmented captions, avoiding costly manual labels. In real-world evaluations, VENTURA improves success rates by 33% and reduces collisions by 54% over foundation model baselines, while generalizing to unseen task combinations with emergent compositional skills.

Adapting Image Diffusion Models for Visual Planning

VENTURA conditions on the image observation and goal instruction to denoise a path mask from random Gaussian noise. A lightweight policy conditions on the predicted path mask and image features to produce a sequence of navigation waypoints. In this manner, VENTURA leverages the strong priors of pre-trained diffusion models to produce precise, long-range navigation plans that can be flexibly adapted to diverse tasks and environments.
VENTURA architecture overview
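To make the two-stage design concrete, here is a minimal Python sketch of the inference loop, assuming toy stand-ins for every component: the denoiser, the waypoint policy, the simplified sampler schedule, the placeholder instruction embedding, and all tensor shapes are illustrative assumptions, not VENTURA's released implementation.

```python
# Hedged sketch of two-stage inference: denoise a path mask conditioned on the
# observation and instruction, then decode waypoints from the mask. All module
# names and shapes below are hypothetical stand-ins.
import torch
import torch.nn as nn

class PathMaskDenoiser(nn.Module):
    """Toy stand-in for the finetuned image diffusion backbone.
    Predicts the noise in a path-mask tensor, conditioned on the image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1 + 3, 1, kernel_size=3, padding=1)  # toy epsilon-predictor

    def forward(self, noisy_mask, image, text_embedding, t):
        # A real model would also inject t and text_embedding (e.g. via cross-attention).
        return self.net(torch.cat([noisy_mask, image], dim=1))

class WaypointPolicy(nn.Module):
    """Toy lightweight policy: path mask + image -> K (x, y) waypoints."""
    def __init__(self, num_waypoints=8):
        super().__init__()
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_waypoints * 2))
        self.num_waypoints = num_waypoints

    def forward(self, path_mask, image):
        x = torch.cat([path_mask, image], dim=1)
        return self.head(x).view(-1, self.num_waypoints, 2)

@torch.no_grad()
def plan(image, text_embedding, denoiser, policy, steps=50):
    """Denoise a path mask from Gaussian noise, then decode waypoints from it."""
    mask = torch.randn(image.shape[0], 1, *image.shape[2:])   # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):                          # simplified DDPM-style loop
        eps = denoiser(mask, image, text_embedding, t)
        a_bar = alphas_bar[t]
        mask = (mask - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # estimate the clean mask
        if t > 0:                                             # re-noise to the previous step
            a_prev = alphas_bar[t - 1]
            mask = a_prev.sqrt() * mask + (1 - a_prev).sqrt() * torch.randn_like(mask)
    path_mask = mask.sigmoid()                                # normalize to [0, 1]
    return policy(path_mask, image)

# Usage with toy tensors:
image = torch.rand(1, 3, 64, 64)
text_embedding = torch.zeros(1, 512)                          # placeholder instruction embedding
waypoints = plan(image, text_embedding, PathMaskDenoiser(), WaypointPolicy())
print(waypoints.shape)  # torch.Size([1, 8, 2])
```

The point of the split is that the diffusion model only commits to a visual path mask, while a much smaller policy handles the mapping from mask and image features to executable waypoints.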

Scalable Data Curation with Self-Supervised Point Tracking

Modern policy learning methods for navigation often rely on datasets with accurate odometry and carefully designed hardware setups, which limits scalability. To scale training from heterogeneous demonstrations, we leverage CoTracker, an off-the-shelf point tracking model, to automatically label uncalibrated, egocentric video with path masks. We further caption these videos with VLMs and human-in-the-loop assistance to generate diverse language annotations.
VENTURA data curation pipeline
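Once a tracker such as CoTracker provides per-frame 2D tracks of ground points that the robot later traverses, the labeling step reduces to rasterizing those points into a path mask for a reference frame. The sketch below illustrates only that rasterization step; the helper name, arguments, disk radius, and use of OpenCV are assumptions for illustration, not VENTURA's exact pipeline.

```python
# Hedged sketch: turn 2D point tracks (assumed to come from an off-the-shelf
# tracker such as CoTracker) into a binary path mask for one reference frame.
import numpy as np
import cv2

def rasterize_path_mask(tracks, frame_hw, radius=6):
    """tracks: (N, 2) array of (x, y) pixel locations, in the reference frame,
    of ground points the robot traverses later in the video."""
    h, w = frame_hw
    mask = np.zeros((h, w), dtype=np.uint8)
    pts = np.round(tracks).astype(int)
    pts = pts[(pts[:, 0] >= 0) & (pts[:, 0] < w) & (pts[:, 1] >= 0) & (pts[:, 1] < h)]
    for i in range(len(pts) - 1):
        # Connect consecutive traversed points so the mask forms a continuous ribbon.
        p0 = (int(pts[i][0]), int(pts[i][1]))
        p1 = (int(pts[i + 1][0]), int(pts[i + 1][1]))
        cv2.line(mask, p0, p1, color=255, thickness=2 * radius)
    return mask

# Usage with synthetic tracks (a gently curving path toward the horizon):
h, w = 240, 320
ys = np.linspace(h - 1, h // 2, num=40)
xs = w / 2 + 30 * np.sin(np.linspace(0, np.pi, num=40))
mask = rasterize_path_mask(np.stack([xs, ys], axis=1), (h, w))
```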

Language Annotated Navigation Dataset

Our dataset contains 10 hours of egocentric RGB video with path masks automatically derived from self-supervised point tracking; 1.5 of those hours are annotated with language instructions describing navigation behaviors. To support future research on language-conditioned navigation, we release the dataset and provide qualitative examples of the language annotations and path masks below.
VENTURA dataset overview
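For readers planning to build on the release, one plausible per-sample layout is sketched below. The `NavSample` field names, paths, and types are hypothetical, chosen only to make the dataset description concrete; they are not the released schema.

```python
# Hypothetical layout of one training sample (field names are assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class NavSample:
    frame_path: str             # egocentric RGB frame extracted from video
    path_mask_path: str         # binary path mask derived from point tracking
    instruction: Optional[str]  # language annotation (present for ~1.5 of the 10 hours)

sample = NavSample(
    frame_path="frames/trail_003/000120.jpg",
    path_mask_path="masks/trail_003/000120.png",
    instruction="Continue on the trail. Avoid foliage and loose debris.",
)
```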

Real-World Evaluation and Baselines

We evaluate VENTURA in over 150 real-world, closed-loop trials spanning navigation tasks that require diverse skills, including obstacle avoidance, object goal navigation, and preference-aware terrain navigation. We compare against VLA and robot foundation models that leverage VLMs and web-scale data. VENTURA outperforms all baselines by a significant margin, demonstrating its ability to adapt pre-trained diffusion models for precise and safe navigation in unstructured environments.
VENTURA evaluation results

Deployment: Language-Conditioned Navigation

In addition to quantitative experiments, we deploy VENTURA on the same Unitree Go2w robot platform across a variety of urban and off-road trails. We perform offline model inference on human-teleoperated robot data for a fair qualitative comparison against baselines. The model successfully follows diverse language instructions in unseen environments, demonstrating its ability to generalize to new tasks and settings.

“Continue on the trail. Avoid foliage and loose debris.”

“Continue on the paved path, avoiding metal fences.”

“Follow the crosswalk markings. Keep a safe distance from pedestrians.”

Limitations

VENTURA limitations examples

VENTURA is not without its limitations. We view the challenges below as future opportunities to enhance our approach's expressiveness and deployability:

  • Generalization to novel motion primitives: Our model does not extrapolate to complex motion patterns (e.g., circling around a house).
  • Understanding object dynamics: Complex motion dynamics (e.g., social behaviors or vehicle dynamics) are not addressed by our approach.
  • Image space planning limitations: Certain navigation plans (e.g., reversing or turning in place) may be challenging to represent and execute in image space.
  • Temporal consistency: Our framework does not enforce temporal consistency across timesteps, which may result in unstable behavior when plans are generated asynchronously.