Navigation World Models

Amir Bar

^{1}

Gaoyue Zhou

^{2}

Danny Tran

^{3}

Trevor Darrell

^{3}

Yann LeCun

^{1,2}

^{1}

FAIR at Meta

^{2}

New York University

^{3}

Berkeley AI Research

Abstract

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems¹.

Project page: https://amirbar.net/nwm

We train a Navigation World Model (NWM) from video footage of robots and their associated navigation actions (a). After training, NWM can evaluate trajectories by synthesizing their videos and scoring the final frame's similarity with the goal (b). We use NWM to plan from scratch or rank experts navigation trajectories, improving downstream visual navigation performance. In unknown environments, NWM can simulate imagined trajectories from a single image (c). In all examples above, the input to the model is the first image and actions, then the model auto-regressively synthesizes future observations. Click on the image to view examples in a browser.

1. Introduction

Navigation is a fundamental skill for any organism with vision, playing a crucial role in survival by allowing agents to locate food, shelter, and avoid predators. In order to successfully navigate environments, smart agents primarily rely on vision, allowing them to construct representations of their surroundings to assess distances and capture landmarks in the environment, all useful for planning a navigation route.

When human agents plan, they often imagine their future trajectories considering constraints and counterfactuals. On the other hand, current state-of-the-art robotics navigation policies ([1, 2]) are "hard-coded", and after training, new constraints cannot be easily introduced (e.g. "no left turns"). Another limitation of current supervised visual navigation models is that they cannot dynamically allocate more computational resources to address hard problems. We aim to design a new model that can mitigate these issues.

In this work, we propose a Navigation World Model (NWM), trained to predict the future representation of a video frame based on past frame representation(s) and action(s) (see Figure 1(a)). NWM is trained on video footage and navigation actions collected from various robotic agents. After training, NWM is used to plan novel navigation trajectories by simulating potential navigation plans and verifying if they reach a target goal (see Figure 1(b)). To evaluate its navigation skills, we test NWM in known environments, assessing its ability to plan novel trajectories either independently or by ranking an external navigation policy. In the planning setup, we use NWM in a Model Predictive Control (MPC) framework, optimizing the action sequence that enables NWM to reach a target goal. In the ranking setup, we assume access to an existing navigation policy, such as NoMaD ([1]), which allows us to sample trajectories, simulate them using NWM, and select the best ones. Our NWM achieves state-of-the-art standalone performance and competitive results when combined with existing methods.

NWM is conceptually similar to recent diffusion-based world models for offline model-based reinforcement learning, such as DIAMOND ([3]) and GameNGen ([4]). However, unlike these models, NWM is trained across a wide range of environments and embodiments, leveraging the diversity of navigation data from robotic and human agents. This allows us to train a large diffusion transformer model capable of scaling effectively with model size and data to adapt to multiple environments. Our approach also shares similarities with Novel View Synthesis (NVS) methods like NeRF ([5]), Zero-1-2-3 ([6]), and GDC ([7]), from which we draw inspiration. However, unlike NVS approaches, our goal is to train a single model for navigation across diverse environments and model temporal dynamics from natural videos, without relying on 3D priors.

To learn a NWM, we propose a novel Conditional Diffusion Transformer (CDiT), trained to predict the next image state given past image states and actions as context. Unlike a DiT ([8]), CDiT’s computational complexity is linear with respect to the number of context frames, and it scales favorably for models trained up to

1 B

parameters across diverse environments and embodiments, requiring

4×4\times

fewer FLOPs compared to a standard DiT while achieving better future prediction results.

In unknown environments, our results show that NWM benefits from training on unlabeled, action- and reward-free video data from Ego4D. Qualitatively, we observe improved video prediction and generation performance on single images (see Figure 1(c)). Quantitatively, with additional unlabeled data, NWM produces more accurate predictions when evaluated on the held-out Stanford Go ([9]) dataset.

Our contributions are as follows. We introduce a Navigation World Model (NWM) and propose a novel Conditional Diffusion Transformer (CDiT), which scales efficiently up to

1 B

parameters with significantly reduced computational requirements compared to standard DiT. We train CDiT on video footage and navigation actions from diverse robotic agents, enabling planning by simulating navigation plans independently or alongside external navigation policies, achieving state-of-the-art visual navigation performance. Finally, by training NWM on action- and reward-free video data, such as Ego4D, we demonstrate improved video prediction and generation performance in unseen environments.

2. Related Work

Goal conditioned visual navigation is an important task in robotics requiring both perception and planning skills ([1, 10, 11, 12, 13, 14, 15]). Given context image(s) and an image specifying the navigation goals, goal-conditioned visual navigation models ([1, 10]) aim to generate a viable path towards the goal if the environment is known, or to explore it otherwise. Recent visual navigation methods like NoMaD ([1]) train a diffusion policy via behavior cloning and temporal distance objective to follow goals in the conditional setting or to explore new environments in the unconditional setting. Previous approaches like Active Neural SLAM ([13]) used neural SLAM together with analytical planners to plan trajectories in the

3 D

environment, while other approaches like ([16]) learn policies via reinforcement learning. Here we show that world models can use exploratory data to plan or improve existing navigation policies.

Differently than in learning a policy, the goal of a world model ([17]) is to simulate the environment, e.g. given the current state and action to predict the next state and an associated reward. Previous works have shown that jointly learning a policy and a world model can improve sample efficiency on Atari ([18, 19, 3]), simulated robotics environments ([20]), and even when applied to real world robots ([21]). More recently, [22] proposed to use a single world model that is shared across tasks by introducing action and task embeddings while [23, 24] proposed to describe actions in language, and [25] proposed to learn latent actions. World models were also explored in the context of game simulation. DIAMOND ([3]) and GameNGen ([4]) propose to use diffusion models to learn game engines of computer games like Atari and Doom. Our work is inspired by these works, and we aim to learn a single general diffusion video transformer that can be shared across many environments and different embodiments for navigation.

In computer vision, generating videos has been a long standing challenge ([26, 27, 28, 29, 30, 31, 32]). Most recently, there has been tremendous progress with text-to-video synthesis with methods like Sora ([33]) and MovieGen ([34]). Past works proposed to control video synthesis given structured action-object class categories ([35]) or Action Graphs ([36]). Video generation models were previously used in reinforcement learning as rewards ([37]), pretraining methods ([38]), for simulating and planning manipulation actions ([39, 40]) and for generating paths in indoor environments ([41, 42]). Interestingly, diffusion models ([43, 44]) are useful both for video tasks like generation ([45]) and prediction ([46]), but also for view synthesis ([47, 48, 49]). Differently, we use a conditional diffusion transformer to simulate trajectories for planning without explicit

3

D representations or priors.

3. Navigation World Models

3.1 Formulation

Next, we turn to describe our NWM formulation. Intuitively, a NWM is a model that receives the current state of the world (e.g. an image observation) and a navigation action describing where to move and how to rotate. The model then produces the next state of the world with respect to the agent's point of view.

We are given an egocentric video dataset together with agent navigation actions

\{(x_0, a_0, ..., x_T, a_T)\}^{n}_{i=1}

, such that

xi∈RH×W×3x_i\in \mathbb{R}^{H\times W \times 3}

is an image and

ai=(u,ϕ)a_i=(u, \phi)

is a navigation command given by translation parameter

u∈R2u\in\mathbb{R}^{2}

that controls the change in forward/backward and right/left motion, as well as

ϕ∈R\phi \in \mathbb{R}

that controls the change in yaw rotation angle.¹

This can be naturally extended to three dimensions by having

u\in\mathbb{R}^{3}

and

\theta\in\mathbb{R}^3

defining yaw, pitch and roll. For simplicity, we assume navigation on a flat surface with fixed pitch and roll.

The navigation actions

{a_i}

can be fully observed (as in Habitat ([50])), e.g. moving forward towards a wall will trigger a response from the environment based on physics, which will lead to the agent staying in place, whereas in other environments the navigation actions can be approximated based on the change in the agent's location.

Our goal is to learn a world model

F

, a stochastic mapping from previous latent observation(s)

sτ\mathbf{s}_\tau

and action

aτa_\tau

to future latent state representation

s_{t+1}

Where

sτ=(sτ,...,sτ−m)\mathbf{s_\tau}=({s_\tau, ..., s_{\tau-m}})

are the past

m

visual observations encoded via a pretrained VAE ([27]). Using a VAE has the benefit of working with compressed latents, allowing to decode predictions back to pixel space for visualization.

Due to the simplicity of this formulation, it can be naturally shared across environments and easily extended to more complex action spaces, like controlling a robotic arm. Different than [19], we aim to train a single world model across environments and embodiments, without using task or action embeddings like in [22].

The formulation in Equation 1 models action but does not allow control over the temporal dynamics. We extend this formulation with a time shift input

k∈[Tmin,Tmax]k\in[{T_\text{min}}, {T_\text{max}}]

, setting

aτ=(u,ϕ,k)a_\tau = (u, \phi, k)

, thus now

aτa_\tau

specifies the time change

k

, used to determine how many steps should the model move into the future (or past). Hence, given a current state

sτs_\tau

, we can randomly choose a timeshift

k

and use the corresponding time shifted video frame as our next state

sτ+1s_{\tau+1}

. The navigation actions can then be approximated to be a summation from time

τ\tau

m=τ+k−1m=\tau+k-1

This formulation allows learning both navigation actions, but also the environment temporal dynamics. In practice, we allow time shifts of up to

±16\pm16

seconds.

One challenge that may arise is the entanglement of actions and time. For example, if reaching a specific location always occurs at a particular time, the model may learn to rely solely on time and ignore the subsequent actions, or vice versa. In practice, the data may contain natural counterfactuals—such as reaching the same area at different times. To encourage these natural counterfactuals, we sample multiple goals for each state during training. We further explore this approach in Section 4.

3.2 Diffusion Transformer as World Model

As mentioned in the previous section, we design

FθF_{\theta}

as a stochastic mapping so it can simulate stochastic environments. This is achieved using a Conditional Diffusion Transformer (CDiT) model, described next.

Conditional Diffusion Transformer Architecture. The architecture we use is a temporally autoregressive transformer model utilizing the efficient CDiT block (see Figure 2), which is applied

×N\times N

times over the input sequence of latents with input action conditioning.

CDiT enables time-efficient autoregressive modeling by constraining the attention in the first attention block only to tokens from the target frame which is being denoised. To condition on tokens from past frames, we incorporate a cross-attention layer, such that every query token from the current target attends to tokens from past frames, which are used as keys and values. The cross-attention then contextualizes the representations using a skip connection layer.

To condition on the navigation action

a∈R3a\in\mathbb{R}^3

, we first map each scalar to

Rd3\mathbb{R}^\frac{d}{3}

by extracting sine-cosine features, then applying a

2

-layer

MLP\text{MLP}

, and concatenating them into a single vector

ψa∈Rd\psi_a \in \mathbb{R}^d

. We follow a similar process to map the timeshift

k∈Rk\in\mathbb{R}

ψk∈Rd\psi_k\in\mathbb{R}^d

and the diffusion timestep

t∈Rt\in\mathbb{R}

ψk∈Rd\psi_k\in\mathbb{R}^d

. Finally we sum all embeddings into a single vector used for conditioning:

ξ\xi

is then fed to an AdaLN ([51]) block to generate scale and shift coefficients that modulate the Layer Normalization ([52]) outputs, as well as the outputs of the attention layers. To train on unlabeled data, we simply omit explicit navigation actions when computing

ξ\xi

(see Equation 3).

An alternative approach is to simply use DiT ([8]), however, applying a DiT on the full input is computationally expensive. Denote

n

the number of input tokens per frame, and

m

the number of frames, and

d

the token dimension. Scaled Multi-head Attention Layer ([53]) complexity is dominated by the attention term

O(m^2n^2d)

, which is quadratic with context length. In contrast, our CDiT block is dominated by the cross-attention layer complexity

O(mn^2d)

, which is linear with respect to the context, allowing us to use longer context size. We analyze these two design choices in Section 4. CDiT resembles the original Transformer Block ([53]), without applying expensive self-attention over the context tokens.

Diffusion Training. In the forward process, noise is added to the target state

sτ+1s_{\tau+1}

according to a randomly chosen timestep

\in \{1, \dots, T\}

. The noisy state

sτ+1(t)s_{\tau+1}^{(t)}

can be defined as:

sτ+1(t)=αtsτ+1+1−αtϵs_{\tau+1}^{(t)} = \sqrt{\alpha_t} s_{\tau+1} + \sqrt{1 - \alpha_t} \epsilon

, where

ϵ∼N(0,I)\epsilon \sim \mathcal{N}(0, I)

is Gaussian noise, and

{αt}\{\alpha_t\}

is a noise schedule controlling the variance. As

t

increases,

sτ+1(t)s_{\tau+1}^{(t)}

converges to pure noise. The reverse process attempts to recover the original state representation

sτ+1s_{\tau+1}

from the noisy version

sτ+1(t)s_{\tau+1}^{(t)}

, conditioned on the context

sτ\mathbf{s}_{\tau}

, the current action

aτa_\tau

, and the diffusion timestep

t

. We define

Fθ(sτ+1∣sτ,aτ,t)F_\theta(s_{\tau+1} | \mathbf{s}_{\tau}, a_\tau, t)

as the denoising neural network model parameterized by

θ\theta

. We follow the same noise schedule and hyperparams of DiT ([8]).

Training Objective. The model is trained to minimize the mean-squared between the clean and predicted target, aiming to learn the denoising process:

In this objective, the timestep

t

is sampled randomly to ensure that the model learns to denoise frames across varying levels of corruption. By minimizing this loss, the model learns to reconstruct

sτ+1s_{\tau+1}

from its noisy version

sτ+1(t)s_{\tau+1}^{(t)}

, conditioned on the context

sτ\mathbf{s}_{\tau}

and action

aτa_{\tau}

, thereby enabling the generation of realistic future frames. Following ([8]), we also predict the covariance matrix of the noise and supervise it with the variational lower bound loss

Lvlb\mathcal{L}_\text{vlb}

[54].

3.3 Navigation Planning with World Models

Here we move to describe how to use a trained NWM to plan navigation trajectories. Intuitively, if our world model is familiar with an environment, we can use it to simulate navigation trajectories, and choose the ones which reach the goal. In an unknown, out of distribution environments, long term planning might rely on imagination.

Formally, given the latent encoding

s_0

and navigation target

s^*

, we look for a sequence of actions

a_0, ..., a_{T-1})

that maximizes the likelihood of reaching

s^*

. Let

S(sT,s∗)\mathcal{S}({s}_T, s^*)

represent the unnormalized score for reaching state

s^*

with

s_T

given the initial condition

s_0

, actions

a=(a0,…,aT−1)\mathbf{a}=(a_0, \dots, a_{T-1})

, and states

s=(s1,…sT)\mathbf{s}=({s}_1, \dots {s}_T)

obtained by autoregressively rolling out the NWM:

s∼Fθ(⋅∣s0,a)\mathbf{s}\sim F_{\theta}(\cdot|s_0, \mathbf{a})

We define the energy function

E(s0,a0,…,aT−1,sT)\mathcal{E}(s_0, a_0, \dots, a_{T-1}, s_T)

, such that minimizing the energy corresponds to maximizing the unnormalized perceptual similarity score and following potential constraints on the states and actions:

\begin{align} \mathcal{E}(s_0, a_0, \dots, a_{T-1}, s_T) = -\mathcal{S}(s_T, s^*) + && \\ \nonumber + \sum_{\tau=0}^{T-1} \mathbb{I}(a_\tau \notin \mathcal{A}_{\text{valid}}) + \sum_{\tau=0}^{T-1} \mathbb{I}(s_\tau \notin \mathcal{S}_{\text{safe}}), \end{align}\tag{4}

💭 Click to ask about this equation

The similarity is computed by decoding

s^*

and

s_T

to pixels using a pretrained VAE decoder ([27]) and then measuring the perceptual similarity ([55, 56]). Constraints like "never go left then right" can be encoded by constraining

aτa_\tau

to be in a valid action set

Avalid\mathcal{A}_{\text{valid}}

, and "never explore the edge of the cliff" by ensuring such states

sτs_\tau

are in

Ssafe\mathcal{S}_{\text{safe}}

I(⋅)\mathbb{I}(\cdot)

denotes the indicator function that applies a large penalty if any action or state constraint is violated.

The problem then reduces to finding the actions that minimize this energy function:

This objective can be reformulated as a Model Predictive Control (MPC) problem, and we optimize it using the Cross-Entropy Method ([57]), a simple derivative-free and population-based optimization method which was recently used with with world models for planning ([58]). We include an overview of the Cross-Entropy Method and the full optimization technical details in Section 7.

Ranking Navigation Trajectories. Assuming we have an existing navigation policy

Π(a∣s0,s∗)\Pi(\mathbf{a}|s_0, s^*)

, we can use NWMs to rank sampled trajectories. Here we use NoMaD ([1]), a state-of-the-art navigation policy for robotic navigation. To rank trajectories, we draw multiple samples from

Π\Pi

and choose the one with the lowest energy, like in Equation 5.

4. Experiments and Results

We describe the experimental setting, our design choices, and compare NWM to previous approaches. Additional results are included in the Supplementary Material.

Table 1: Ablations of predicted goals per sample number, context size, and the use of action and time conditioning. We report prediction results $4$ seconds into the future on RECON.

ablation	lpips $\downarrow$	dreamsim $\downarrow$	psnr $\uparrow$
1	$0.312 \pm 0.001$	$0.098 \pm 0.001$	$15.044 \pm 0.031$
2 #goals	$0.305 \pm 0.000$	$0.096 \pm 0.001$	$15.154 \pm 0.017$
4	$\textbf{0.296}$ $\pm 0.002$	$\textbf{0.091}$ $\pm 0.001$	$\textbf{15.331}$ $\pm 0.027$
1	$0.304 \pm 0.001$	$0.097 \pm 0.001$	$15.223 \pm 0.033$

4.1 Experimental Setting

Datasets. For all robotics datasets (SCAND ([59]), TartanDrive ([60]), RECON ([61]), and HuRoN ([62])), we have access to the location and rotation of robots, allowing us to infer relative actions compare to current location (see Equation 2). To standardize the step size across agents, we divide the distance agents travel between frames by their average step size in meters, ensuring the action space is similar for different agents. We further filter out backward movements, following NoMaD ([1]). Additionally, we use unlabeled Ego4D ([63]) videos, where the only action we consider is time shift. SCAND provides video footage of socially compliant navigation in diverse environments, TartanDrive focuses on off-road driving, RECON covers open-world navigation, HuRoN captures social interactions. We train on unlabeled Ego4D videos and GO Stanford ([9]) serves as an unknown evaluation environment. For the full details, see Section 8.1.

Evaluation Metrics. We evaluate predicted navigation trajectories using Absolute Trajectory Error (ATE) for accuracy and Relative Pose Error (RPE) for pose consistency ([64]). To check how semantically similar are world model predictions to ground truth images, we apply LPIPS ([65]) and DreamSim ([56]), measuring perceptual similarity by comparing deep features, and PSNR for pixel-level quality. For image and video synthesis quality, we use FID ([66]) and FVD ([67]) which evaluate the generated data distribution. See Section 8.1 for more details.

Baselines. We consider all the following baselines.

DIAMOND ([3]) is a diffusion world model based on the UNet ([68]) architecture. We use DIAMOND in the offline-reinforcement learning setting following their public code. The diffusion model is trained to autoregressively predict at $56$ x $56$ resolution alongside an upsampler to obtrain $224$ x $224$ resolution predictions. To condition on continuous actions, we use a linear embedding layer.
GNM ([2]) is a general goal-conditioned navigation policy trained on a dataset soup of robotic navigation datasets with a fully connected trajectory prediction network. GNM is trained on multiple datasets including SCAND, TartanDrive, GO Stanford, and RECON.
NoMaD ([1]) extends GNM using a diffusion policy for predicting trajectories for robot exploration and visual navigation. NoMaD is trained on the same datasets used by GNM and on HuRoN.

Implementation Details. In the default experimental setting we use a CDiT-XL of

1 B

parameters with context of

4

frames, a total batch size of

1024

, and

4

different navigation goals, leading to a final total batch size of

4096

. We use the Stable Diffusion ([27]) VAE tokenizer, similar as in DiT ([8]). We use the AdamW ([69]) optimizer with a learning rate of

8 e - 5

. After training, we sample

5

times from each model to report mean and std results. XL sized model are trained on

8

H100 machines, each with

8

GPUs. Unless otherwise mentioned, we use the same setting as in DiT-*/2 models.

4.2 Ablations

Models are evaluated on single-step

4

seconds future prediction on validation set trajectories on the known environment RECON. We evaluate the performance against the ground truth frame by measuring LPIPS, DreamSim, and PSNR. We provide qualitative examples in Figure 3.

Model Size and CDiT. We compare CDiT (see Section 3.2) with a standard DiT in which all context tokens are fed as inputs. We hypothesize that for navigating known environments, the capacity of the model is the most important, and the results in Figure 5, indicate that CDiT indeed performs better with models of up to

1

B parameters, while consuming less than

2×2\times

FLOPs. Surprisingly, even with equal amount of parameters (e.g, CDiT-L compared to DiT-XL), CDiT is

4×4\times

faster and performs better.

Number of Goals. We train models with variable number of goal states given a fixed context, changing the number of goals from

1

4

. Each goal is randomly chosen between

±16\pm16

seconds window around the current state. The results reported in Table 1 indicate that using

4

goals leads to significantly improved prediction performance in all metrics.

Context Size. We train models while varying the number of conditioning frames from

1

4

(see Table 1). Unsurprisingly, more context helps, and with short context the model often "lose track", leading to poor predictions.

Time and Action Conditioning. We train our model with both time and action conditioning and test how much each input contributes to the prediction performance (we include the results in Table 1. We find that running the model with time only leads to poor performance, while not conditioning on time leads to small drop in performance as well. This confirms that both inputs are beneficial to the model.

Table 3: Goal Conditioned Visual Navigation. ATE and RPE results on RECON, predicting $2$ second trajectories. NWM achieves improved results on all metrics compared to previous approaches NoMaD ([1]) and GNM ([2]).

model	ATE $\downarrow$	RPE $\downarrow$
GNM	$1.87$ $\pm$ $0.00$	$0.73$ $\pm$ $0.00$
NoMaD	$1.93$ $\pm$ $0.04$	$0.52$ $\pm$ $0.00$
NWM + NoMaD ( $\times 16$ )	$1.83$ $\pm$ $0.03$	$0.50$ $\pm$ $0.01$
NWM + NoMaD ( $\times 32$ )	$1.78$ $\pm$ $0.03$	$0.48$ $\pm$ $0.01$
NWM (planning)	$\textbf{1.13}$ $\pm$ $0.02$	$\textbf{0.35}$ $\pm$ $0.01$

Table 4: Planning with Navigation Constraints. We present results for planning with NWM under three action constraints, reporting the differences in final position ( $\delta u$ ) and yaw ( $\delta \phi$ ) relative to the no-constraints baseline. All constraints are met, demonstrating that NWM can effectively adhere to them.

model	Rel. $\delta u$ $\downarrow$	Rel. $\delta \phi$ $\downarrow$
forward first	$+0.36 \pm 0.01$	$+0.61 \pm 0.02$
left-right first	$-0.03 \pm 0.01$	$+0.20 \pm 0.01$
straight then forward	$+0.08 \pm 0.01$	$+0.22 \pm 0.01$

4.3 Video Prediction and Synthesis

We evaluate how well our model follows ground truth actions and predicts future states. The model is conditioned on the first image and context frames, then autoregressively predicts the next state using ground truth actions, feeding back each prediction. We compare predictions to ground truth images at

1

2

4

8

, and

16

seconds, reporting FID and LPIPS on the RECON dataset. Figure 4 shows performance over time compared to DIAMOND at

4

FPS and

1

FPS, showing that NWM predictions are significantly more accurate than DIAMOND. Initially, the NWM

1

FPS variant performs better, but after

8

seconds, predictions degrade due to accumulated errors and loss of context and the

4

FPS becomes superior. See qualitative examples in Figure 3.

Generation Quality. To evaluate video quality, we auto-regressively predict videos at

4

FPS for

16

seconds to create videos, while conditioning on ground truth actions. We then evaluate the quality of videos generated using FVD, compared to DIAMOND ([3]). The results in Figure Table 2 indicate that NWM outputs higher quality videos.

Table 5: **Training on additional unlabeled data improves performance on unseen environments. ** Reporting results on unknown environment (Go Stanford) and known one (RECON). Results reported by evaluating $4$ seconds into the future.

data	unknown environment (Go Stanford)			known environment (RECON)
	lpips $\downarrow$	dreamsim $\downarrow$	psnr $\uparrow$	lpips $\downarrow$	dreamsim $\downarrow$	psnr $\uparrow$
in-domain data	$0.658 \pm 0.002$	$0.478 \pm 0.001$	$11.031 \pm 0.036$	$\textbf{0.295}$ $\pm 0.002$	$\textbf{0.091}$ $\pm 0.001$	$\textbf{15.343}$ $\pm 0.060$
+ Ego4D (unlabeled)	$\textbf{0.652}$ $\pm 0.003$	$\textbf{0.464}$ $\pm 0.003$	$\textbf{11.083}$ $\pm 0.064$	$0.368 \pm 0.003$	$0.138 \pm 0.002$	$14.072 \pm 0.075$

4.4 Planning Using a Navigation World Model

Next, we turn to describe experiments that measure how well can we navigate using a NWM. We include the full technical details of the experiments in Section 8.2.

Standalone Planning. We demonstrate that NWM can be effectively used independently for goal-conditioned navigation. We condition it on past observations and a goal image, and use the Cross-Entropy Method to find a trajectory that minimizes the LPIPS similarity of the last predicted image to the goal image (see Equation 5). To rank an action sequence, we execute the NWM and measure LPIPS between the last state and the goal

3

times to get an average score. We generate trajectories of length

8

, with temporal shift of

k = 0.25

. We evaluate the model performance in Table 3. We find that using a NWM for planning leads to competitive results with state-of-the-art policies.

Planning with Constraints. World models allow planning under constraints—for example, requiring straight motion or a single turn. We show that NWM supports constraint-aware planning. In forward-first, the agent moves forward for 5 steps, then turns for 3. In left-right first, it turns for 3 steps before moving forward. In straight then forward, it moves straight for 3 steps, then forward. Constraints are enforced by zeroing out specific actions; e.g., in left-right first, forward motion is zeroed for the first 3 steps, and Standalone Planning optimizes the rest. We report the norm of the difference in final position and yaw relative to unconstrained planning. Results (Table 4) show NWM plans effectively under constraints, with only minor performance drops (see examples in Figure 8).

Using a Navigation World Model for Ranking. NWM can enhance existing navigation policies in a goal-conditioned navigation. Conditioning NoMaD on past observations and a goal image, we sample

\in \{16, 32\}

trajectories, each of length

8

, and evaluate them by autoregressively following the actions using NWM. Finally, we rank each trajectory's final prediction by measuring LPIPS similarity with the goal image (see Figure 6). We report ATE and RPE on all in-domain datasets (Table 3) and find that NWM-based trajectory ranking improves navigation performance, with more samples yielding better results.

4.5 Generalization to Unknown Environments

Here we experiment with adding unlabeled data, and ask whether NWM can make predictions in new environments using imagination. In this experiment, we train a model on all in-domain datasets, as well as a susbet of unlabeled videos from Ego4D, where we only have access to the time-shift action. We train a CDiT-XL model and test it on the Go Stanford dataset as well as other random images. We report the results in Table 5, finding that training on unlabeled data leads to significantly better video predictions according to all metrics, including improved generation quality. We include qualitative examples in Figure 7. Compared to in-domain (Figure 3), the model breaks faster and expectedly hallucinates paths as it generates traversals of imagined environments.

5. Limitations

We identify multiple limitations. First, when applied to out of distribution data, the model tends to slowly lose context and generates next states that resemble the training data, a phenomena that was observed in image generation and is known as mode collapse ([70, 71]). We include such an example in Figure 9. Second, while the model can plan, it struggles with simulating temporal dynamics like pedestrian motion (although in some cases it does). Both limitations are likely to be solved with longer context and more training data. Additionally, the model currently utilizes

3

DoF navigation actions, but extending to

6

DoF navigation and potentially more (like controlling the joints of a robotic arm) are possible as well, which we leave for future work.

6. Discussion

Our proposed Navigation World Model (NWM) offers a scalable, data-driven approach to learning world models for visual navigation; However, we are not exactly sure yet what representations enable this, as our NWM does not explicitly utilize a structured map of the environment. One idea, is that next frame prediction from an egocentric point of view can drive the emergence of allocentric representations [72]. Ultimately, our approach bridges learning from video, visual navigation, and model-based planning and could potentially open the door to self-supervised systems that not only perceive but can also plan to inform action.

Acknowledgments. We thank Noriaki Hirose for his help with the HuRoN dataset and for sharing his insights, and to Manan Tomar, David Fan, Sonia Joseph, Angjoo Kanazawa, Ethan Weber, Nicolas Ballas, and the anonymous reviewers for their helpful discussions and feedback.

Navigation World Models

Supplementary Material

The structure of the Appendix is as follows: we start by describing how we plan navigation trajectories via Standalone Planning in Section 7, and then include more experiments and results in Section 8.

7. Standalone Planning Optimization

As described in Section 3.3, we use a pretrained NWM to standalone-plan goal-conditioned navigation trajectories by optimizing Equation 5. Here, we provide additional details about the optimization using the Cross-Entropy Method ([57]) and the hyperparameters used. Full standalone navigation planning results are presented in Section 8.2.

We optimize trajectories using the Cross-Entropy Method, a gradient-free stochastic optimization technique for continuous optimization problems. This method iteratively updates a probability distribution to improve the likelihood of generating better solutions. In the unconstrained standalone planning scenario, we assume the trajectory is a straight line and optimize only its endpoint, represented by three variables: a single translation

u

and yaw rotation

ϕ\phi

. We then map this tuple into eight evenly spaced delta steps, applying the yaw rotation at the final step. The time interval between steps is fixed at

k = 0.25

seconds. The main steps of our optimization process are as follows:

Initialization: Define a Gaussian distribution with mean $μ=(μΔx,μΔy,μϕ)\mu = (\mu_{\Delta x}, \mu_{\Delta y}, \mu_\phi)$ and variance $Σ=diag(σΔx2,σΔy2,σϕ2)\Sigma = \mathrm{diag}(\sigma_{\Delta x}^2, \sigma_{\Delta y}^2, \sigma_\phi^2)$ over the solution space.
Sampling: Generate $N = 120$ candidate solutions by sampling from the current Gaussian distribution.
Evaluation: Evaluate each candidate solution by simulating it using the NWM and measuring the LPIPS score between the simulation output and input goal images. Since NWM is stochastic, we evaluate each candidate solution $M$ times and average to obtain a final score.
Selection: Select a subset of the best-performing solutions based on the LPIPS scores.
Update: Adjust the parameters of the distribution to increase the probability of generating solutions similar to the top-performing ones. This step minimizes the cross-entropy between the old and updated distributions.
Iteration: Repeat the sampling, evaluation, selection, and update steps until a stopping criterion (e.g. convergence or iteration limit) is met.

For simplicity, we run the optimization process for a single iteration, which we found effective for short-horizon planning of two seconds, though further improvements are possible with more iterations. When navigation constraints are applied, parts of the trajectory are zeroed out to respect these constraints. For instance, in the "forward-first" scenario, the translation action is

u=(Δx,0)u=(\Delta x, 0)

for the first five steps and

\Delta y)

for the last three steps.

Table 6: **Training on additional unlabeled data improves performance on unseen environments. ** Reporting results on unknown environment (Go Stanford) and known one (RECON). Results reported by evaluating LPIPS $4$ seconds into the future.

	unknown environment	known environments
data	Go Stanford	RECON	HuRoN	SCAND	TartanDrive
in-domain data	$0.658 \pm 0.002$	$\textbf{0.295} \pm 0.002$	$\textbf{0.250} \pm 0.003$	$0.403 \pm 0.002$	$\textbf{0.414} \pm 0.001$
+ Ego4D (unlabeled)	$\textbf{0.652} \pm 0.003$	$0.368 \pm 0.003$	$0.377 \pm 0.002$	$\textbf{0.398} \pm 0.001$	$0.430 \pm 0.000$

8. Experiments and Results

8.1 Experimental Study

We elaborate on the metrics and datasets used.

Evaluation Metrics. We describe the evaluation metrics used to assess predicted navigation trajectories and the quality of images generated by our NWM.

For visual navigation performance, Absolute Trajectory Error (ATE) measures the overall accuracy of trajectory estimation by computing the Euclidean distance between corresponding points in the estimated and ground-truth trajectories. Relative Pose Error (RPE) evaluates the consistency of consecutive poses by calculating the error in relative transformations between them ([64]).

To more rigorously assess the semantics in the world model outputs, we use Learned Perceptual Image Patch Similarity (LPIPS) and DreamSim ([56]), which evaluate perceptual similarity by comparing deep features from a neural network ([55]). LPIPS, in particular, uses AlexNet ([73]) to focus on human perception of structural differences. Additionally, we use Peak Signal-to-Noise Ratio (PSNR) to quantify the pixel-level quality of generated images by measuring the ratio of maximum pixel value to error, with higher values indicating better quality.

To study image and video synthesis quality, we use Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD), which compare the feature distributions of real and generated images or videos. Lower FID and FVD scores indicate higher visual quality ([66, 67]).

Datasets. For all robotics datasets, we have access to the location and rotation of the robots, and we use this to infer the actions as the delta in location and rotation. We remove all backward movement which can be jittery following NoMaD ([1]), thereby splitting the data to forward walking segments for SCAND ([59]), TartanDrive ([60]), RECON ([61]), and HuRoN ([62]). We also utilize unlabeled Ego4D videos, where we only use time shift as action. Next, we describe each individual dataset.

SCAND ([59]) is a robotics dataset consisting of socially compliant navigation demonstrations using a wheeled Clearpath Jackal and a legged Boston Dynamics Spot. SCAND has demonstrations in both indoor and outdoor settings at UT Austin. The dataset consists of $8.7$ hours, $138$ trajectories, $25$ miles of data and we use the corresponding camera poses. We use $484$ video segments for training and $121$ video segments for testing. Used for training and evaluation.
TartanDrive ([60]) is an outdoor off-roading driving dataset collected using a modified Yamaha Viking ATV in Pittsburgh. The dataset consists of $5$ hours and $630$ trajectories. We use $1, 000$ video segments for training and $251$ video segments for testing.
RECON ([61]) is an outdoor robotics dataset collected using a Clearpath Jackal UGV platform. The dataset consists of $40$ hours across $9$ open-world environments. We use $9, 468$ video segments for training and $2, 367$ video segments for testing. Used for training and evaluation.
HuRoN ([62]) is a robotics dataset consisting of social interactions using a Robot Roomba in indoor settings collected at UC Berkeley. The dataset consists of over $75$ hours in $5$ different environments with $4, 000$ human interactions. We use $2, 451$ video segments for training and $613$ video segments for testing. Used for training and evaluation.
GO Stanford ([9, 74]), a robotics datasets capturing the fisheye video footage of two different teleoperated robots, collected at at least $27$ different Stanford building with around $25$ hours of video footage. Due to the low resolution images, we only use it for out of domain evaluation.
Ego4D ([63]) is a large-scale egocentric dataset consisting of $3, 670$ hours across $74$ locations. Ego4D consists a variety of scenarios such as Arts $&\&$ Crafts, Cooking, Construction, Cleaning $&\&$ Laundry, and Grocery Shopping. We use only use videos which involve visual navigation such as Grocery Shopping and Jogging. We use a total $1619$ videos of over $908$ hours for training only. Only used for unlabeled training unlabeled training. The videos we use are from the following Ego4D scenarios: "Skateboard/scooter", "Roller skating", "Football", "Attending a festival or fair", "Gardener", "Mini golf", "Riding motorcycle", "Golfing", "Cycling/jogging", "Walking on street", "Walking the dog/pet", "Indoor Navigation (walking)", "Working in outdoor store", "Clothes/other shopping", "Playing with pets", "Grocery shopping indoors", "Working out outside", "Farmer", "Bike", "Flower Picking", "Attending sporting events (watching and participating)", "Drone flying", "Attending a lecture/class", "Hiking", "Basketball", "Gardening", "Snow sledding", "Going to the park".

Visual Navigation Evaluation Set. Our main finding when constructing visual navigation evaluation sets is that forward motion is highly prevalent, and if not carefully accounted for, it can dominate the evaluation data. To create diverse evaluation sets, we rank potential evaluation trajectories based on how well they can be predicted by simply moving forward. For each dataset, we select the

100

examples that are least predictable by this heuristic and use them for evaluation.

Time Prediction Evaluation Set. Predicting the future frame after

k

seconds is more challenging than estimating a trajectory, as it requires both predicting the agent's trajectory and its orientation in pixel space. Therefore, we do not impose additional diversity constraints. For each dataset, we randomly select

500

test prediction examples.

8.2 Experiments and Results

Training on Additional Unlabeled Data. We include results for additional known environments in Table 6 and Figure 10. We find that in known environments, models trained exclusively with in-domain data tend to perform better, likely because they are better tailored to the in-domain distribution. The only exception is the SCAND dataset, where dynamic objects (e.g. humans walking) are present. In this case, adding unlabeled data may help improve performance by providing additional diverse examples.

Known Environments. We include additional visualization results of following trajectories using NWM in the known environments RECON (Figure 11), SCAND (Figure 12), HuRoN (Figure 13), and Tartan Drive (Figure 14). Additionally, we include full FVD comparison of DIAMOND and NWM in Table 7.

Table 7: Comparison of Video Synthesis Quality. $16$ second videos generated at 4 FPS, reporting FVD (lower is better).

dataset	DIAMOND	NWM (ours)
RECON	$762.734 \pm 3.361$	$\textbf{200.969} \pm 5.629$
HuRoN	$881.981 \pm 11.601$	$\textbf{276.932} \pm 4.346$
TartanDrive	$2289.687 \pm 6.991$	$\textbf{494.247} \pm 14.433$

Table 8: Goal Conditioned Visual Navigation. ATE and RPE results on on all in domain datasets, predicting trajectories of up to $2$ seconds. NWM achieves improved results on all metrics compared to previous approaches NoMaD ([1]) and GNM ([2]).

model	RECON		HuRoN		Tartan		SCAND
	ATE	RTE	ATE	RTE	ATE	RTE	ATE	RTE
Forward	$1.92$ $\pm$ $0.00$	$0.54$ $\pm$ $0.00$	$4.14$ $\pm$ $0.00$	$1.05$ $\pm$ $0.00$	$5.75$ $\pm$ $0.00$	$1.19$ $\pm$ $0.00$	$2.97$ $\pm$ $0.00$	$0.62$ $\pm$ $0.00$
GNM	$1.87$ $\pm$ $0.00$	$0.73$ $\pm$ $0.00$	$3.71$ $\pm$ $0.00$	$1.00$ $\pm$ $0.00$	$6.65$ $\pm$ $0.00$	$1.62$ $\pm$ $0.00$	$2.12$ $\pm$ $0.00$	$0.61$ $\pm$ $0.00$
NoMaD	$1.95$ $\pm$ $0.05$	$0.53$ $\pm$ $0.01$	$3.73$ $\pm$ $0.04$	$0.96$ $\pm$ $0.01$	$6.32$ $\pm$ $0.03$	$1.31$ $\pm$ $0.01$	$2.24$ $\pm$ $0.03$	$0.49$ $\pm$ $0.01$

Planning (Ranking). Full goal-conditioned navigation results for all in-domain datasets are presented in Table 8. Compared to NoMaD, we observe consistent improvements when using NWM to select from a pool of

16

trajectories, with further gains when selecting from a larger pool of

32

. For Tartan Drive, we note that the dataset is heavily dominated by forward motion, as reflected in the results compared to the "Forward" baseline, a prediction model that always selects forward-only motion.

Standalone Planning. For standalone planning, we run the optimization procedure outlined in Section 7 for

1

step, and evaluate each trajectories for 3 times. For all datasets, we initialize

μΔy\mu_{\Delta y}

and

μϕ\mu_\phi

to be 0, and

σΔy2\sigma_{\Delta y}^2

and

σϕ2\sigma_\phi^2

to be 0.1. We use different

(μΔx,σΔx2)(\mu_{\Delta x}, \sigma_{\Delta x}^2)

across each dataset:

(- 0.1, 0.02)

for RECON,

(0.5, 0.07)

for TartanDrive,

(- 0.25, 0.04)

for SCAND, and

(- 0.33, 0.03)

for HuRoN. We include the full standalone navigation planning results in Table 8. We find that using planning in the stand-alone setting performs better compared to other approaches, and specifically previous hard-coded policies.

Real-World Applicability. A key bottleneck in deploying NWM in real-world robotics is inference speed. We evaluate methods to improve NWM efficiency and measure their impact on runtime. We focus on using NWM with a generative policy (Section 3.3) to rank

32

four-second trajectories. Since trajectory evaluation is parallelizable, we analyze the runtime of simulating a single trajectory. We find that existing solutions can already enable real-time applications of NWM at 2-10HZ (Table 9).

Table 9: Runtime (seconds) on an NVIDIA RTX 6000 Ada card.

NWM	+Time Skip	+Distillation.	+Quant. 4-bit
$30.3 \pm 0.2$	$14.7 \pm 0.1$	$0.4 \pm 0.1$	$0.1$ (est. [75])

Inference time can be accelerated by composing every adjacent pair of actions (via Equation 2) then simulating only

8

future states instead of

16

("Time Skip"), which does not degrade navigation performance. Reducing the diffusion denoising steps from

250

6

by model distillation [76] further speeds up inference with minor visual quality loss.³ Taken together, these two ideas can enable NWM to run in real time. Quantization to 4-bit, which we haven't explored, can lead to a

×4\times4

speedup without performance hit [75].

Using the distillation implementation for DiTs from https://github.com/hao-ai-lab/FastVideo

Test-time adaptation. Test-time adaptation has shown to improve visual navigation [15, 77]. What is the relation between planning using a world model and test-time adaptation? We hypothesize that the two ideas are orthogonal, and include test-time adaptation results. We consider a simplified adaptation approach by fine-tuning NWM for

2

k steps on trajectories from an unknown environment. We show that this adaptation improves trajectory simulation in this environment (see "ours+TTA" in Table 10), where we also include additional baselines and ablations.

Table 10: Results in unknown environment ("Go Stanford"). Reporting lpips on $4$ seconds future prediction. Lower is better.

CDiT-L	context 2	action only	goals 2	ours	ours + TTA
$0.656$	$0.655$	$0.661$	$0.654$	$0.652$	$\textbf{0.650}$

References

[1] Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024.

[2] Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.

[3] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In Thirty-eighth Conference on Neural Information Processing Systems.

[4] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.

[5] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[6] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023.

[7] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. 2024.

[8] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.

[9] Noriaki Hirose, Amir Sadeghian, Marynel Vázquez, Patrick Goebel, and Silvio Savarese. Gonet: A semi-supervised deep learning approach for traversability estimation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3044–3051. IEEE, 2018.

[10] Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. Vint: A foundation model for visual navigation. In 7th Annual Conference on Robot Learning.

[11] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 2050–2053, 2018.

[12] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. In International Conference on Learning Representations, 2022.

[13] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In International Conference on Learning Representations.

[14] Zipeng Fu, Ashish Kumar, Ananye Agarwal, Haozhi Qi, Jitendra Malik, and Deepak Pathak. Coupling vision and proprioception for navigation of legged robots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17273–17283, 2022.

[15] J Frey, M Mattamala, N Chebrolu, C Cadena, M Fallon, and M Hutter. Fast traversability estimation for wild visual navigation. Robotics: Science and Systems Proceedings, 19, 2023.

[16] Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In International Conference on Learning Representations.

[17] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[18] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, b.

[19] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, a.

[20] Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023.

[21] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on robot learning, pages 2226–2240. PMLR, 2023.

[22] Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations.

[23] Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations.

[24] Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language, 2024b.

[25] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

[26] Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In Forty-first International Conference on Machine Learning.

[27] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[28] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.

[29] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.

[30] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

[31] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018b.

[32] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.

[33] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators, 2024.

[34] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

[35] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, 2018a.

[36] Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, and Amir Globerson. Compositional video synthesis with action graphs. In International Conference on Machine Learning, pages 662–673. PMLR, 2021.

[37] Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.

[38] Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E. Taylor, and Sergey Levine. Video occupancy models, 2024.

[39] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.

[40] Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation, 2024.

[41] Noriaki Hirose, Fei Xia, Roberto Mart'ın-Mart'ın, Amir Sadeghian, and Silvio Savarese. Deep visual mpc-policy learning for navigation. IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019b.

[42] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14738–14748, 2021.

[43] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.

[44] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[45] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems, 35:23371–23385, 2022.

[46] Han Lin, Tushar Nagarajan, Nicolas Ballas, Mido Assran, Mojtaba Komeili, Mohit Bansal, and Koustuv Sinha. Vedit: Latent prediction architecture for procedural video representation learning, 2024a.

[47] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4217–4229, 2023.

[48] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations.

[49] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In Computer Vision – ECCV 2024, pages 197–214, Cham, 2025. Springer Nature Switzerland.

[50] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.

[51] Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization, 2019.

[52] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ArXiv e-prints, pages arXiv–1607, 2016.

[53] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

[54] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

[55] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018a.

[56] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. Advances in Neural Information Processing Systems, 36, 2024.

[57] Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.

[58] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning, 2024.

[59] Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.

[60] Samuel Triest, Matthew Sivaprakasam, Sean J Wang, Wenshan Wang, Aaron M Johnson, and Sebastian Scherer. Tartandrive: A large-scale dataset for learning off-road dynamics models. In 2022 International Conference on Robotics and Automation (ICRA), pages 2546–2552. IEEE, 2022.

[61] Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. arXiv preprint arXiv:2104.05859, 2021.

[62] Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 2023.

[63] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

[64] Jürgen Sturm, Wolfram Burgard, and Daniel Cremers. Evaluating egomotion and structure-from-motion approaches using the tum rgb-d benchmark. In Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), page 6, 2012.

[65] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018b.

[66] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[67] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.

[68] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.

[69] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[70] Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in gans. In 2020 international joint conference on neural networks (ijcnn), pages 1–10. IEEE, 2020.

[71] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017.

[72] Benigno Uria, Borja Ibarz, Andrea Banino, Vinicius Zambaldi, Dharshan Kumaran, Demis Hassabis, Caswell Barry, and Charles Blundell. A model of egocentric to allocentric understanding in mammalian brains. bioRxiv, 2022.

[73] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.

[74] Noriaki Hirose, Amir Sadeghian, Fei Xia, Roberto Mart'ın-Mart'ın, and Silvio Savarese. Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera. IEEE Robotics and Automation Letters, 2019a.

[75] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

[76] Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models. Advances in Neural Information Processing Systems, 37:83951–84009, 2024.

[77] Junyu Gao, Xuan Yao, and Changsheng Xu. Fast-slow test-time adaptation for online vision-and-language navigation. In Proceedings of the 41st International Conference on Machine Learning, pages 14902–14919. PMLR, 2024.