Planning with Reasoning using Vision Language World Model

Executive Summary

Context and Purpose

Organizations are developing AI systems that can plan and execute complex, multi-step tasks in the real world—from assisting users with everyday activities to controlling robots. Current approaches either rely on large language models that lack grounding in visual reality, or train on limited simulated environments that do not capture real-world diversity. This work addresses the problem of teaching AI to understand how actions change the world by learning directly from massive amounts of natural video, enabling the system to plan effectively for high-level, long-horizon tasks.

What Was Done

The team developed the Vision Language World Model (VLWM), a foundation model that learns to predict how the world evolves in response to actions by watching videos of people performing tasks. The system represents world states using natural language rather than raw pixels, making predictions interpretable and computationally efficient.

The training process involves two stages: first, videos are compressed into hierarchical "Trees of Captions" that capture both fine details and long-term progression; second, a large language model extracts structured representations showing goals, actions, and resulting world state changes. The model was trained on 180,000 videos spanning over 800 days of content, including cooking tutorials, how-to videos, and first-person recordings of daily activities.

VLWM operates in two modes. System-1 provides fast, reactive planning by directly generating action sequences. System-2 enables reflective reasoning: the model generates multiple candidate plans, simulates their outcomes, and selects the plan that minimizes the distance between the predicted final state and the desired goal. This distance is measured by a separately trained "critic" model that learns through self-supervision to assign lower costs to valid progress and higher costs to irrelevant or out-of-order actions.

Main Findings

VLWM achieved state-of-the-art performance on the Visual Planning for Assistance benchmark, with relative improvements of 20% in success rate, 10% in accuracy, and 4% in mean IoU compared to previous best methods. On robot question-answering tasks, it outperformed strong baselines with a BLEU-1 score of 74.2.

Human evaluations using a newly developed PlannerArena system showed that plans generated by VLWM's reflective System-2 mode were strongly preferred over those from other methods, achieving an Elo rating of 1261—significantly higher than the next best system at 1099. Notably, System-2 planning improved performance by 27% over System-1, demonstrating that internal trial-and-error reasoning produces better plans than direct generation.

The critic model independently excelled at detecting when goals are achieved, reaching 96.9% accuracy on in-domain data and maintaining 72.9% accuracy on out-of-domain planning tasks where it had never seen training examples.

What This Means

These results demonstrate that AI systems can learn robust world models from natural video at scale without requiring hand-crafted simulations or explicit reward signals. The language-based approach makes the system's reasoning transparent and allows it to leverage existing language model capabilities, while the two-mode design provides both speed and quality depending on task demands.

The strong performance on human preference evaluations indicates the system generates plans that people find genuinely useful, addressing a key gap where traditional benchmarks rely on noisy or incomplete ground-truth annotations. The System-2 reasoning capability represents a shift from simple imitation of demonstrations to active optimization, allowing the system to potentially exceed the quality of its training data.

Recommendations and Next Steps

For organizations developing AI assistants or robotic systems, VLWM provides a foundation that can be adapted to specific domains through fine-tuning. The dual-mode architecture should be deployed with System-1 for time-sensitive, straightforward tasks and System-2 when plan quality is critical or when tasks are complex and novel.

Development teams should focus on three areas: first, incorporating domain-specific constraints into the critic's cost function to enforce safety rules or task requirements; second, expanding the training data to cover additional domains where planning assistance is needed; third, developing interfaces that expose the model's interpretable reasoning to users, allowing them to understand and correct plans when needed.

Before production deployment, conduct targeted evaluations in your specific use case, as performance may vary across domains. Consider starting with lower-stakes applications where plan errors have minimal consequences.

Limitations and Confidence

The system's performance depends on the quality and coverage of training videos—domains poorly represented in the training data may see reduced accuracy. The critic model showed performance drops on tasks with action-only descriptions (no explicit world states), suggesting the full representation is important for reliable cost evaluation.

Ground-truth annotations in existing benchmarks were found to be of variable quality, making absolute performance numbers less meaningful than relative comparisons. The human evaluation involved five annotators on 550 comparisons; while inter-annotator agreement was substantial (72%), larger-scale validation would strengthen confidence in preference rankings.

The team has high confidence in the core findings: that large-scale video training enables effective world modeling, that language-based abstraction is viable for planning, and that System-2 reasoning consistently improves plan quality. Confidence is moderate on exact performance levels in new domains and on the optimal balance between System-1 and System-2 for specific applications—these will require domain-specific testing.

Delong Chen*, Théo Moutakanni*, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
Meta FAIR
* Joint first author

Abstract

Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievement and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented as a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we train in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
Correspondence: Delong Chen ([email protected]), Pascale Fung ([email protected])

1 Introduction


In this section, the authors establish that while world models have proven effective for low-level continuous control tasks like robotic manipulation and autonomous driving, learning world models for high-level task planning—where actions involve semantic and temporal abstraction—remains an open challenge that could unlock practical applications such as AI assistants in wearable devices and autonomous embodied agents. Existing approaches fall short: prompting-based methods using LLMs lack grounding in sensory experience, VLMs are trained for perception rather than action-conditioned world-state prediction, simulation-based learning cannot scale to diverse real-world activities, and pixel-based generative world models are computationally inefficient and capture task-irrelevant details. To address these limitations, the Vision Language World Model (VLWM) is introduced as a foundation model that uses natural language as an abstract world state representation, learning from massive uncurated video data to predict world evolution through language-based abstraction rather than raw pixels, enabling both reactive system-1 planning and reflective system-2 planning via cost minimization with a trained critic model.


Figure 1: Example of a VLWM action-state trajectory given a video observation and a goal. VLWM can either generate a plan using one roll-out (system-1), or search over multiple actions by inferring the new world states and minimizing a given cost function (system-2).

World models enable AI agents to optimize action plans internally instead of relying on exhaustive trial-and-error in real environments ([1, 2, 3]), showing strong performance in planning across low-level, continuous control tasks such as robotic control ([4, 5, 6, 7]) and autonomous driving ([8, 9]). However, learning world models for high-level task planning – where actions involve semantic and temporal abstraction ([10, 11]) – remains an open challenge. Bridging this gap could unlock a wide range of practical applications, such as AI agents in wearable devices assisting humans in complex tasks and embodied agents capable of autonomously pursuing long-horizon goals.
Existing approaches fall short of providing a high-level world model. Prompting-based practices ([12, 13, 14, 15]) are straightforward but inadequate, as LLMs are not directly grounded in sensory experience. VLMs are primarily trained for visual perception rather than action-conditioned prediction of world-state transitions. Meanwhile, learning from simulation environments ([16, 17]) cannot scale to diverse real-world activities. Existing world models learned from natural videos often rely on generative architectures (e.g., diffusion models) to generate future observations ([18, 19, 20]). Such a formulation is not only ill-posed due to partial observability and uncertainty, but also inefficient, capturing task-irrelevant details and imposing high computational costs for long-horizon roll-outs. These limitations highlight the need for world models that predict in abstract representation spaces, rather than raw pixels.
In this work, we propose to learn a world model that leverages natural language as its abstract world state representation. We introduce the Vision Language World Model (VLWM), which perceives the environment through visual observations and predicts world evolution using language-based abstraction (Figure 1). Language inherently provides semantic abstraction and is significantly more computationally efficient to generate than raw sensory observations. In comparison with the latent embeddings in Joint Embedding Predictive Architecture (JEPA)-based world models ([1, 21, 6]), language-based abstraction is intuitive, interpretable, and enables seamless integration with prior knowledge and the extensive engineering ecosystems developed for LLMs/VLMs. Compared to current LLM/VLM paradigms that primarily focus on perception ([22]), behavior cloning (SFT) ([23]), or reinforcement learning with verifiable rewards ([24]), we propose direct world modeling as an objective, based on massive, uncurated videos, i.e., reward-free offline data ([25]).

Figure 2: Overview of VLWM. (a) VLWM is a JEPA-style world model that predicts abstract representations of future world states, instead of generating noisy and high-volume raw observations. (b) Given video contexts, VLWM's prediction target is a structured textual representation of the unobserved future. It includes the goal and interleaved actions ($A$) and world state changes ($\Delta S$), all extracted automatically. (c) VLWM can infer possible goals from the context, and interpret them with the current initial state and the expected final state. It supports both fast reactive system-1 plan generation and reflective system-2 reasoning based on cost minimization.

An overview of the framework is shown in Figure 2. To construct training prediction targets, VLWM employs an efficient abstraction pipeline that first compresses raw video into a hierarchical Tree of Captions, then refines it into structured goal-plan descriptions using LLM-based Self-Refine ([26]). The model is trained to predict these abstractions, capturing the goal description, goal interpretation, actions ($A$), and world state changes ($\Delta S$), conditioned on visual context from past observations. From this, both a predictive world model $(S_t, A_t \rightarrow S_{t+1})$ and an action policy $(S_t \rightarrow A_{t+1})$ are learned. This enables straightforward plan generation via text completion, using the proposed action directly as the policy. We term this approach system-1 planning. However, the autoregressive nature of token decoding limits foresight and reflection, as each action decision becomes irreversible once made. Additionally, when training on large-scale, real-world video datasets, which typically contain imperfect demonstrations, the resulting policy will also clone the suboptimal behaviors present in the data.
To unleash the full potential of VLWM, we introduce a reflective system-2 "planning with reasoning" mode. In this mode, VLWM first generates multiple roll-outs based on candidate actions (either proposed by itself or externally provided) and predicts the resulting world states. We then search for the candidate action sequence that minimizes a scalar cost, which is evaluated by a critic module that assesses the desirability of candidate plans. This critic is a language model trained with a self-supervised objective: it learns to assign lower costs to valid progress toward the goal and higher costs to counterfactual or irrelevant actions, effectively measuring how closely each candidate plan aligns with the desired goal state. The process of optimizing the action plan by searching for a cost-minimizing candidate is a form of reasoning ([1]). It enables the agent to perform trial-and-error internally with its learned world model to obtain optimal action plans.
The VLWM is extensively trained on a large corpus of both web instruction videos and egocentric recordings, including COIN ([27]), CrossTask ([28]), YouCook2 ([29]), HowTo100M ([30]), Ego4D ([31]), EgoExo4D ([32]), and EPIC-KITCHENS-100 ([33]). Collectively, there are 180k videos spanning over 800 days of duration. We generate a Tree of Captions for each video, resulting in a total of 21M nodes of unique detailed video captions (2.7 trillion words). With iterative LLM Self-Refine, we extracted 1.2 million trajectories of goal-plan pairs, consisting of 5.7 million steps of actions and states. We also reformulate the text-only chain-of-thought reasoning paths in NaturalReasoning ([34]) into action-state trajectories, obtaining an additional 1.1 million goal-plan pairs.
Our evaluations cover both human ratings of plan preference and quantitative results on the Visual Planning for Assistance (VPA) benchmarks ([35, 36]), achieving relative gains of +20% in SR, +10% in mAcc, and +4% in mIoU. Based on human ratings with our proposed PlannerArena, the procedural plans generated by VLWM's system-2 mode are preferred over those from prompting-based methods. On the RoboVQA benchmark ([37]), VLWM achieves a 74.2 BLEU-1 score, outperforming strong VLM baselines. We further evaluate critic models on goal achievement detection, where our trained critic outperforms baseline semantic similarity models in both in-domain and OOD scenarios. It also establishes a state-of-the-art on the WorldPrediction procedural planning task with 45% accuracy. Models and data will be open-sourced.

2 Methodology


In this section, the authors address the challenge of enabling AI agents to perform high-level task planning by developing a Vision Language World Model (VLWM) that predicts future world states using natural language abstractions rather than raw pixels. The methodology involves a two-stage approach: first, videos are compressed into hierarchical Tree of Captions through feature clustering and detailed captioning, dramatically reducing data volume while preserving semantic information; second, structured goal-plan representations—including goal descriptions, interpretations, actions, and world state changes—are extracted from these captions using iterative LLM Self-Refine. The VLWM is trained to predict these structured futures given visual context, enabling both fast System-1 reactive planning through direct text completion and reflective System-2 planning that searches over multiple candidate action sequences by minimizing costs evaluated by a self-supervised critic model, ultimately allowing the agent to reason internally and select optimal plans without real-world trial-and-error.

We aim to train a world model that understands and predicts how actions affect physical world states, and to develop a framework for reasoning and planning in which the world model serves as the core component. Our approach builds on the agent architecture introduced by [1], where a reward-agnostic world model performs roll-outs given candidate action plans; the agent evaluates how closely each roll-out advances the current state toward the desired goal and selects the plan that minimizes this distance (i.e., the cost).
In the sections below, § 2.1 details how we extract structured language-based representations as future world state abstractions, including semantic compression techniques for efficiency and quality optimization strategies. Then, § 2.2 introduces how the critic is trained to evaluate cost in a self-supervised manner and explains the system-2 plan search based on cost minimization.

2.1 Vision-language World Modeling

Given a video, we aim to extract a structured language representation shown in Figure 2 (b), which consists of a goal (description and interpretation) and a procedural plan (action-state sequence). For such a video-to-text extraction task, one straightforward approach would be to provide a VLM with the full video and prompt it to extract the language representations. However, an impossible triangle arises: within a practical compute and memory budget, it is not feasible to simultaneously achieve 1) high spatial resolution for fine-grained perception, 2) long temporal horizon that spans many procedural steps, and 3) the use of a large and smart VLM that can follow complex instructions.
To address this challenge, we propose a two-stage strategy. First, the input video is compressed into a dense Tree of Captions, which significantly reduces the data volume while preserving essential semantic information (§ 2.1.1). Then, structured goal-plan representations are extracted from these captions with LLMs. Because the second stage operates purely on text, it enables efficient processing with large LLMs and allows for iterative quality refinement through Self-Refine (§ 2.1.2).

2.1.1 Compress Video into a Tree of Captions

Each Tree of Captions consists of a set of video captions generated independently from different local windows of a video, collectively forming a hierarchical tree structure. It aims to holistically capture both fine-grained local details and long-horizon global information ([38]). A key challenge lies in adaptively determining the tree structure, i.e., the arrangement of different levels of windows for caption generation. Ideally, each node or leaf should correspond to a coherent monosemantic unit ([39]), avoiding spans across semantic boundaries. Existing temporal action localization and segmentation models ([40]) are limited in their openness, as they rely on human annotations with closed-vocabulary action taxonomies and are typically trained on narrow video domains.
We propose to create the tree structure via hierarchical feature clustering. Specifically, let $X$ be an untrimmed video, and let its feature stream be represented as $Z = \phi(X) = [\mathbf{z}_1; \dots; \mathbf{z}_T] \in \mathbb{R}^{T \times d}$, where each $\mathbf{z}_t$ is a $d$-dimensional feature vector produced by a video encoder $\phi$. We segment the feature stream $Z$, and accordingly the underlying video $X$, using hierarchical agglomerative clustering ([41]). Starting from the finest granularity, treating each item $\mathbf{z}_t$ as an individual cluster, the algorithm iteratively merges adjacent segments with the smallest increase in within-segment feature variance (i.e., a measure of polysemanticity). This merging procedure continues until only a single root node remains, and the full trace gives a hierarchical structure in which each node corresponds to a segment of the video.
The choice of $\phi$ determines the behavior of the segmentation. In this paper, we adopt the Perception Encoder ([42]), a state-of-the-art model that excels at extracting scene and action information from videos. Once the hierarchical tree structure is constructed, we generate detailed captions for each video segment, excluding segments shorter than five seconds. We use PerceptionLM ([42]) for detailed video captioning. The resulting Tree of Captions achieves substantial compression: for instance, the 1.1 TB of video files in Ego4D ([31]) can be compressed to under 900 MB of caption files.
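The adjacency-constrained agglomerative segmentation described above can be sketched as follows. This is a minimal illustration assuming pre-extracted per-timestep features (e.g., from the Perception Encoder); the NumPy implementation and function names are ours, not the released pipeline.

```python
import numpy as np

def sse(feats):
    """Within-segment sum of squared deviations from the segment mean."""
    return float(((feats - feats.mean(axis=0)) ** 2).sum())

def tree_of_segments(Z):
    """Adjacency-constrained agglomerative clustering over a feature stream Z of shape (T, d).

    Returns all tree nodes as (start, end) index ranges, from the finest
    per-timestep leaves up to the single root covering the whole video.
    """
    segments = [(t, t + 1) for t in range(len(Z))]          # leaves: one feature vector each
    nodes = list(segments)
    while len(segments) > 1:
        # Merge cost of each adjacent pair = increase in within-segment variance (SSE).
        costs = [sse(Z[a0:b1]) - sse(Z[a0:a1]) - sse(Z[b0:b1])
                 for (a0, a1), (b0, b1) in zip(segments[:-1], segments[1:])]
        i = int(np.argmin(costs))                            # cheapest adjacent merge
        merged = (segments[i][0], segments[i + 1][1])
        segments[i:i + 2] = [merged]
        nodes.append(merged)
    return nodes

# Example: 60 feature vectors (e.g., one per second) of dimension 1024
Z = np.random.randn(60, 1024).astype(np.float32)
tree_nodes = tree_of_segments(Z)
```

Each returned node range would then be captioned with a video captioner such as PerceptionLM, skipping segments shorter than five seconds.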

2.1.2 Extract Plans with LLM Self-Refine

Given the compressed Tree of Captions extracted from the video, our next objective is to derive a structured textual representation that serves as the prediction target for VLWM. This representation includes the following four components:
  1. Goal description is a high-level summary of the overall achievement (e.g., "cook tomato and eggs"). In downstream applications, goal descriptions given by users are typically concise (e.g., a single sentence), omitting the fine-grained details that holistically define the final state. Therefore, explicit goal interpretations are required.
  2. Goal interpretation includes contextual explanations that outline both the initial and expected final world states. The initial state describes the current status of tools, materials, dependencies, etc., providing essential grounding for plan generation. The final state interprets the goal description concretely to facilitate cost evaluation in system-2 planning. For example: "To achieve the goal, the eggs need to be cooked and mixed with tomatoes, and the mixture should be seasoned appropriately. The eggs should be whisked thoroughly to achieve a uniform texture..."
  3. Action descriptions are the final outputs of the system that will be passed to downstream embodiments for execution or presented to users (e.g., "Preheat the skillet on the stove"). They must be clear, concise, and sufficiently informative to enable the receiver to understand and produce the intended world state transitions.
  4. World states are internal to the system and serve as intermediate representations for reasoning and plan search. They should be an information bottleneck: sufficiently capturing all task-relevant consequences of actions while containing minimal redundancy. For example: "This action prepares the skillet for cooking the eggs by increasing its temperature. The state of the skillet changes from cold to hot, making it ready for cooking. The oil used for preheating prevents the eggs from sticking to the skillet, ensuring they cook evenly..." See Appendix E.1 for more examples.
To ensure that the generated components meet these requirements, we adopt an iterative Self-Refine procedure ([26]), leveraging LLMs as optimizers ([43]). We begin by providing the LLM with detailed descriptions of the output requirements, examples of the expected format, and the formatted Tree of Captions as input to generate an initial draft. In each refinement iteration, the LLM first provides feedback on the current draft and then produces a revised version accordingly. This self-refinement process is repeated for a predefined number of iterations, progressively improving output quality.
To input the Tree of Captions to LLMs, we format it using a depth-first search (DFS) traversal order. This linearization aligns with the hierarchical structure of textual documents that LLMs are typically trained on and familiar with (e.g., Section 1 → 1.1 → 1.1.1 → 1.1.2 → ...). In this paper, we use Llama-4 Maverick for its efficient inference and support for extended context length. Notably, the Self-Refine methodology is not tailored to a specific LLM architecture. Below are some example feedback messages generated by Llama-4 Maverick during the Self-Refine process:
"Prepare the ingredients for Zucchini Curry." in the draft could be broken down into more specific actions like "Wash, peel, and chop the zucchini" and "Chop the onions and tomatoes."
The state change after sautéing the onions, ginger, garlic, and green chilies could include more details about how this step affects the overall flavor and texture of the curry.
The action of "Display the Zucchini Curry in a bowl" is more of a presentational step rather than a meaningful action that advances the task progress, so it should be removed from the steps.
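A compact sketch of this extraction loop is given below, assuming a generic `llm(prompt) -> str` callable (the paper uses Llama-4 Maverick) and a Tree of Captions stored as nested dictionaries; the prompt wording is illustrative only, not the authors' prompts.

```python
def linearize_dfs(node, prefix="1"):
    """Format a Tree of Captions as nested numbered sections (DFS order)."""
    lines = [f"Section {prefix}: {node['caption']}"]
    for i, child in enumerate(node.get("children", []), start=1):
        lines.extend(linearize_dfs(child, f"{prefix}.{i}"))
    return lines

def extract_plan(llm, tree, requirements, n_iters=2):
    """Draft a goal-plan representation, then iteratively critique and revise it (Self-Refine)."""
    captions = "\n".join(linearize_dfs(tree))
    draft = llm(f"{requirements}\n\nVideo captions:\n{captions}\n\nDraft the goal and plan:")
    for _ in range(n_iters):
        feedback = llm(f"{requirements}\n\nCaptions:\n{captions}\n\nDraft:\n{draft}\n\n"
                       "Give feedback on faithfulness, step granularity, and state changes:")
        draft = llm(f"{requirements}\n\nCaptions:\n{captions}\n\nDraft:\n{draft}\n\n"
                    f"Feedback:\n{feedback}\n\nRevise the draft to address the feedback:")
    return draft
```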

2.1.3 Training of Vision Language World Model

The training task of VLWM is defined in Equation 1. Here, the config acts as a system prompt. The context provides environmental information and can be visual, textual, or both. The VLWM is trained to predict the future, represented by 1) a goal description along with its interpretation (i.e., the initial and expected final states), and 2) a trajectory consisting of a sequence of action ($A$) and world state change ($\Delta S$) pairs. VLWM optimizes the cross-entropy loss for next-token prediction on the right-hand side of Equation 1:
$$[\texttt{config}, \texttt{context}] \xrightarrow{\text{VLWM}} [\texttt{goal}, \; \texttt{interpretation}, \; \underbrace{ \langle A_0, \Delta S_0 \rangle, \; \dots, \; \langle A_N, \Delta S_N \rangle }_{\texttt{trajectory}}\;].\tag{1}$$
This input-output formulation reflects three levels of world modeling: 1) contextual goal inference, i.e., predicting possible future achievements; 2) action anticipation, i.e., proposing possible next actions; and 3) action-conditioned world state dynamics prediction. Since actions and resulting state changes are generated in an interleaved, autoregressive manner, this enables straightforward System-1 Reactive Planning through direct text completion. Given the config, context, and goal description, VLWM interprets the goal and generates a sequence of action-state pairs until an <eos> token is reached. From a language modeling perspective, the world state descriptions act as internal chains of thought: they articulate the consequences of each action, allowing VLWM to track task progress and suggest appropriate next steps toward the goal. This planning mode is computationally efficient and well-suited for short-horizon, simple, and in-domain tasks.
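As a rough illustration of System-1 decoding, the sketch below assumes a `generate(prompt) -> str` callable wrapping a VLWM checkpoint and a simple "Action:/State:" serialization of the trajectory; both are assumptions for illustration rather than the released interface.

```python
import re

def system1_plan(generate, config, context, goal):
    """Reactive System-1 planning: a single roll-out by plain text completion."""
    prompt = f"{config}\n{context}\nGoal: {goal}\nInterpretation and plan:"
    completion = generate(prompt)                       # decodes until <eos>
    # Parse the interleaved action / world-state pairs out of the completion.
    actions = re.findall(r"Action:\s*(.+)", completion)
    states = re.findall(r"State:\s*(.+)", completion)
    return list(zip(actions, states))
```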
Due to the (video, text) → text formulation in Equation 1, a pretrained VLM can be used to initialize VLWM. This provides VLWM with strong visual perception, while also allowing it to inherit the language understanding and generation capabilities and commonsense knowledge of LLMs.

2.2 Planning with Reasoning

While the System-1 mode allows fast plan generation, it lacks foresight and cannot evaluate alternatives or revise suboptimal decisions. Once an action is emitted, it is fixed, preventing the model from reconsidering or correcting errors. This reactive behavior can lead to error accumulation, particularly in long-horizon or complex tasks. To address these limitations, we introduce System-2 Reflective Planning, where the world model is coupled with a critic module that evaluates the desirability of multiple predicted futures given the goal. This enables a reasoning process that searches for the optimal plan via cost minimization ([1]).

2.2.1 Learning the Critic from Self-supervision


Figure 3: System-2 planning of VLWM. (a): the critic is trained in a self-supervised manner, assigning lower cost to valid progress, while assigning higher cost for adding irrelevant distractors or shuffling the steps. (b): VLWM generates candidate action sequences and simulates their future state transitions. A critic evaluates the resulting state trajectories given the goal, and the planner selects the lowest-cost plan.

In world model-based planning, the cost function typically quantifies the distance between the world state resulting from a candidate plan and the desired goal state ([21, 6]). It gives an estimate of how well the current task progress aligns with the intended goal and expected final state. In JEPA world models, this can be measured directly by the L1 or L2 distance between fixed-dimensional embedding representations of world states. With VLWM, however, we must measure the semantic distance between language-based world state representations instead of calculating distances in token space.
Formally, given VLWM predictions as described in Equation 1, we aim to establish a distance function $\mathbf{critic}$ that evaluates costs $C_1 = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:1])$, $C_2 = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:2])$, $\dots$, up to $C_N = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:N])$. Ideally, the cost should be low when the predicted trajectory reflects meaningful progress toward the goal, and high when it deviates due to irrelevant or erroneous actions. To model this behavior, we train a language model in a self-supervised manner, enabling it to assess the semantic quality of predicted plans without requiring explicit annotations. As shown in Figure 3(a), we explore two types of self-supervised training signals for the critic:
  1. We construct training samples by starting from a base partial trajectory and appending either (i) valid next step(s) resulting from a coherent continuation of the task, or (ii) distractor step(s) sampled from an unrelated task. The critic independently predicts three cost scores, $C_\text{base}$, $C_\text{good}$, and $C_\text{bad}$, and the model is trained to satisfy the ranking constraint $C_\text{good} < C_\text{base} < C_\text{bad}$, encouraging the critic to distinguish meaningful progress from irrelevant or misleading continuations.
  2. We generate negative samples by randomly shuffling the steps in a base trajectory, producing a corrupted sequence with cost $C_\text{shuffled}$. The critic is then trained to enforce $C_\text{base} < C_\text{shuffled}$, ensuring sensitivity to procedural order and temporal coherence.
The critic is trained to minimize the following ranking loss with a fixed margin, supplemented by a cost-centering regularization term weighted by a small constant $\lambda$ ([44]). To construct training pairs $\langle C_\text{positive}, C_\text{negative} \rangle$, we iterate over the three pair types induced by the self-supervised signals described above: $\langle C_\text{good}, C_\text{base} \rangle$, $\langle C_\text{base}, C_\text{bad} \rangle$, and $\langle C_\text{base}, C_\text{shuffled} \rangle$.
$$\mathcal{L}_{\text{critic}} = \max\left(0, \ \texttt{margin} + C_\text{positive} - C_\text{negative}\right)^2 + \lambda \left(C_\text{positive}^2 + C_\text{negative}^2\right).\tag{2}$$
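A sketch of Equation 2 in PyTorch, under the assumption that the critic exposes a scalar cost per trajectory; batching and names are illustrative.

```python
import torch

def critic_ranking_loss(c_positive, c_negative, margin=1.0, lam=0.01):
    """Squared hinge ranking loss with cost-centering regularization (Equation 2).

    c_positive / c_negative are scalar costs for the preferred and dispreferred
    trajectories, e.g. (C_good, C_base), (C_base, C_bad), or (C_base, C_shuffled).
    """
    hinge = torch.clamp(margin + c_positive - c_negative, min=0.0) ** 2
    centering = lam * (c_positive ** 2 + c_negative ** 2)
    return (hinge + centering).mean()

# Example with a small batch of cost pairs
c_pos = torch.tensor([0.2, -0.1, 0.5])
c_neg = torch.tensor([1.3, 0.9, 0.4])
loss = critic_ranking_loss(c_pos, c_neg)    # backpropagated into the critic LM
```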
In addition to VLWM progress data, the critic formulation also supports supervision from external sources to enhance generalization. For example, preference-tuning datasets, comprising triplets of a query, a preferred (chosen) response, and a rejected response, can be directly leveraged. Similarly, since the critic aims to model semantic distance, it can benefit from triplet-based datasets designed for learning sentence embeddings. These sources provide additional positive/negative pairs that further augment the critic's training data.

2.2.2 System-2 Planning by Cost Minimization

System-2 planning involves the coordination of three components: the VLWM, the critic, and an actor. As illustrated in Figure 3(b), the actor proposes candidate action sequences, the VLWM simulates their effects, and the critic evaluates their costs. The final plan is selected by identifying the candidate sequence with the lowest predicted cost.
The actor can be instantiated either as the VLWM itself or as an external module (e.g., an LLM), particularly in cases where additional constraints on the action space or output format must be respected. The actor may vary the number of proposed candidates to control the search width, or generate partial plans to enable more efficient tree search. In addition to the cost evaluated by the critic, task-specific penalties or guard-rails can be incorporated into the cost function, allowing the planner to respect external constraints, safety rules, or domain-specific preferences.
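A minimal sketch of this coordination, treating the actor, the VLWM roll-out, and the critic as opaque callables; the signatures are assumed for illustration, not the released API.

```python
def system2_plan(actor, vlwm_rollout, critic, goal, context,
                 n_candidates=20, extra_cost=None):
    """Select the candidate plan whose simulated outcome minimizes the critic cost."""
    candidates = actor(goal, context, n_candidates)          # candidate action sequences
    best_plan, best_cost = None, float("inf")
    for actions in candidates:
        states = vlwm_rollout(context, goal, actions)         # predicted world-state changes
        cost = critic(goal, list(zip(actions, states)))       # scalar trajectory cost
        if extra_cost is not None:                            # optional guard-rails / penalties
            cost += extra_cost(actions, states)
        if cost < best_cost:
            best_plan, best_cost = actions, cost
    return best_plan, best_cost
```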

3 Experiments


In this section, the authors evaluate VLWM-8B and VLWM-critic-1B across multiple benchmarks to demonstrate their effectiveness in procedural planning and world modeling. On the Visual Planning for Assistance (VPA) benchmark, VLWM-8B achieves state-of-the-art results on both COIN and CrossTask datasets, outperforming existing models including the 70B VidAssist across most metrics. Human evaluation through PlannerArena reveals that VLWM System-2, which combines the world model with a cost-minimizing critic, attains the highest Elo score of 1261 and is consistently preferred over ground truth plans and leading multimodal LLMs. On RoboVQA, VLWM-8B achieves competitive performance with a BLEU-1 score of 74.2, ranking first among compared models. Intrinsic evaluations of the critic demonstrate its ability to detect goal achievement with 98.4% accuracy on in-distribution data and maintain strong performance on out-of-distribution tasks, validating the self-supervised training approach and establishing VLWM as an effective framework for vision-language world modeling and planning.


3.1 Implementation Details

3.1.1 VLWM-8B

Sources of Videos. As summarized in Table 1, the training videos for vision-language world modeling are sourced from two primary domains: 1) Web instruction videos: COIN ([27]), CrossTask ([28]), YouCook2 ([29]), and a subset of HowTo100M ([30]). These videos cover a diverse range of tasks and provide clean expert demonstrations. 2) Egocentric recordings: EPIC-KITCHENS-100 ([45]) and EgoExo4D ([32]). These videos feature continuous, uncut recordings in realistic wearable-agent scenarios. For all datasets, we collect videos from their training splits. While Ego4D ([31]) is available as a large-scale egocentric recording dataset, we excluded it from the training data to avoid potential overlap with benchmarks due to inconsistent train/val splitting.
Generation of Vision-language World Modeling Data. We use Perception Encoder PE-G14 ([42]) and PerceptionLM-3B ([22]) (320 × 320 spatial resolution, 32 frames per input; fits on a 32 GB V100) to generate the Tree of Captions. We sample up to 5 target windows per video according to the tree structure (the first 5 nodes in BFS traversal order), and use Llama-4 Maverick (a mixture of 128 experts, 17B activated and 400B total parameters, FP8 precision) to extract plans from each window given its sub-tree of captions and two rounds of Self-Refine. Speech transcripts for web videos and the expert commentary in EgoExo4D are provided along with the video captions to improve the LLM's video understanding during plan extraction. In addition to video-based extraction, we repurposed the NaturalReasoning ([34]) dataset for world modeling by replacing the Tree of Captions with its chains of thought. Action-state trajectories are extracted via LLM Self-Refine with similar prompts.

Table 1: Statistics of VLWM data. Vision-language world modeling data are extracted by generating a Tree of Captions from each video and performing iterative LLM Self-Refine. We combine six video sources and one text-only dataset.

Training Details. We use PerceptionLM-8B ([22]) to initialize our VLWM. The model is trained with a batch size of 128 and a maximum context length of 11.5k tokens. We uniformly sample 32 frames at 448$^2$ resolution for visual context inputs. With 12 nodes of 8 × H100 GPUs, training takes approximately 5 days.

3.1.2 VLWM-critic-1B

Data. We generate paired data according to § 2.2.1 from the vision-language world modeling data of HowTo100M and NaturalReasoning. We also include Tree of Captions data by sampling subtrees, using the root as the goal and the leaves as the trajectory. We further incorporate off-the-shelf preference modeling data to train the critic, where user queries are treated as goals and model responses are treated as actions: we derive $\langle C_\text{positive}, C_\text{negative} \rangle$ from "query" + "chosen" and "query" + "rejected" pairs. We include UltraFeedback ([46]), Orca DPO pairs ([47]), and Math-Step-DPO ([48]) as sources of preference data. Lastly, we incorporate training data for learning semantic similarity, converting $\langle \texttt{query}, \texttt{positive sentence}, \texttt{negative sentence} \rangle$ triplets by treating the query as the goal, the positive sentence as the positive action, and the negative sentence as the negative action. This type of data includes MS-MARCO ([49]), SQuAD ([50]), HotpotQA ([51]), NaturalQuestions ([52]), and FEVER ([53]).
Training Details. The critic model is initialized from Llama-3.2-1B and trained for one epoch with a batch size of 128 (2.7k steps) and a maximum context length of 1536 tokens, using a single node of 8 × H100 GPUs. For the hyper-parameters in Equation 2, we set $\lambda = 0.01$ and margin $= 1$.

3.2 Visual Planning for Assistance (VPA)

3.2.1 VPA Benchmarks

To verify that VLWM's large-scale pre-training yields practical gains in procedural planning, we adopt the Visual Planning for Assistance (VPA) benchmark ([35]). VPA measures how well a model can predict the next $T$ high-level steps of an ongoing activity given the video history and an explicit textual goal. We follow the standard evaluation horizons $T=3$ and $T=4$. Experiments are conducted on two widely used instructional-video corpora for procedural planning. COIN ([27]) contains 11,827 videos spanning 180 tasks, whereas CrossTask ([28]) comprises 2,750 videos across 18 tasks. We adhere to the official train/val/test splits so that results are directly comparable to prior work.
We benchmark VLWM against four state-of-the-art planners: DDN ([54]), LTA ([31]), VLaMP ([35]), and VidAssist ([36]), plus two frequency-based heuristics: Most-Probable (global action frequencies) and Most-Probable w/ Goal (task-conditioned frequencies). VLWM is fine-tuned on the VPA training splits of COIN and CrossTask using the same hyper-parameters as in pre-training. Following prior work, we report Success Rate (SR), Mean Accuracy (mAcc), and Mean IoU (mIoU) over the predicted step sequence, respectively measuring plan-level accuracy, step-level accuracy, and action proposal accuracy.
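For orientation, the sketch below shows one common way these three metrics are computed for a single example, assuming equal-length lists of step labels; the benchmark's official evaluation scripts remain the reference.

```python
def vpa_metrics(pred, gold):
    """Success Rate, mean Accuracy, and mean IoU for one predicted step sequence."""
    sr = float(pred == gold)                                    # exact match of the whole plan
    macc = sum(p == g for p, g in zip(pred, gold)) / len(gold)  # ordered per-step accuracy
    inter, union = len(set(pred) & set(gold)), len(set(pred) | set(gold))
    miou = inter / union                                        # order-agnostic action proposals
    return sr, macc, miou

print(vpa_metrics(["whisk eggs", "heat pan", "fry eggs"],
                  ["whisk eggs", "fry eggs", "heat pan"]))      # (0.0, 0.333..., 1.0)
```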
Table 2 confirms that VLWM sets a new state-of-the-art on the VPA benchmark. Across both COIN and CrossTask, and at both horizons $T=3$ and $T=4$, our model consistently outperforms existing baselines. Compared to VidAssist, which adopts a 70B LLM, our VLWM is much smaller (8B) while achieving superior results on 8/12 metrics. Averaged over the four settings, VLWM delivers absolute gains of +3.2% in SR, +3.9% in mAcc, and +2.9 points in mIoU.

Table 2: Visual Planning for Assistance performance comparison with our finetuned VLWM.

| Model | COIN (T=3) SR | mAcc | mIoU | COIN (T=4) SR | mAcc | mIoU | CrossTask (T=3) SR | mAcc | mIoU | CrossTask (T=4) SR | mAcc | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Most Probable | 1.6 | 4.3 | 6.8 | 1.6 | 8.2 | 15.3 | 1.7 | 6.1 | 9.9 | 1.3 | 5.5 | 13.9 |
| Most Probable w/ goal | 10.9 | 18.0 | 24.9 | 9.1 | 16.3 | 32.2 | 2.4 | 8.9 | 15.5 | 1.5 | 7.9 | 20.5 |
| DDN ([54]) | 10.1 | 22.3 | 32.2 | 7.0 | 21.0 | 37.3 | 6.8 | 25.8 | 35.2 | 3.6 | 24.1 | 37.0 |

3.2.2 Human Evaluation with PlannerArena


Figure 4: Illustration of the PlannerArena annotation interface.

Traditional benchmarks for embodied AI assistants that generate human-oriented plans are inadequate, as they rely on biased or low-quality ground truth data and fail to capture real-world performance and helpfulness to humans. To overcome this, we created PlannerArena, a human evaluation framework inspired by ChatbotArena ([55]). In this Arena/Elo-based system, human evaluators choose the better plan from those generated by different anonymous models; pairwise outcomes are converted to Elo scores and model win rates. This approach aligns closely with the actual use case of AI assistants, ensuring that the models we develop are not only theoretically sound but also practically valuable in the real world.
Our experimental setup includes three datasets (COIN, CrossTask, and EgoExo4D), on which we compare VLWM with a search over 20 plans guided by an 8B critic that minimizes the cost of the generated plan (VLWM System-2), and VLWM with an 8B critic that maximizes the cost, against leading multimodal LLMs and ground truth plans. Pairs are sampled uniformly across every possible battle configuration to obtain a balanced number of battles across models. Models start with an initial rating of 1000, and we use an Elo K-factor of 32 for the score updates after each battle. Five annotators participated in the PlannerArena evaluation, assessing a total of 550 battle pairs, with three annotators completing a fixed pilot run of 90 samples to calculate an inter-annotator agreement score. Additional details about PlannerArena can be found in the Appendix.
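The Elo bookkeeping follows the standard update rule with the stated initial rating of 1000 and K = 32; how ties are scored is our assumption (half a win) rather than a detail given in the text.

```python
def update_elo(ratings, model_a, model_b, a_wins, k=32, tie=False):
    """Update two models' Elo ratings after one PlannerArena battle."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = 0.5 if tie else (1.0 if a_wins else 0.0)
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

ratings = {"VLWM System-2": 1000.0, "Llama-4-Maverick": 1000.0}
update_elo(ratings, "VLWM System-2", "Llama-4-Maverick", a_wins=True)
```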

Table 3: PlannerArena results. Overall Elo scores of our finetuned VLWM with a cost-minimizing critic (VLWM System-2) and of VLWM with a cost-maximizing critic, compared to other multimodal LLMs and ground truth plans, as well as the win-rate percentage of each model on the three datasets (COIN, CrossTask, and EgoExo4D) used for PlannerArena. We highlight the best score in bold and underline the second best.

| Model | # Parameters | Overall Elo Score | Win Rate COIN (%) | Win Rate CrossTask (%) | Win Rate EgoExo4D (%) |
|---|---|---|---|---|---|
| VLWM System-2 | 8B VLWM + 1B critic | 1261 | 87.9 | 70.6 | 87.9 |
| Llama-4-Maverick | 400B | 1099 | 66.7 | 89.6 | 57.1 |
| VLWM System-1* | 8B VLWM | 992 | 34.3 | 37.0 | 50.0 |
We show the final Elo scores of the different models in Table 3, as well as the win rate of each model per dataset. VLWM System-2 has the highest Elo by a large margin at 1261, with Llama-4-Maverick being the second most preferred model at an Elo of 1099. Despite using a critic that maximizes cost, the plans generated by VLWM Cost-maximizing (992 Elo) are still generally preferred over the ground truth and over plans generated by Qwen2.5 and PerceptionLM, which struggle more to generate meaningful plans given a video context. Importantly, we see that the quality of the ground truth is poor overall and varies strongly across datasets. EgoExo4D has higher-quality annotations, where the ground truth plans yield the second-highest win rate at 69.5%, behind VLWM System-2 at 87.9%. However, in COIN and CrossTask, the ground truth plans are barely better than the worst-performing models, at 43.6% and 42.2% respectively, highlighting a major issue with current procedural planning datasets.

3.3 RoboVQA

Table 4: RoboVQA BLEU-1 comparison against VLWM. *: results from [56]. †: results from [57].

| Model | BLEU-1 |
|---|---|
| PerceptionLM-8B ([22]) | 14.2 |
| Qwen2-VL-7B* ([58]) | 33.2 |
| GPT-4V* | 32.2 |
| LLaVA-OV-7B* ([59]) | 38.1 |
To further assess VLWM’s capabilities in grounded high-level reasoning and planning, we evaluate it on the RoboVQA benchmark ([37]). RoboVQA challenges models to perform robotics-focused visual question answering in realistic, multi-embodiment settings, requiring understanding of complex visual scenes and executing coherent action sequences. This benchmark complements the procedural planning evaluations by testing VLWM’s ability to guide robotic agents effectively.
We follow the standard evaluation protocols of RoboVQA and compare VLWM’s performance using BLEU scores. We compare our model against state-of-the-art robotic LLMs: 3D-VLA-4B ([60]), RoboMamba-3B ([61]), PhysVLM-3B ([57]), RoboBrain-7B ([56]), ThinkVLA-3B and ThinkAct ([62]).
Table 4 demonstrates that VLWM achieves highly competitive performance on the RoboVQA benchmark. Despite not being specialized on robotic data like some of the top-performing models such as RoboBrain, VLWM attains strong BLEU scores across all n-gram levels, ranking within the top two models. Notably, VLWM achieves the highest BLEU-4 score of 55.6, surpassing RoboBrain’s 55.1, and closely follows it on BLEU-1 to BLEU-3. These results highlight VLWM’s robust generalization and its ability to effectively integrate visual and language information for grounded reasoning and planning in embodied settings.

3.4 Critic Evaluations

In this section, we conduct intrinsic evaluations of the critic model independently of VLWM-8B roll-outs to assess whether it exhibits the intended behavior.

3.4.1 Goal Achievement Detection

Task Definition. Given a goal and a trajectory composed of $N_\text{gold}$ steps of a reference plan that achieves the goal, followed by $N_\text{distractor}$ appended irrelevant steps, the task asks the critic model to independently evaluate the cost of every partial progress from the beginning, i.e., $C_1 = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:1])$, $C_2 = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:2])$, $\dots$, up to $C_{N_\text{gold}+N_\text{distractor}} = \mathbf{critic}(\texttt{goal}, \texttt{trajectory}[0:N_\text{gold}+N_\text{distractor}])$. Since the distance to the goal should be lowest after the $N_\text{gold}$ steps of the reference plan, we compute goal achievement detection accuracy according to whether $N_\text{gold} = \arg\min\,[C_1, \dots, C_{N_\text{gold}+N_\text{distractor}}]$.
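Concretely, the accuracy described above reduces to checking whether the argmin of the cost curve lands exactly at the end of the reference plan; a sketch assuming a `critic(goal, partial_trajectory) -> float` callable:

```python
def goal_achievement_detected(critic, goal, gold_steps, distractor_steps):
    """True if the cost minimum falls exactly at the end of the reference plan."""
    trajectory = gold_steps + distractor_steps
    costs = [critic(goal, trajectory[:i]) for i in range(1, len(trajectory) + 1)]
    detected = costs.index(min(costs)) + 1        # 1-indexed argmin over partial progress
    return detected == len(gold_steps)
```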
Datasets. We construct testing samples from two sources. 1) Vision-language World Modeling (VLWM): 4,410 action-state trajectories extracted with the Tree of Captions and Self-Refine pipeline. The goal field combines the goal description and goal interpretation. Since VLWM-critic-1B is trained on HowTo100M trajectories, we exclude HowTo100M and only sample data from other sources of instruction videos (COIN, CrossTask, YouCook2) and egocentric recordings (EgoExo4D, EPIC-KITCHENS-100). 2) Open Grounded Planning (OGP): [63] released a collection of planning datasets containing goal-plan pairs sourced from different domains. We only use their "robot" subsets, sourced from VirtualHome and SayCan, and the WikiHow subset, since plans in the tool-usage subset often contain too few steps. Unlike the VLWM data, trajectories in OGP contain only actions, and are OOD for both VLWM-critic-1B and the baseline models. There are 9,983 trajectories in the OGP data in total.
Main Results. We compare VLWM-critic-1B against Qwen3-Embedding and Qwen3-Reranker models ([64]) as baselines, which are state-of-the-art models for measuring semantic similarity. For these baselines, the cost is computed as $C = -\texttt{sim}\langle \texttt{goal}, \texttt{trajectory} \rangle$.
Results are shown in Table 5. Our VLWM-critic-1B outperforms the baselines on most subsets by a large margin. VLWM-critic-1B reaches 98.4% on VLWM-Instruct but a lower 92.7% on VLWM-Ego. This is probably caused by a domain gap: our critic is trained only on HowTo100M instruction videos, without seeing any egocentric recordings. On OGP, our critic shows a clear advantage over the best-performing baseline, Qwen3-Reranker-8B (72.9% vs 65.6%), but performs comparably with it on OGP-WikiHow (despite having 8× fewer parameters). Possible reasons for this smaller gap include data noise or potential overlap with Qwen3-Reranker's training data.
In Figure 5, we visualize the normalized cost curves predicted by different critic models. The visualization can be viewed as an "energy landscape", and the desired shape has its minimum cost at the 100% goal-achievement point. On VLWM data, VLWM-critic-1B yields a much cleaner landscape than the baselines. However, on the OGP datasets, the distribution becomes noisier. Beyond the domain gap and dataset noise discussed above, one potential reason for this degradation is that OGP provides action-only trajectories without any explicit world state descriptions, which makes cost evaluation harder.

Table 5: Goal achievement detection benchmark results. The VLWM-Instruct subset shares the same distribution as VLWM-critic-1B's HowTo100M training data. VLWM-Ego contains EgoExo4D and EPIC-KITCHENS-100 data, which is unseen by our critic. Open Grounded Planning (OGP) provides action-only trajectories and is OOD for our critic.

Figure 5: Cost curves estimated by different critic models. Each plot visualizes 3k cost curves on goal achievement detection trajectories, where each trajectory is composed of a reference gold plan (0%-100%) and distractor steps (100%-200%). Red dots mark cost-minimizing steps (detected goal achievement points). VLWM-Critic accurately detects goal completion around 100% plan length, while baselines show suboptimal or noisy behavior.

Table 6: Ablation of goal and trajectory representation. We ablate the goal interpretation or the world state change descriptions from VLWM-critic-1B's input. Ablating either leads to a consistent performance reduction across all subsets, and the drop is more significant on the Ego subset, showing the effectiveness of interpretations and states in facilitating generalization.

| Subset | Dataset | Default | w/o Interp. | w/o States |
|---|---|---|---|---|
| Instruct | COIN | 97.1 | 96.4 (-0.7) | 91.4 (-5.7) |
| Instruct | CrossTask | 98.8 | 98.5 (-0.3) | 92.9 (-5.9) |
| Instruct | YouCook2 | 99.2 | 99.1 (-0.1) | 94.5 (-4.7) |
| Ego | EgoExo4D | 95.2 | 94.0 (-1.2) | 82.2 (-13.0) |
Ablation Studies. Table 6 provides an ablation of the critic input representation using VLWM-critic-1B and the VLWM data. We remove either the goal interpretation, which describes the current and expected final goal states, or the state descriptions from the trajectory representation, leaving actions only. Both ablations reduce goal achievement detection performance, and the reduction on unseen OOD data (the Ego subset) is more severe, showing the importance of interpretations and world state descriptions for effective generalization.

3.4.2 Procedural Planning on WorldPrediction-PP

Figure 6: WorldPrediction-PP results. Our VLWM-critic-1B established a new SoTA of 45.4% accuracy.

The WorldPrediction benchmark ([11]) is designed to evaluate high-level world modeling and procedural planning capabilities. Its procedural planning subset, WorldPrediction-PP, comprises 570 human-verified samples. Each test case provides initial and final visual states alongside four candidate action plans, represented by video sequences. The task is to identify the correctly ordered sequence among shuffled counterfactual distractors, emphasizing goal-conditioned planning as well as models' understanding of semantic and temporal action order.
To evaluate our critic modules on WorldPrediction-PP, we followed the evaluation protocol for Socratic LLMs in [11]. Visual inputs were first converted into textual descriptions using captions generated by Qwen2.5-VL. Specifically, the two images depicting the initial and final states produced a goal description outlining the change in world state, and the video clips of candidate actions were similarly captioned. These textual inputs were provided directly to our VLWM-critic models to compute a cost for each candidate plan, and the option with the lowest predicted cost was selected. In Figure 6 (b), we compare our VLWM-critic models against baseline Socratic LLMs. Our models achieve a Pareto-optimal balance of model size and accuracy. Importantly, this evaluation constitutes a zero-shot scenario for VLWM-critic models, as neither the change-captioning-based goal descriptions nor the detailed video captions used as action steps were part of the training corpus.
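A minimal sketch of this selection rule, assuming a hypothetical `critic_cost` wrapper around the VLWM-critic model (the names are illustrative, not the released interface):

```python
from typing import Callable, Sequence

def choose_plan(
    critic_cost: Callable[[str, Sequence[str]], float],
    goal_description: str,                      # change caption from initial/final state images
    candidate_plans: Sequence[Sequence[str]],   # four candidates, each a list of clip captions
) -> int:
    """Return the index of the candidate plan with the lowest predicted cost."""
    costs = [critic_cost(goal_description, plan) for plan in candidate_plans]
    return costs.index(min(costs))
```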

4 Related Work

Show me a brief summary.

In this section, the authors position VLWM within the landscape of action planning and world modeling approaches. Action planning methods fall into three categories: imitation learning, which struggles with scarce or imperfect demonstrations particularly in procedural video tasks; reinforcement learning, which requires interactive environments with explicit rewards and doesn't scale to diverse domains; and planning with reward-agnostic world models, which learns from extensive offline data and optimizes plans through internal simulation by minimizing distance to goal states rather than predicting task-specific rewards. World modeling approaches similarly divide into generative models that reconstruct pixel-level observations but suffer from computational inefficiency and task-irrelevant details; JEPA models that predict in compact latent spaces but face training challenges and focus mainly on low-level control; and language-based world models that use natural language as an interpretable high-level abstraction. VLWM advances this last paradigm by learning directly from large-scale raw videos rather than relying on prompting existing LLMs or training in narrow domains.

4.1 Action Planning

Planning is the task of generating a sequence of actions that transitions the world from an initial state to a desired goal state. Our VLWM focuses on planning high-level actions, characterized by semantic and temporal abstraction ([10, 11]), as opposed to the low-level, high-frequency continuous actions in autonomous driving ([65]), robotics ([66]), and games ([67, 68]). Below, we compare existing methodologies for action planning.
Imitation learning (also known as behavior cloning) is effective when extensive expert demonstrations are available ([69, 70]). However, it becomes considerably more challenging when demonstrations are scarce or imperfect ([71, 72]). For procedural planning and VPA tasks based on instructional videos ([54, 35]), most approaches rely fundamentally on behavior cloning. Since the action annotations ([27, 28]) are confined to limited vocabularies, the ground-truth plans are frequently incomplete, making them not only suboptimal references for benchmarking (which motivates our PlannerArena in § 3.2.2), but also inadequate for imitation learning.
Reinforcement learning typically requires environments where agents can perform trial-and-error and receive explicit rewards. When environments support such interactions, reinforcement learning with verifiable rewards (RLVR) is highly effective ([73]). Although RL is well-suited to domains where constructing simulation environments is viable, scaling it to more diverse and complex domains is less feasible.
Planning with reward-agnostic world models. This approach exhibits superior generalization by learning from extensive, reward-free offline data ([25, 21, 6]). World models enable planning by simulating action outcomes internally and optimizing plans based on cost minimization. Unlike methods that predict task-specific rewards (i.e., model-based RL ([16])), here world models only predict future world states ([2]), and action plans are optimized by minimizing the distance between the predicted resulting state and the desired goal state ([1]). This allows inference-time scaling by conducting internal trial-and-error within the learned world model. Our VLWM's System-2 "planning with reasoning" follows this paradigm, and we show that it outperforms reactive System-1 behavior cloning.

4.2 World Modeling

World models aim to simulate environmental dynamics, enabling agents to optimize the plan without direct online interaction with the real environment. They have demonstrated success primarily in low-level control domains, such as autonomous driving ([9, 74, 75]) and robotics ([21]), where models predict fine-grained, continuous sensory data over short horizons. Below, we compare existing world modeling approaches.
Generative world models typically utilize powerful diffusion-based architectures to reconstruct future observations directly (e.g., in pixel space). Examples include Sora ([19]), Cosmos ([76]), Genie ([77, 78]), UniSim ([5]), and recent multimodal chain-of-thought reasoning (i.e., "thinking with images") models ([79]). While intuitive, generative models inherently suffer from computational inefficiency and from task-irrelevant details entangled in pixel-based representations, severely limiting their scalability for long-horizon planning. Although these models generate realistic visuals, they have shown limited success in planning tasks.
JEPA world models encode observations into compact abstract representations, with a predictor trained to forecast these latent states. JEPA models have proven beneficial in representation learning, demonstrated by I-JEPA ([80]), IWM ([81]), and V-JEPA ([82]), and have facilitated MPC-based planning, exemplified by DINO-WM ([21]), V-JEPA2 ([6]), and NWM ([83]). However, joint training of encoders and predictors poses challenges, notably the need for anti-collapse techniques such as EMA. Moreover, existing JEPA-based world models predominantly focus on low-level motion planning, and extending them to high-level action planning remains an open research challenge.
Language-based world models exploit natural language as a high-level abstraction interface, offering interpretability and computational advantages over pixel-based reconstruction. Prior work has explored prompting LLMs as world models ([12, 13, 14, 15]) or training language-based world models in narrow domains, such as web navigation ([84]), text games ([85, 86]), and embodied environments ([87]). In contrast, our VLWM approach explicitly learns a world model directly from large-scale raw video data.

5 Conclusion

Show me a brief summary.

In this section, the Vision Language World Model (VLWM) is presented as a foundation model that addresses the challenge of enabling AI systems to perform interpretable and efficient high-level planning by learning world dynamics directly in language space. The approach compresses raw videos into hierarchical Trees of Captions, which are refined into structured trajectories containing goals, actions, and world state changes, thereby bridging perception-driven vision-language models with reasoning-oriented language models. The system operates in dual modes: fast, reactive System-1 planning through direct policy decoding, and reflective System-2 planning via cost minimization guided by a self-supervised critic that enables internal trial-and-error reasoning. Trained on diverse instructional and egocentric videos, VLWM achieves state-of-the-art performance on Visual Planning for Assistance, PlannerArena human evaluations, and RoboVQA benchmarks while generating interpretable outputs. By predicting in abstract language representations rather than pixels, VLWM advances AI assistants beyond mere imitation toward reflective agents capable of robust, long-horizon decision-making.

In this work, we introduced the Vision Language World Model (VLWM), a foundation model that learns to represent and predict world dynamics directly in language space, enabling interpretable and efficient high-level planning. By compressing raw videos into hierarchical Trees of Captions and refining them into structured trajectories of goals, actions, and world state changes, VLWM bridges the gap between perception-driven VLMs and reasoning-oriented LLMs. Its dual-mode design supports both fast, reactive System-1 planning through direct policy decoding and reflective System-2 planning via cost minimization guided by a self-supervised critic, which allows the model to internally perform trial-and-error reasoning and select optimal plans. Trained on a large and diverse corpus of instructional and egocentric videos, VLWM establishes new state-of-the-art results on the Visual Planning for Assistance benchmark, demonstrates superior plan quality in PlannerArena human preference evaluations, and achieves top-tier performance on RoboVQA, all while producing interpretable action-state rollouts. Furthermore, the critic model independently excels in goal achievement detection and procedural planning benchmarks, highlighting the value of explicit semantic cost modeling for world-model-based reasoning. Taken together, these contributions show that by learning directly from large-scale natural videos and predicting in abstract, non-generative representation spaces rather than raw pixels, VLWM provides a powerful interface bridging perception, reasoning, and planning, pushing AI assistants beyond imitation toward reflective agents capable of robust, long-horizon decision making.

Appendix

A PlannerArena Details

A.1 Instructions & data

To evaluate model-generated plans, we conducted a controlled human evaluation study using a custom-built Streamlit application. Annotators were presented with (i) a short video context, (ii) a textual goal (e.g., Make a fish curry), and (iii) two alternative plans generated by different anonymous models. The task is to select the preferred plan for achieving the goal given the provided video context. The instruction shown to annotators is:

PlannerArena Instruction

You will see a short video that sets the context, then see a goal sentence. Two alternative plans (Plan A and Plan B) generated by a model are shown. Your job is to select the plan you would prefer to follow in order to achieve the stated goal within the given video context.
The evaluation setup is based on three datasets commonly used for procedural video planning and understanding: COIN, CrossTask, and EgoExo4D. For all datasets, the video context given to the annotators is the entire original video truncated right before the start of the first annotated step, in order to prevent models from leveraging future visual information in their plans. This mirrors the Visual Planning for Assistance (VPA) setup, but is used here to evaluate human plan preference. For EgoExo4D, the exocentric point of view is given as video context to avoid partial-observation problems.
We generate candidate plans from the other VLMs via zero-shot prompting; all models are provided with the same video context and are prompted with the following template:
---

You are provided with a context segment of a procedural video about {goal_formatted}. Generate the remaining actions (steps) to take from that context segment in order to reach the goal. The plan should be composed of high-level descriptions starting with a verb, and it should be clear and concise, including all essential information. There is no need to be overly descriptive. Generate only the action steps.

---

A.2 Pairs sampling & IAA

Unlike ChatbotArena, which relies on an Elo-based sampling method to balance the evaluation across a large number of models, we adopt a uniform sampling strategy since we only have six models to compare. Specifically, we first sample an equal number of battle pairs from each dataset, then enforce balanced participation across models such that each model competes equally against the others within each dataset. A "setup" is defined as a (dataset, model pair) combination, and each setup is represented equally in the sample pool, yielding 3,500 unique battle setups for PlannerArena.
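As a rough illustration of this balanced setup construction (a sketch only; the dataset and model names are placeholders, and the actual pipeline also balances which videos are shown):

```python
import itertools
import random

def build_battle_setups(datasets, models, n_per_setup):
    """Enumerate every (dataset, model-pair) setup and sample battles uniformly per setup."""
    setups = []
    for dataset in datasets:
        for model_a, model_b in itertools.combinations(models, 2):
            setups.extend((dataset, model_a, model_b) for _ in range(n_per_setup))
    random.shuffle(setups)  # randomize presentation order for annotators
    return setups

# Example: 3 datasets x C(6, 2) = 15 model pairs -> 45 setups, each replicated n_per_setup times.
battles = build_battle_setups(
    datasets=["COIN", "CrossTask", "EgoExo4D"],
    models=["model_1", "model_2", "model_3", "model_4", "model_5", "model_6"],
    n_per_setup=10,
)
```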
Five annotators participated in the study. Prior to annotation, they completed a short warm-up of five solved examples to familiarize themselves with the task. Inter-annotator agreement was computed over a shared subset of 100 samples with three annotators: Fleiss' kappa was 0.63, indicating substantial agreement, with a raw agreement of 72.22%.
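For reference, a small self-contained sketch of how Fleiss' kappa can be computed from per-item vote counts (two categories here, "Plan A" vs. "Plan B"; illustrative only, not the exact script used):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_items, n_categories) votes; every item must have the same number of raters."""
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of rater pairs that agree on that item.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                       # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()  # category proportions
    p_e = (p_j ** 2).sum()                   # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items, 3 raters each, two choices (A, B).
votes = np.array([[3, 0], [2, 1], [0, 3], [3, 0]])
print(round(fleiss_kappa(votes), 3))  # 0.625
```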

A.3 Example

Figure 7: PlannerArena interface. The sample shown here is from COIN; Plan A is from the ground-truth annotations and Plan B from Llama 4.

B Prompts

B.1 Meta Prompt for LLM Self-Refine

---

{TREE OF CAPTIONS}
{ADDITIONAL VIDEO INFO}

## Draft
Here is a draft for structured data extraction:
{PREVIOUS DRAFT}

'''yaml
discussion: |-
  Free form chain-of-thought reasoning: analyze the draft, identify problems, and suggest actionable revisions or enrichments.
plan:
  - action:
    state: |-
    start: xx.xx # float between <min_start> and <max_end>, rounded to two decimal digits
    end: xx.xx # float between <min_start> and <max_end>, rounded to two decimal digits
  - action:
    state: |-
    ...
goal:
interpretation: |-
'''

Start your response with "'''yaml\n..." and end with "\n'''"

B.2 Requirements of Plan Extraction for LLM Self-Refine

---

**Action Plan**
1. Identify a sequence of physical actions that meaningfully advance the task progress; omit vague, redundant, or purely presentational steps.
2. Each action is one informative imperative sentence said from the actor's perspective. Avoid describing actions from the tutor's or demonstrator's voice.
3. Infer the span of each action according to the provided timestamps. Spans must fall within <min_start> and <max_end> and must not overlap with each other.
4. Be selective - time in the video may be non-linear. For example, the final result may appear at the beginning of the video. Such actions should be skipped.

**World State**
1. Explain how the action is performed according to the provided captions. Use imperative voice and an instructional or tutorial style.
2. Provide an elaborated discussion of the motivation, rationale, and purpose behind the action.

C Tree-of-Captions Example

Figure 8: Structure of Tree of Captions (bottom) extracted from video (top). Each box is associated with a corresponding video caption.
Tree-of-Captions format, serialized by depth-first search (DFS):
## 0.00s -> 164.53s (duration: 164.5s)

The video features a view of a man repairing a bicycle tire and tube. The man is wearing black gloves, and there is a bicycle lift holding a blue bike in the background. In the background is another person wearing a gray shirt. A black tool chest and a wooden tool bench can also be seen ...

### D. Segment 1 - 0.00s -> 126.20s (duration: 126.2s)

This video features a man showing a second man how to repair a tire. The second man stands center screen in a workshop. The man holding the tire is center screen and behind him is a large workbench. In the background are several other objects of various sizes and shapes ...
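A hedged sketch of how such a caption tree could be serialized depth-first into this Markdown-style format (the node structure and heading scheme below are assumptions for illustration, not the paper's exact implementation):

```python
from dataclasses import dataclass, field

@dataclass
class CaptionNode:
    start: float                 # segment start time in seconds
    end: float                   # segment end time in seconds
    caption: str                 # caption text for this segment
    children: list["CaptionNode"] = field(default_factory=list)

def serialize_dfs(node: CaptionNode, depth: int = 0) -> str:
    """Depth-first traversal: emit a heading per node; deeper nodes get more '#'s."""
    heading = "#" * (depth + 2)  # root at '##', its children at '###', etc.
    lines = [
        f"{heading} {node.start:.2f}s -> {node.end:.2f}s "
        f"(duration: {node.end - node.start:.1f}s)",
        node.caption,
        "",
    ]
    for child in node.children:
        lines.append(serialize_dfs(child, depth + 1))
    return "\n".join(lines)

# Toy usage with made-up captions:
root = CaptionNode(0.0, 164.53, "Video-level caption ...",
                   children=[CaptionNode(0.0, 126.20, "Segment-level caption ...")])
print(serialize_dfs(root))
```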

D VLWM Planning Examples

D.1 Full Trajectory

<!------------ VLWM Input ------------>
<GOAL> Cooking Tomato and Eggs </GOAL>
<!------------ VLWM Output ------------>
<INTERPRETATION> Now, the kitchen is set up with necessary ingredients and tools. To achieve the goal, the eggs need to be cooked and mixed with tomatoes, and the mixture should be seasoned appropriately. The eggs should be whisked thoroughly to achieve a uniform texture. The tomatoes should be heated until they are soft and well coated with the eggs. The dish should be seasoned with salt and pepper to taste. The final state of the dish should be a well-cooked, flavorful mixture of eggs and tomatoes, ready for serving. The kitchen environment and the tools used, such as the stove, pan, and utensils, are crucial for achieving this goal. The ingredients, including eggs, tomatoes, oil, salt, and pepper, are essential for creating the desired dish. </INTERPRETATION>
---
<ACTION>

D.2 Cost Min/maximizing Plans

##### Goal: Make Beef Tacos

Cost Minimizing Plan (cost = -4.86):
1. Saute chopped onions and garlic in oil
2. Add ground beef to the pan and break it up
3. Add seasoning to the beef mixture
4. Add tomato sauce to the beef mixture and stir
5. Prepare taco shells

References

[1] Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. 2022.
[2] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
[3] Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, et al. Embodied ai agents: Modeling the world. arXiv preprint arXiv:2506.22355, 2025.
[4] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[5] Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations, 2024.
[6] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
[7] Minting Pan, Yitao Zheng, Jiajian Li, Yunbo Wang, and Xiaokang Yang. Video-enhanced offline reinforcement learning: A model-based approach. arXiv preprint arXiv:2505.06482, 2025.
[8] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
[9] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024c.
[10] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
[11] Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, and Pascale Fung. Worldprediction: A benchmark for high-level world modeling and long-horizon procedural planning. arXiv preprint arXiv:2506.04363, 2025.
[12] Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023.
[13] Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment. Advances in Neural Information Processing Systems, 37:70148–70212, 2024.
[14] Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. Can language models serve as text-based world simulators? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–17, 2024b.
[15] Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559, 2024.
[16] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024. https://arxiv.org/abs/2301.04104.
[17] Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, and Xipeng Qiu. World modeling makes a better planner: Dual preference optimization for embodied task planning. CoRR, abs/2503.10480, 2025b. doi:10.48550/ARXIV.2503.10480. https://doi.org/10.48550/arXiv.2503.10480.
[18] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 1(2):6, 2023b.
[19] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. https://openai.com/research/video-generation-models-as-world-simulators.
[20] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025b.
[21] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.
[22] Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. Perceptionlm: Open-access data and models for detailed visual understanding. arXiv preprint arXiv:2504.13180, 2025.
[23] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
[24] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[25] Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim GJ Rudner, and Yann LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models. arXiv preprint arXiv:2502.14819, 2025.
[26] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
[27] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019.
[28] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3537–3545, 2019.
[29] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[30] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.
[31] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.
[32] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
[33] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
[34] Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, et al. Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions. arXiv preprint arXiv:2502.13124, 2025.
[35] Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, and Ruta Desai. Pretrained language models as visual planners for human assistance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15302–15314, 2023.
[36] Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, and Xitong Yang. Propose, assess, search: Harnessing llms for goal-oriented planning in instructional videos. In European Conference on Computer Vision, pages 436–452. Springer, 2024.
[37] Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024.
[38] Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, and Pascale Fung. What makes for good image captions? arXiv preprint arXiv:2405.00485, 2024a.
[39] Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. arXiv preprint arXiv:2402.14327, 2024b.
[40] Guodong Ding, Fadime Sener, and Angela Yao. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2):1011–1030, 2023.
[41] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.
[42] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025.
[43] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023a.
[44] Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering. arXiv preprint arXiv:2405.09999, 2024.
[45] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022.
[46] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback. arXiv preprint arXiv:2310.01377, 2023.
[47] Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/datasets/Open-Orca/OpenOrca, 2023.
[48] Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.
[49] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
[50] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[51] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
[52] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
[53] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.
[54] Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020.
[55] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024.
[56] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, and Shanghang Zhang. Robobrain: A unified brain model for robotic manipulation from abstract to concrete, 2025. https://arxiv.org/abs/2502.21257.
[57] Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, and Jinqiao Wang. Physvlm: Enabling visual language models to understand robotic physical reachability, 2025. https://arxiv.org/abs/2503.08481.
[58] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a.
[59] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024a. https://arxiv.org/abs/2408.03326.
[60] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model, 2024. https://arxiv.org/abs/2403.09631.
[61] Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[62] Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815, 2025.
[63] Shiguang Guo, Ziliang Deng, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Open grounded planning: Challenges and benchmark construction. arXiv preprint arXiv:2406.02903, 2024.
[64] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.
[65] Siyu Teng, Xuemin Hu, Peng Deng, Bai Li, Yuchen Li, Yunfeng Ai, Dongsheng Yang, Lingxi Li, Zhe Xuanyuan, Fenghua Zhu, et al. Motion planning for autonomous driving: The state of the art and future perspectives. IEEE Transactions on Intelligent Vehicles, 8(6):3692–3711, 2023.
[66] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
[67] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
[68] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[69] Faraz Torabi, Garrett Warnell, and Peter Stone. Recent advances in imitation learning from observation. arXiv preprint arXiv:1905.13566, 2019.
[70] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.
[71] Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pages 6818–6827. PMLR, 2019.
[72] Shahabedin Sagheb and Dylan P Losey. Counterfactual behavior cloning: Offline imitation learning from imperfect human demonstrations. arXiv preprint arXiv:2505.10760, 2025.
[73] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[74] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[75] Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024b.
[76] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025a.
[77] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
[78] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, and Tim Rocktäschel. Genie 2: A large-scale foundation world model. 2024. https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/.
[79] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.
[80] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
[81] Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning and leveraging world models in visual representation learning. arXiv preprint arXiv:2403.00504, 2024.
[82] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. Transactions on Machine Learning Research, 2024.
[83] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025.
[84] Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024.
[85] Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. arXiv preprint arXiv:2308.01399, 2023.
[86] Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning. arXiv preprint arXiv:2505.13934, 2025.
[87] Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, and Xipeng Qiu. World modeling makes a better planner: Dual preference optimization for embodied task planning. arXiv preprint arXiv:2503.10480, 2025a.