CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance


# Executive Summary

Purpose and context

This research addresses a critical gap in robot learning: current Vision-Language-Action (VLA) models struggle with complex, multi-step manipulation tasks because they lack explicit reasoning capabilities. While these models can learn from large datasets, they often fail when tasks require understanding *what* to manipulate, *how* to grasp it, *where* to place it, and *how* to move safely. The work introduces Chain-of-Affordance (CoA-VLA), a new approach that teaches robots to reason through these decisions step-by-step before acting, similar to how OpenAI's O1 model uses reasoning chains to solve complex problems in language tasks.

What was done

The team developed CoA-VLA by enhancing an existing robot foundation model (DiffusionVLA) with four types of structured reasoning called "affordances": object affordance (identifying what to manipulate and where it is), grasp affordance (determining the best grip point), spatial affordance (finding safe placement locations), and movement affordance (planning collision-free paths). These affordances are represented in two formats—textual descriptions and visual overlays on camera images—and integrated into the robot's decision-making process through a new visual-language co-injection module. To train the system efficiently, the researchers built an automated pipeline using AI tools (GPT-4o, SAM, Grounding DINO, CoTracker, RoboPoint) to generate large-scale affordance annotations from robot demonstration data, avoiding costly manual labeling.

Key findings

CoA-VLA substantially outperformed state-of-the-art robot models in both simulation and real-world tests. On seven real-world manipulation tasks using a Franka robot arm, CoA-VLA achieved an 85.5% average success rate in standard conditions, compared to 76.6% for the baseline DiffusionVLA and 54.9% for OpenVLA—despite using a smaller model and less training data. When tested under challenging visual conditions (distractors, varied lighting, cluttered backgrounds), CoA-VLA's advantage widened to 57.1% success versus 44.4% for DiffusionVLA and 22.2% for OpenVLA. In the LIBERO simulation benchmark across 40 tasks, CoA-VLA reached 79.8% average success, exceeding OpenVLA's 76.5%. The model demonstrated strong generalization to novel scenarios: it successfully identified free space on crowded plates, avoided unexpected obstacles during motion, and grasped objects in unfamiliar orientations not seen during training.

What the results mean

These results demonstrate that explicit affordance-based reasoning significantly improves robot reliability and adaptability, particularly in unstructured or changing environments. The performance gains translate to fewer task failures, reduced risk of collisions and damage, and better handling of real-world variability—all critical for deploying robots outside controlled laboratory settings. The success with visual generalization suggests the approach reduces the need for extensive retraining when environments change, potentially lowering deployment costs and accelerating robot system updates. The ability to reason about spatial constraints and movement paths also directly addresses safety concerns in human-shared spaces.

Recommendations and next steps

Organizations developing or deploying robot manipulation systems should consider integrating affordance-based reasoning into their VLA architectures, particularly for applications requiring robust performance in variable environments (warehouses, kitchens, medical facilities, field operations). For immediate adoption, focus on tasks where spatial reasoning and obstacle avoidance are critical. The automated affordance annotation pipeline should be scaled to additional robot datasets to expand the training base. Further development should investigate extending the affordance types to handle tool use, deformable objects, and multi-robot coordination. Before full production deployment, conduct pilot studies in target operational environments to validate the 6× inference speed improvement from dynamic affordance selection and confirm safety margins in human-robot interaction scenarios.

Limitations and confidence

The model still fails on extreme object orientations (e.g., hammers positioned horizontally), indicating limits to generalization from the training distribution. All real-world tests used a single robot platform (Franka arm) with specific camera configurations; performance on different hardware requires validation. The simulation results, while promising, reflect a domain with less environmental complexity than many real applications. Training data came primarily from the Droid dataset supplemented by 692 task-specific demonstrations; substantially different task domains may require additional data collection. The automated affordance annotation pipeline depends on the accuracy of third-party vision models, which may introduce errors. Confidence is high that the approach improves over baselines in the tested scenarios, but moderate regarding performance in significantly different environments or with different robot morphologies until further validation is completed.

Jinming Li^{1,*,⋆}, Yichen Zhu^{2,†,⋆}, Zhibin Tang^{⋆}, Junjie Wen^{3,⋆}, Minjie Zhu^{3,⋆}, Xiaoyu Liu^{1,⋆},
Chengmeng Li^{1,⋆}, Ran Cheng^{2}, Yaxin Peng^{1,†,⋆}, Yan Peng^{1}, Feifei Feng^{2}
^{1}Shanghai University, ^{2}Midea Group, ^{3}East China Normal University
  • * Co-first author, † Corresponding Author, ⋆ Core Contributor
**Figure 1:** This figure illustrates the overall framework of our CoA-VLA model, which empowers vision-language-action models with chain-of-thought reasoning capabilities for generalizable visuomotor policy learning. We achieve this by designing four distinct types of affordance and introducing a novel visual-text co-injection method to integrate this knowledge into the decision-making process.


Abstract


In this section, the authors address whether robot models can improve performance in complex, multi-task environments by incorporating reasoning chains similar to OpenAI's O1 model, which uses extensive reasoning to solve difficult problems. They introduce Chain-of-Affordance (CoA-VLA), a novel approach that scales robot foundation models by integrating sequential reasoning through four types of affordances: object affordance identifies what to manipulate and its location, grasp affordance determines the optimal grasping point, spatial affordance finds the best placement space, and movement affordance plans collision-free trajectories. Each affordance is represented in both visual and textual formats, integrated into the policy network through a vision-language co-injection module that provides essential contextual information during action inference. Experiments demonstrate that CoA-VLA outperforms state-of-the-art models like OpenVLA and Octo across various tasks while exhibiting strong generalization to unseen object poses, free space identification, and obstacle avoidance in novel environments.

Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI’s recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the form of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) object affordance — what object to manipulate and where it is; (2) grasp affordance — the specific object part to grasp; (3) spatial affordance — the optimal space to place the object; and (4) movement affordance — the collision-free path for movement. We further transform each affordance into two prompting formats: visual affordance and textual affordance. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision and robustness. Our experiments demonstrate that CoA-VLA outperforms state-of-the-art robot foundation models, including OpenVLA and Octo, on a variety of tasks. Furthermore, CoA-VLA exhibits strong generalization capabilities, including recognizing unseen object poses, identifying free space, and avoiding obstacles in novel environments.

1. Introduction


In this section, the authors address how recent Vision-Language-Action models have improved robot policy learning through internet-scale data but remain limited by their reliance on external LLMs or VLMs for high-level planning, preventing them from developing implicit reasoning capabilities. Inspired by OpenAI's O1 model, which demonstrated that extensive reasoning chains enhance performance on complex tasks, they propose Chain-of-Affordance (CoA-VLA) to enable self-driven reasoning in robotics. The approach introduces four sequential affordance types—object, grasp, spatial, and movement—that guide robots through progressive decision-making: identifying what to manipulate, how to grasp it, where to place it, and which collision-free path to follow. These affordances are presented in both text-based and image-based formats and integrated through a novel visual-language co-injection module. Experiments on simulated benchmarks and real-world tasks demonstrate that CoA-VLA outperforms state-of-the-art models while exhibiting strong generalization to unseen poses, obstacles, and novel environments.

Recent advancements in Vision-Language-Action (VLA) models have shown that training with internet-scale data can empower end-to-end policy learning models to outperform non-VLA models. However, current approaches often rely heavily on high-level planning or task decomposition by off-the-shelf large language models (LLMs) or vision-language models (VLMs), preventing the models from developing implicit reasoning on their own. OpenAI's recent O1 model has demonstrated that LLMs can improve performance on complex tasks through extensive reasoning chains. If this reasoning capability can be applied to control models, it could enhance their action robustness and generalizability. Yet self-driven reasoning in robotics remains under-explored, highlighting an important frontier for future research.
In this work, we propose Chain-of-Affordance, namely CoA-VLA, a novel perspective on generalizing model reasoning at test time and leveraging the generated reasoning to facilitate the policy learning process. Our model builds upon DiffusionVLA [1], a state-of-the-art VLA model that combines autoregressive and diffusion objectives. Our method leverages visual affordance in robot learning, conceptualizing the various actions and interactions with objects or the environment that a robot can perform based on visual context. We consider four types of affordance that are critical for robots to understand their observational surroundings, interact effectively with objects, and achieve tasks in dynamic environments:
  1. Object affordance. When a user provides vague instructions, the robot should be capable of identifying the target object for interaction and its location within the view.
  2. Grasp affordance. It involves the robot assessing the object's most appropriate points or surfaces to enable secure and stable handling.
  3. Spatial affordance. The robot needs to identify a set of coordinates that satisfy relationships described in language, such as free space for placement.
  4. Movement affordance. Identifying a trajectory for the robot to move without collision is crucial in the real world to avoid catastrophic damage.
These four affordances form a sequential chain, requiring the robot to possess prior knowledge at each step to advance to the next. Specifically, at test time, the robot must first understand "what to manipulate and where the object is located." Next, it determines "how to grasp the object" and, finally, "where to place it," all while following a trajectory that ensures safe task completion. During inference, the affordance chain is progressively generated as the action state evolves, avoiding unnecessary computational costs associated with outputting extensive language. We introduce two formats for chain-of-affordance reasoning: text-based and image-based chain-of-affordance prompting. Our model's architecture natively supports text prompting integration, but image prompting requires adaptation. To address this, we developed an image affordance injection module that seamlessly integrates visual affordances into the policy network. Once generated by the model, this affordance knowledge is reused in policy learning, enabling the model to produce robust and generalizable actions.
We conduct extensive experiments on both simulated benchmarks and real-world tasks. Specifically, on the LIBERO [2] benchmark, our proposed CoA-VLA outperforms several other state-of-the-art approaches, including the Diffusion Policy [3], Octo [4], and OpenVLA [5]. Due to the substantial sim-to-real gap, many existing studies also evaluate real robots. We set up seven real-world robot tasks, including long-horizon and challenging tasks such as serving tea, cleaning garbage, and wiping tables. We perform multi-task learning on these real-world tasks and demonstrate that the VLA model, enhanced by chain-of-affordance reasoning, successfully handles most of these complex tasks with a high average success rate. Furthermore, CoA-VLA exhibits strong generalization capabilities to handle unseen object poses, adjust motion when obstacles appear, and identify free space for placement.
The core contribution of our work lies in a novel framework that enhances vision-language-action (VLA) models by integrating affordance-aware reasoning. Central to this framework is a new module designed to strategically infuse affordance knowledge into policy learning, enabling robots to better ground actions in physical and contextual understanding. While our approach draws inspiration from prior studies on affordance, the synthesis of these concepts with VLA systems represents a significant advancement. Specifically, our framework uniquely combines these components to forge more robust, interpretable, and generalizable robotic policies, addressing critical limitations in existing VLA systems that often overlook the role of object affordances in decision-making.

2. Related Works


In this section, the authors position their work within two research threads: affordance in robotics and reasoning for language and control. Affordance in robotics refers to the functional and spatial properties of objects that guide manipulation, traditionally represented through part segmentation, keypoints, dense features, or predictions from vision-language models for grasping and placement tasks. Meanwhile, reasoning approaches like chain-of-thought prompting have empowered large language models to solve complex problems by breaking them into steps, and recent robotics research has applied this by using LLMs and VLMs as high-level planners that decompose tasks into sub-goals or movement instructions for low-level execution. The authors distinguish their contribution by introducing a unified affordance taxonomy with four types—object, grasp, spatial, and movement—represented in both textual and visual formats, integrated through a dynamic selection mechanism that adaptively prioritizes task-relevant affordances at each timestep, achieving computationally efficient reasoning with improved robustness to environmental ambiguities.

Affordance in robotics. In robot learning, the concept of affordance is interpreted in various ways. Typically, affordance is defined as the functions of an object, encompassing what the object is, how it can be manipulated, and its spatial relationship to the target. This concept extends beyond visual properties, linking observations directly to actions. The efficacy of affordance prediction has been shown by many learning-based manipulation methods for 6-DoF grasping [6, 7, 8] and stable object placement. Affordance can also be represented in many ways [9, 10, 11, 12, 13, 14, 15, 16, 17], such as part segmentation, dense image feature descriptors, and keypoints. Some methods leverage human videos to obtain affordances [18], while others use vision-language models (VLMs) to predict points representing spatial affordances [19, 20]. RT-Affordance [21] employs more descriptive affordance representations, and TraceVLA [22] incorporates visual traces as additional input to enhance VLA. Our approach introduces multiple types of affordances and formulates them as chain-of-thought reasoning to further improve VLA model performance.
Reasoning for language and control. Prompting large language models (LLMs) with "think step-by-step" [23] has significantly advanced their ability to solve complex tasks. Numerous approaches [24, 25] have since been developed to encourage deeper reasoning in LLMs, such as Tree-of-Thought [26] and Chain-of-Code [27], establishing this as a standard practice in language modeling. Recent research has leveraged LLMs and vision-language models (VLMs) [28, 29, 14, 30, 17, 31] as high-level planners in robotics, often using fine-tuned, open-source models or closed-source LLMs alongside policy networks for low-level task execution. These studies illustrate that detailed reasoning can enhance low-level control. ECoT [32] introduces a reasoning strategy for VLMs that includes task decomposition, subtask descriptions, fine-grained movement instructions, gripper positioning, and object tracking on the table. CoT-VLA [33] generates sub-goals for an autoregressive VLA to guide action prediction. Unlike existing approaches, our work introduces an affordance taxonomy (categorized into four types) to unify textual and visual affordance representations in robot learning. Coupled with a dynamic selection mechanism, this taxonomy enables adaptive policy training, where task-relevant affordances are prioritized at each timestep. The resultant framework achieves computationally efficient reasoning while retaining robustness to environmental ambiguities.

3. Preliminary on Vision-Language-Action Models


In this section, Vision-Language-Action models are introduced as the foundation for chain-of-affordance policies, where a pre-trained vision-language model is fine-tuned to predict robot actions conditioned on image observations, task instructions, and reasoning. VLAs fall into two categories: autoregressive models that treat actions as discrete tokens and predict them sequentially like language generation, and diffusion-based VLAs that use policy heads such as diffusion models or flow matching to output continuous actions. The work builds specifically on DiVLA, which combines the Qwen2-VL vision-language model with a diffusion head for action prediction. This preliminary establishes the baseline architecture that will be enhanced by incorporating affordance-based reasoning, enabling the robot to explicitly reason about object manipulation, grasping points, spatial relationships, and movement trajectories before executing actions, thereby improving generalization and robustness in complex manipulation tasks.

Our work builds upon Vision-Language-Action models (VLAs) as the backbone for our chain-of-affordance policies. VLAs employ a straightforward policy-learning approach: beginning with a pre-trained vision-language model, they fine-tune it to predict the next robot action $a$ based on the current image observation $I$, task instruction $T$, and reasoning $r$. There are two types of VLAs: autoregressive VLAs [34, 35, 22, 32, 5, 36, 37] and diffusion-based VLAs [38, 39, 40, 1]. The former uses discrete action tokens within the vision-language model’s vocabulary, enabling action generation similar to language modeling through next-token prediction. The latter leverages a policy head [15, 41, 42], such as a diffusion policy [38, 43] or flow matching [40], to output continuous robot actions.
In this work, we employ the recently released DiVLA [1] model, which integrates the Qwen2-VL [44] vision-language model with a diffusion model head for action prediction. In the following sections, we will discuss how we improve this VLA by enabling it to reason through robot affordances before selecting an action.
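
To make this interface concrete, the sketch below shows how such a model consumes an image observation, a task instruction, and optional reasoning text, and returns an action; the class and method names are assumptions for exposition and do not come from the DiVLA codebase, with the decoder stub standing in for either an autoregressive token head or a diffusion head.

```python
# Minimal sketch of the VLA interface: sample a ~ p(a | I, T, r).
# Names and shapes are illustrative, not the DiVLA API.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    image: np.ndarray    # H x W x 3 camera frame (I)
    instruction: str     # natural-language task description (T)
    reasoning: str = ""  # optional intermediate reasoning text (r)


class VisionLanguageActionModel:
    """Abstract VLA: a fine-tuned VLM backbone plus an action decoder."""

    def encode(self, obs: Observation) -> np.ndarray:
        # In practice this is the pretrained VLM (e.g., Qwen2-VL) producing a
        # multimodal embedding of (image, instruction, reasoning).
        raise NotImplementedError

    def decode_action(self, context: np.ndarray) -> np.ndarray:
        # Autoregressive VLAs emit discrete action tokens here; diffusion-based
        # VLAs instead denoise a continuous action chunk conditioned on context.
        raise NotImplementedError

    def predict(self, obs: Observation) -> np.ndarray:
        """Sample an action conditioned on observation, instruction, reasoning."""
        return self.decode_action(self.encode(obs))
```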

4. Methodology


In this section, the authors tackle the challenge of enhancing robot policy learning by introducing Chain-of-Affordance (CoA), a structured reasoning framework that guides action prediction through four sequential affordance types: object affordance (identifying what to manipulate and its location), grasp affordance (determining optimal grasping points), spatial affordance (finding suitable placement coordinates), and movement affordance (planning collision-free trajectories). These affordances are represented in two complementary formats—textual descriptions and visual overlays on observation images—which are integrated into the policy network through a novel visual-language co-injection module that fuses both modalities using Vision Transformer encoders, Transformer blocks, and FiLM conditioning layers. To avoid computational overhead, a dynamic affordance selection mechanism leverages proprioceptive data to adaptively choose only task-relevant affordances at each timestep. The methodology also includes an automated pipeline using GPT-4o, Grounding DINOv2, SAM, RoboPoint, and CoTracker to generate large-scale chain-of-affordance annotations, enabling scalable training without extensive human labeling.

This section introduces our approach to explicitly leveraging affordance as a foundation for the robot models. In Section 4.1, we formally define the concept of chain of affordance as a structured sequence of affordances, each representing an actionable insight. We provide detailed descriptions of four distinct types of affordances that together constitute this chain, explaining how each type contributes to understanding and executing complex tasks. In Section 4.2, we present two formats for representing the chain of affordances: a text format and an image format. We then discuss how these representations can be integrated into the policy learning process. Finally, in Section 4.3, we outline the pipeline used to automatically generate large-scale chain-of-affordance data.
**Figure 2:** **An example of the chain-of-affordance for the PourTea task.** The first row presents the text affordance and the second row shows the visual affordance. By employing a dynamic affordance selection mechanism, our method avoids generating redundant affordances at every timestep.


4.1 Definition of Chain-of-Affordance

Consider a dataset $\mathcal{D} = \{(\tau_1, g_1), \dots, (\tau_N, g_N)\}$ consisting of $N$ expert demonstrations, where each demonstration $\tau_i$ is paired with a task description $g_i$ in natural language. Each task description $g \in \mathcal{G}$ specifies a composition of multiple sub-tasks, and each demonstration $\tau_i$ is represented by a sequence of observations. We define $z \in \mathcal{Z}$ as the affordance-based reasoning in natural language that guides the task. The model decomposes $z$ into four components, $z = \{z_{obj}, z_{grasp}, z_{spat}, z_{move}\}$, where $z_{obj}$, $z_{grasp}$, $z_{spat}$, and $z_{move}$ represent object, grasp, spatial, and movement affordances, respectively. Our objective is to learn an intermediate language output $z: \mathcal{O} \times \mathcal{G} \rightarrow \mathcal{Z}$ that maps observations and task descriptions to affordance reasoning in natural language. This intermediate output provides specific guidance for action generation, enabling the generation of low-level actions $a \in \mathcal{A}$. Note that low-level actions are generated conditioned on the demonstration, task description, and affordance reasoning: $a \sim p(a \mid \tau, g, z)$.
Details for diverse robot affordance. In our approach, we model robot affordances in natural language as intermediate outputs. We give a detailed description of each type of affordance below.
Object affordance: Object affordance equips robots with the foundational ability to autonomously determine which object to interact with and where it is located, particularly in scenarios where user instructions lack explicit spatial or semantic details. This capability bridges the gap between ambiguous queries (e.g., "Pour the drink") and actionable execution by enabling the robot to: 1) Identify the target object through natural language grounding (e.g., resolving "drink" to "teapot"), 2) Localize the object within its environment using pixel-aligned bounding box predictions. In our framework, object affordance is operationalized through two tightly coupled components: 1) Semantic identification: Resolving object names from free-form language input, and 2) Spatial grounding: Predicting the object’s 2D location via visual scene understanding. By integrating these capabilities, the robot establishes a contextual foundation for downstream decision-making, ensuring interactions are both intention-aware (aligned with user goals) and environmentally grounded (physically feasible).
Grasp affordance: Grasp affordance encompasses the possible functions or ways an object can be manipulated. This affordance goes beyond visual characteristics, linking observations directly to actions, and is crucial for tasks requiring 6-DoF (degrees of freedom) grasping [13, 45, 9]. Prior work has demonstrated the effectiveness of affordance prediction for stable object handling and placement. Representations of grasp affordance vary, including part segmentation or keypoints. In our work, we use a set of 2D points to represent the grasping point for an object.
Spatial affordance: Spatial affordance centers on a robot’s ability to interpret and reason about spatial relationships within 3D environments, enabling tasks such as identifying collision-free regions for object placement or navigation. For instance, RoboPoint [20] locates free space, and SpatialVLM [19] predicts spatial relations quantitatively and qualitatively. In our framework, spatial affordance is operationalized as actionable destinations—discrete 2D coordinates representing feasible interaction zones.
Movement affordance: Movement affordance defines the trajectory a robot can follow during a task. This path may change depending on environmental factors, such as obstacles introduced along an intended trajectory. By modeling movement affordance, we provide the robot with adaptable paths for action, allowing it to respond dynamically to environmental changes and complete its task effectively. These affordances collectively enable the robot to understand and act upon various elements within its operational space, enhancing its interaction capabilities and responsiveness.
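
As a concrete illustration of how the four affordance types can be stored per timestep, the dataclass below is a hypothetical container: the field layout (a 2D bounding box, 2D grasp and placement points, and a pixel-space trajectory) is an assumption for exposition that mirrors the representations described above, not a released data schema.

```python
# Illustrative container for one chain-of-affordance record (hypothetical layout).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point2D = Tuple[float, float]


@dataclass
class ChainOfAffordance:
    # Object affordance: what to manipulate and where it is (2D bounding box).
    object_name: Optional[str] = None
    object_bbox: Optional[Tuple[float, float, float, float]] = None
    # Grasp affordance: the object part to grasp, as 2D grasp point(s).
    grasp_points: List[Point2D] = field(default_factory=list)
    # Spatial affordance: free-space coordinates suitable for placement.
    placement_points: List[Point2D] = field(default_factory=list)
    # Movement affordance: a collision-free path in image coordinates.
    trajectory: List[Point2D] = field(default_factory=list)
```
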
Dynamic affordance selection. Our proposed method formulates affordances as combined text and visual prompts. While this approach offers benefits, it introduces additional computation at test time, potentially slowing down the algorithm. We observe that it's unnecessary to utilize all available affordances during testing. For example, once an object is picked up, predicting its object affordance and the grasp affordance becomes redundant. The specific affordances required in a given sub-step depend on the task's progress and the environment. Therefore, we implement dynamic affordance selection, adaptively choosing the necessary affordances at both training and test times to reduce computational cost. Several methods can achieve this, such as gradient-based selection [46]. Our approach prioritizes simplicity by leveraging proprioception. Proprioception refers to information about the robot's state, including joint angles and other movement data. We transform this proprioceptive data into a single token and concatenate it with the visual token before feeding it into the large language model. Training on large-scale annotated datasets like Droid [47] enables our model to intelligently select relevant affordances at each time step. This is straightforward for the model to learn. For instance, if the proprioceptive state indicates a partially closed gripper and the wrist-mounted camera detects an object, the model can infer that the object and grasp affordances are likely unnecessary. Instead, it can focus on the movement affordance to guide the action trajectory and the spatial affordance to determine a suitable placement location. We found this strategy to be simple and useful, reducing the model's computational cost without hurting performance.
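
A minimal sketch of this mechanism is shown below, assuming a PyTorch backbone; the explicit selection head and all dimensions are illustrative, since in practice the selection behavior is learned by the model itself from large-scale annotated data rather than through a separate auxiliary head.

```python
# Sketch: compress proprioception into one token, concatenate it with the
# visual tokens, and expose a per-affordance relevance signal (illustrative).
import torch
import torch.nn as nn

AFFORDANCE_TYPES = ["object", "grasp", "spatial", "movement"]


class ProprioAffordanceSelector(nn.Module):
    def __init__(self, proprio_dim: int = 9, token_dim: int = 1024):
        super().__init__()
        # Compress robot state (joint angles, gripper width, ...) to one token.
        self.proprio_tokenizer = nn.Sequential(
            nn.Linear(proprio_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim)
        )
        # Predict which of the four affordances are needed at this timestep.
        self.selector = nn.Linear(token_dim, len(AFFORDANCE_TYPES))

    def forward(self, visual_tokens: torch.Tensor, proprio: torch.Tensor):
        # visual_tokens: (B, N, D), proprio: (B, proprio_dim)
        proprio_token = self.proprio_tokenizer(proprio).unsqueeze(1)   # (B, 1, D)
        tokens = torch.cat([visual_tokens, proprio_token], dim=1)      # fed to the VLM
        need = torch.sigmoid(self.selector(proprio_token.squeeze(1)))  # (B, 4)
        return tokens, {name: need[:, i] for i, name in enumerate(AFFORDANCE_TYPES)}
```
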
**Figure 3:** **Robot setup and examples for real-world manipulation tasks.** We evaluate seven real-world tasks on Franka robot arm equipped with two external Zed cameras and a Realsense 435i wrist camera.


4.2 Formatting of Visual-Textual Chain-of-Affordance

In this section, we present a multimodal chain-of-affordance framework that formalizes affordance reasoning through two complementary representations: (1) textual affordance, which provides a structured, language-driven representation of actionable possibilities (e.g., "graspable handle" or "navigable path"), and (2) visual affordance, which delivers a spatially grounded, scene-aware perspective via pixel-aligned cues. While the textual format enables explicit semantic reasoning, the visual modality enhances interpretability in visually dense or ambiguous environments by grounding affordances in observable scene geometry.
To unify these modalities, we introduce a visual-language co-injection module that dynamically aligns and integrates visual affordance prompts in a shared parameter space. This module bridges the gap between abstract language-based reasoning and pixel-level visual context, enabling the policy model to synergistically leverage both modalities for robust, context-aware action generation.
Textual affordance. Natural language serves as the predominant modality for representing visual affordances due to its semantic richness and compatibility with human-AI interaction. For instance, object affordances can be encoded as coordinate pairs (e.g., bounding boxes localizing a "graspable cup"), while spatial affordances are expressed as coordinate pairs defining navigable regions (e.g., "reachable shelf area"). Figure 2 (top) illustrates how diverse affordances (e.g., "movable," "stackable") are mapped to natural language descriptors. To avoid rigid linguistic templates that might bias policy learning, we employ ChatGPT to dynamically paraphrase affordance descriptions with varied syntactic structures and vocabulary (e.g., alternating between "placeable surface" and "flat area for stacking"). This strategy, analogous to data augmentation in NLP, enhances the model’s robustness to linguistic variability while preserving its conversational fluency.
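
The snippet below gives a minimal sketch of how a textual affordance string might be assembled, with template variation standing in for the ChatGPT-based paraphrasing; the templates and coordinate formatting are illustrative, not the exact prompts used in our pipeline.

```python
# Illustrative textual-affordance builder with varied phrasing (template pool
# stands in for LLM paraphrasing; all wording here is hypothetical).
import random

OBJECT_TEMPLATES = [
    "The target object is the {name}, located at bounding box {bbox}.",
    "Manipulate the {name}; it sits inside the region {bbox}.",
]
SPATIAL_TEMPLATES = [
    "A free area for placement is around {point}.",
    "The object can be placed on the clear spot near {point}.",
]


def textual_affordance(name, bbox, placement_point, seed=None):
    rng = random.Random(seed)
    lines = [
        rng.choice(OBJECT_TEMPLATES).format(name=name, bbox=bbox),
        rng.choice(SPATIAL_TEMPLATES).format(point=placement_point),
    ]
    return " ".join(lines)


# Example: "The target object is the teapot, located at bounding box (212, 96, 318, 205). ..."
print(textual_affordance("teapot", (212, 96, 318, 205), (150, 240)))
```
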
Visual affordance. Inspired by visual prompting methods that enhance model interpretability, we propose visual affordance augmentation—a technique that augments natural language affordance descriptions with direct, pixel-aligned visual cues. This approach encodes affordances by overlaying coordinate markers (e.g., bounding boxes, interaction points) or motion trajectories onto the robot’s historical observation frames (Figure 2, bottom). These annotations act as chain-of-affordance prompts, visually grounding the model’s understanding of actionable properties like "graspable" or "navigable" within the scene’s geometry. By embedding affordances directly into the visual input, we create an explicit structure that bridges the gap between abstract language and actionable visual context. For movement affordances, we employ thin, low-saliency trajectories to avoid overshadowing critical scene elements while guiding motion planning. Conversely, key interaction zones — such as grasping points, object bounding boxes, and spatial affordances — are rendered with thicker, high-contrast overlays (e.g., semi-transparent colors) to ensure visual salience without occluding environmental context. This hierarchical visual encoding distinguishes affordance types at a glance, enabling the model to prioritize task-relevant cues while maintaining computational efficiency.
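
A minimal rendering sketch is given below, assuming OpenCV for drawing; the specific colors, thicknesses, and blending weights are illustrative choices that follow the thin-trajectory / high-contrast-marker convention described above.

```python
# Overlay visual affordances onto an observation frame: a thin trajectory plus
# thicker, semi-transparent markers for boxes and interaction points.
import numpy as np
import cv2


def render_visual_affordance(frame, bbox=None, grasp_points=(), placement_points=(), trajectory=()):
    overlay = frame.copy()
    if bbox is not None:
        x1, y1, x2, y2 = map(int, bbox)
        cv2.rectangle(overlay, (x1, y1), (x2, y2), (0, 200, 255), thickness=3)
    for x, y in grasp_points:
        cv2.circle(overlay, (int(x), int(y)), 6, (0, 255, 0), thickness=-1)
    for x, y in placement_points:
        cv2.circle(overlay, (int(x), int(y)), 6, (255, 0, 0), thickness=-1)
    if len(trajectory) >= 2:
        pts = np.array(trajectory, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(overlay, [pts], isClosed=False, color=(180, 180, 180), thickness=1)
    # Blend so markers stay salient without fully occluding the scene.
    return cv2.addWeighted(overlay, 0.6, frame, 0.4, 0)
```
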
Visual-textual co-injection. To integrate visual and textual affordances into action generation, we design a unified embedding pipeline that seamlessly fuses both modalities into the diffusion policy. For textual affordances, we use the last embedding from the VLM and add an MLP layer to tokenize it. Concurrently, visual affordances are processed into patch tokens via a pretrained Vision Transformer (ViT-Small). The fused tokens are then processed by two standard Transformer blocks, balancing computational efficiency with sufficient expressive power for cross-modal reasoning. The encoder’s output embeddings are projected into the diffusion model using FiLM conditioning layers, inspired by MT-ACT [48]. This conditioning dynamically modulates the diffusion process by injecting affordance-aware features, enabling the policy to generate actions that respect both spatial constraints and semantic intent. By unifying affordance modalities early in the pipeline, our method retains the computational efficiency of standard diffusion frameworks while enhancing robustness. The FiLM-based conditioning acts as a bottleneck, distilling only the most salient affordance cues to guide action generation, thereby avoiding overfitting to redundant visual or linguistic signals.
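
The module below is a simplified PyTorch sketch of this co-injection design; the dimensions are illustrative, and the ViT-Small encoder is replaced by a plain patch embedding to keep the example self-contained.

```python
# Sketch of visual-textual co-injection: MLP-tokenized text embedding, patch
# tokens from the affordance image, two Transformer blocks, and FiLM (gamma,
# beta) modulation of the diffusion policy features. Dimensions are illustrative.
import torch
import torch.nn as nn


class CoInjection(nn.Module):
    def __init__(self, text_dim=1536, vit_dim=384, d_model=512, policy_dim=256, patch=16):
        super().__init__()
        self.text_mlp = nn.Sequential(nn.Linear(text_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        # Stand-in for ViT-Small patch tokens over the affordance-annotated image.
        self.patch_embed = nn.Conv2d(3, vit_dim, kernel_size=patch, stride=patch)
        self.vis_proj = nn.Linear(vit_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=2)
        # FiLM: per-channel scale and shift for the diffusion policy features.
        self.film = nn.Linear(d_model, 2 * policy_dim)

    def forward(self, text_embedding, affordance_image, policy_features):
        # text_embedding: (B, text_dim); affordance_image: (B, 3, H, W);
        # policy_features: (B, T, policy_dim) intermediate diffusion-head features.
        text_tok = self.text_mlp(text_embedding).unsqueeze(1)                # (B, 1, D)
        vis = self.patch_embed(affordance_image).flatten(2).transpose(1, 2)  # (B, N, vit_dim)
        vis_tok = self.vis_proj(vis)                                         # (B, N, D)
        fused = self.encoder(torch.cat([text_tok, vis_tok], dim=1))          # (B, N+1, D)
        gamma, beta = self.film(fused.mean(dim=1)).chunk(2, dim=-1)          # (B, policy_dim) each
        return gamma.unsqueeze(1) * policy_features + beta.unsqueeze(1)
```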

Table 1: Experimental results for multi-task learning. Our method achieved the best performance in both the in-distribution test setup and under visual changes.

Seven Tasks on Franka Robot Arm (In-Distribution)

| Model \ Task | CleanTrash | PourTea | NailHammer | PlaceBread | PlaceCar | WipeWater | HangCup | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Policy [3] | 4/11 | 0/11 | 8/11 | 3/11 | 9/11 | 2/11 | 7/11 | 33/77 (42.93%) |
| Octo [4] | 4/11 | 0/11 | 7/11 | 4/11 | 8/11 | 3/11 | 8/11 | 34/77 (44.13%) |

4.3 Generating Chain-of-Affordance Data

To prevent overfitting on affordance diversity, generating large-scale, high-quality data is crucial. While the standard approach typically relies on direct human labeling, this method is both costly and labor-intensive. Therefore, in this section, we outline a comprehensive pipeline for automatically generating diverse affordances using a range of tools to streamline the data generation process.
Our pipeline begins with GPT-4o [49], which generates a detailed description of the scene and identifies relevant entities from the language instructions. This allows us to create contextually rich, entity-specific affordances tailored to the task at hand. Using these entities, we leverage Grounding DINOv2 [50] and SAM [51, 52] to produce bounding boxes around each identified object within the scene. SAM initially provides masks, which we convert into bounding boxes, and we finalize these by computing the intersection over union (IoU) between the outputs of Grounding DINOv2 and SAM. This IoU-based refinement ensures that bounding boxes are accurately aligned with object contours. At this stage, we also capture the gripper’s position for subsequent affordance calculations. To represent spatial affordance, we integrate RoboPoint [20], a state-of-the-art model that predicts spatial affordances directly within the image. Additionally, we prompt GPT-4o to annotate spatial points based on the scene context. The spatial predictions from RoboPoint and GPT-4o are then combined, after which we cluster these points to form a coherent representation, eliminating any outliers to maintain accuracy. This process captures spatial affordances that are not only visually aligned but contextually relevant. For capturing movement trajectories, we employ CoTracker [53, 54], an advanced transformer-based tracking model. CoTracker enables us to follow the robot gripper's path, recording its movement through the scene and gathering essential trajectory data. This movement data provides insight into how the robot interacts dynamically with the environment, adding temporal dimensions to our affordance representation. The combination of spatial affordances, grasping points, and movement trajectories enables us to model a rich, multi-faceted affordance landscape for each scenario. This automated, tool-assisted pipeline produces a detailed and diverse set of affordance annotations, significantly reducing the need for manual labeling.
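
The sketch below outlines the orchestration of this pipeline; every helper passed into it is a hypothetical stand-in for the corresponding tool (GPT-4o, Grounding DINOv2, SAM, RoboPoint, CoTracker) rather than the tools' real APIs, and only the IoU-based fusion step is spelled out concretely.

```python
# High-level orchestration sketch of the automated annotation pipeline.
# All tool calls are injected as callables; none of this is a real tool API.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def annotate_episode(frames, instruction,
                     describe_scene, detect_boxes, segment_boxes,
                     propose_free_space, track_gripper) -> Dict:
    """Combine tool outputs into one chain-of-affordance record per episode."""
    entities = describe_scene(frames[0], instruction)            # GPT-4o stand-in
    dino_boxes = detect_boxes(frames[0], entities)               # Grounding DINOv2 stand-in
    sam_boxes = segment_boxes(frames[0], entities)               # SAM masks -> boxes stand-in
    # Keep detections only when the two sources agree spatially (IoU refinement).
    object_boxes = [d for d in dino_boxes
                    if any(iou(d, s) > 0.5 for s in sam_boxes)]
    spatial_points = propose_free_space(frames[0], instruction)  # RoboPoint + GPT-4o stand-in
    trajectory = track_gripper(frames)                           # CoTracker stand-in
    return {"objects": object_boxes, "spatial": spatial_points, "trajectory": trajectory}
```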

Table 2: Experimental results for LIBERO benchmark. We report the success rate and standard error for four task suites.

| Method / Task | LIBERO-Spatial (↑) | LIBERO-Object (↑) | LIBERO-Goal (↑) | LIBERO-Long (↑) | Average (↑) |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy [3] | 78.3 ± 1.1% | 92.5 ± 0.7% | 68.3 ± 1.2% | 50.5 ± 1.3% | 72.4 ± 0.7% |
| ScaleDP [55] | 79.1 ± 0.7% | 90.4 ± 0.9% | 73.6 ± 0.8% | 48.4 ± 1.2% | 72.9 ± 0.5% |
| Octo [4] | 78.9 ± 1.0% | 85.7 ± 0.9% | 84.6 ± 0.9% | 51.1 ± 1.3% | 75.1 ± 0.6% |

5. Experiments


In this section, the authors evaluate CoA-VLA's performance through real-world robot experiments and simulation benchmarks to determine whether integrating chain-of-affordance reasoning improves policy learning compared to baseline vision-language-action models. Using a Franka robot arm across seven challenging manipulation tasks, CoA-VLA achieves 85.54% success rate in-distribution and 57.14% under visual generalization conditions, substantially outperforming OpenVLA, Octo, and DiffusionVLA despite having a smaller model size and less pre-training data. On the LIBERO simulation benchmark spanning four task suites, CoA-VLA attains 79.8% average success rate, exceeding OpenVLA by 3.3%. Critically, the method demonstrates superior spatial reasoning by identifying free placement areas and robust obstacle avoidance through movement affordance integration, completing all tested scenarios where baselines largely failed. These results confirm that explicit affordance reasoning—spanning object, grasp, spatial, and movement dimensions—significantly enhances generalization, safety, and task completion in complex robotic manipulation.

The objective of our experimental evaluations is to assess the efficacy of CoA-VLA as a reasoning framework that augments the policy-learning capabilities of a baseline robot foundation model. Specifically, we seek to address the following questions: 1) Does the VLA model demonstrate improved performance with the integration of CoA in real-world experiments? 2) What is the performance of CoA-VLA in simulation benchmarks? 3) How effectively does CoA-VLA generalize to challenging scenarios? 4) How important is our proposed visual-textual affordance approach, compared to the vanilla VLA? We will give a detailed analysis in this section. Note that some experiments are presented in the Appendix due to limited space.

5.1 Evaluation on Real Robot

Experimental setup. We conduct evaluations using the 6-DoF Franka robot arm, a widely adopted setup for assessing generalizable robot policies. Our setup includes two third-person cameras (ZED cameras) positioned on either side of the robot arm, as well as an egocentric camera (Realsense D435i) mounted on the wrist. We design seven challenging tasks, such as pouring tea and cleaning the table, incorporating both short-horizon and long-horizon tasks to test the adaptability of our approach. We illustrate our robot setup and tasks with examples in Figure 3. Detailed descriptions of each task, the experimental setup, and our ablation experiments are provided in the Appendix.
Train data. We use the Droid dataset [47] as an external data source, filtering out samples without language annotations, leaving 39K trajectories. Our pipeline generates synthetic chain-of-affordance data for pre-training. The model is then post-trained on 692 trajectories across seven tasks.
Baseline. We compare CoA-VLA with Diffusion Policy [3], Octo [4], OpenVLA [5], TinyVLA [38], and DiffusionVLA [1]. The latter three methods are all vision-language-action models that achieve state-of-the-art performance in the real world. Notably, DiffusionVLA is the same model our approach is built upon, but trained with vanilla reasoning instead of our proposed chain-of-affordance. To ensure a fair comparison, all models are fine-tuned on the same dataset we use for training our approach. All models are trained for the same number of iterations, and the last checkpoint is used for evaluation, ensuring that no model is cherry-picked for comparison.
Real robot experimental results. The experimental results are presented in Table 1. We assess the performance using two different settings: the in-distribution setup and visual generalization testing. In the in-distribution setup, our method surpasses all SOTA robot foundation models in terms of average success rate. Notably, our method exceeds the performance of OpenVLA by 30.65%, despite having a significantly smaller model size and less pre-training data. Compared to our baseline model, which employs vanilla reasoning, our method achieves a 14.29% increase in accuracy. To assess visual generalization, we further evaluate these advanced models under varying visual conditions. As observations become more complex—such as through the addition of distractors or vibrant backgrounds—the performance gap between CoA-VLA and OpenVLA/DiffusionVLA widens. These findings highlight the critical role of embodied, task-specific chain-of-affordance reasoning in enhancing the performance of VLA models.

5.2 Evaluation on Simulation

In this section, we examine the performance of CoA-VLA on LIBERO [2]. LIBERO is a robot learning benchmark comprising over 130 language-conditioned manipulation tasks. We follow the setting in the OpenVLA [5] open-source code and test on four task suites: LIBERO-Spatial, LIBERO-Goal, LIBERO-Object, and LIBERO-Long. The detailed experimental setup is in the Appendix.
Simulation experimental results. We compare with the Diffusion Policy [3], ScaleDP [55], Octo [4], and OpenVLA [5]. The experimental results of LIBERO are presented in Table 2. Our findings indicate that CoA-VLA consistently achieves superior performance across all evaluated settings, securing the highest success rate among the methods tested. Specifically, CoA-VLA achieves an overall success rate of 79.8%, outperforming OpenVLA, the previous best-performing method, by a margin of 3.3%. This improvement demonstrates the effectiveness of our approach in simulation. Compared to pre-trained robotic models like Octo and OpenVLA, CoA-VLA demonstrates non-trivial gains, showing its effectiveness in leveraging prior knowledge while adapting to diverse task requirements. Additionally, when benchmarked against train-from-scratch methods such as Diffusion Policy and ScaleDP, CoA-VLA exhibits substantial improvements, underscoring the advantages of our approach.
**Figure 4:** **Spatial affordance for CoA-VLA.** CoA-VLA can identify free space for object placement.

**Figure 5:** **Movement generalization for CoA-VLA.** CoA-VLA can avoid obstacles and operate safely.


5.3 More Experiments

Spatial affordance enables safe placement. We evaluate the effectiveness of our spatial affordance approach in the PlaceBread task, as illustrated in Figure 4. In this task, the robot is presented with a plate on which three distinct objects are already placed, and it is instructed to add a piece of bread onto the plate. Our method successfully identifies open areas on the plate, allowing it to accurately position the bread without interference, thereby enabling CoA-VLA to complete all three task scenarios with precision. In contrast, alternative methods, such as OpenVLA and DiffusionVLA, succeed in only one scenario each, failing to generalize across all spatial configurations. This experiment underscores the crucial role of spatial affordance in enhancing the model’s ability to recognize and utilize available space, ultimately improving task completion accuracy and reliability across varied setups.
Obstacle avoidance. Collision avoidance is essential for safe and effective physical interactions, as improper maneuvers can lead to significant damage or even catastrophic outcomes. We evaluated CoA-VLA's obstacle avoidance capabilities in two specific tasks, illustrated in Figure 5. In the first task, we positioned a vase near the center of the task area to observe whether CoA-VLA could successfully navigate around the vase to retrieve a piece of paper trash. In the second task, we introduced a series of obstacles on a table, rearranging them in different configurations to assess the robot’s adaptability in maneuvering through varied layouts to complete the task. Our approach successfully completed all three scenarios, demonstrating robust collision avoidance and spatial adaptability. In contrast, OpenVLA failed to complete any of the tasks, while DiVLA succeeded in only one scenario. These results highlight the critical role of integrating movement affordance into the model’s reasoning process, enhancing its ability to navigate complex environments and complete tasks with precision and safety.

6. Conclusion


In this section, the authors establish that explicit reasoning is crucial for language models to manage complex tasks and introduce CoA-VLA, a reasoning-aware foundation model for robotics that centers on four interdependent affordances: object, grasp, spatial, and movement. The model structures these affordances as a chain where the robot sequentially identifies the target object and location, determines the appropriate grasp, decides on placement, and plans navigation accordingly. By representing this chain of affordances through intermediate language and image outputs that feed into the policy model, CoA-VLA demonstrates superior performance compared to baseline methods on real-world robotic tasks. The model exhibits strong generalization capabilities in complex environments, successfully handling challenges such as grasping objects in unfamiliar orientations, avoiding obstacles, and adapting to varied spatial configurations. This approach offers a novel perspective on designing reasoning chains to enhance embodied control in robotic systems.

Explicit reasoning is essential for language models to handle complex tasks. In this work, we design a reasoning-aware foundation model for robotics, focusing on four affordances: object, grasp, spatial, and movement. These affordances form an interdependent chain: the robot identifies the target object and location, determines how to grasp it, decides where to place it, and navigates accordingly. By structuring this chain of affordances as intermediate language and image outputs and feeding this reasoning into the policy model, our Chain-of-Affordance (CoA-VLA) model outperforms baselines on real-world robotic tasks. CoA-VLA also generalizes well to complex environments, tackling challenges like grasping in unfamiliar orientations, avoiding obstacles, and spatial generalization. Our approach provides a novel perspective on designing reasoning chains to enhance embodied control.

Acknowledgments


In this section, the authors acknowledge the financial support that enabled this research work. The project received funding from two primary sources in China: the National Science Foundation of China under grant number 12471501, and the Sci-Tech Innovation Initiative administered by the Science and Technology Commission of Shanghai Municipality under grant number 24ZR1419000. These funding mechanisms provided the necessary resources to develop and evaluate CoA-VLA, the Chain-of-Affordance Vision-Language-Action model for robotic manipulation tasks. The acknowledgment recognizes the critical role of institutional support in advancing research at the intersection of robotics, computer vision, and language models, particularly for conducting extensive real-world experiments with robotic hardware and developing novel reasoning frameworks that enhance robot foundation models' ability to generalize across diverse manipulation scenarios.

This work is supported by the National Science Foundation of China (12471501), and the Sci-Tech Innovation Initiative by the Science and Technology Commission of Shanghai Municipality (24ZR1419000).

Supplementary Material


Purpose and context

Vision-Language-Action (VLA) models enable robots to learn manipulation policies from visual observations and language instructions. Current VLA models often rely on external large language models for high-level planning, limiting their ability to develop internal reasoning capabilities. This work introduces Chain-of-Affordance (CoA-VLA), a method that teaches robot models to reason about affordances—the actionable properties of objects and environments—before executing actions. The goal is to improve task performance, generalization to new scenarios, and safety in complex manipulation tasks.

What was done

We developed CoA-VLA by extending an existing VLA model (DiffusionVLA) with structured reasoning about four types of affordances: object affordance (what to manipulate and where it is), grasp affordance (how to grip the object), spatial affordance (where to place it), and movement affordance (collision-free trajectories). The model generates these affordances in two formats—natural language descriptions and visual overlays on camera images—and integrates both into policy learning through a novel visual-textual co-injection module. To reduce computational cost, we implemented dynamic affordance selection, which adaptively chooses only the relevant affordances at each timestep based on the robot's proprioceptive state. We created training data by developing an automated pipeline that uses GPT-4o, object detection models (Grounding DINOv2, SAM), spatial reasoning models (RoboPoint), and motion tracking (CoTracker) to generate affordance annotations at scale.

Main findings

CoA-VLA substantially outperformed state-of-the-art robot models in both simulated and real-world experiments. On seven real-world manipulation tasks using a Franka robot arm, CoA-VLA achieved an 85.5% success rate in standard conditions, exceeding OpenVLA by 30.7 percentage points and the base DiffusionVLA model by 8.9 points. Under visually challenging conditions with distractors and varied lighting, CoA-VLA maintained a 57.1% success rate compared to 44.4% for DiffusionVLA and 22.2% for OpenVLA. In simulation benchmarks (LIBERO), CoA-VLA reached 79.8% average success rate across four task suites, outperforming OpenVLA by 3.3 points. The model demonstrated strong generalization: it successfully identified free space for object placement, avoided obstacles by planning collision-free paths, and grasped objects in previously unseen orientations. Ablation studies confirmed that both visual and textual affordances contribute to performance, with textual affordances having stronger influence, and that dynamic affordance selection maintains accuracy while increasing inference speed sixfold.

What the findings mean

These results show that explicit affordance reasoning significantly improves robot manipulation performance and robustness without requiring larger models or more pre-training data. The improvements are particularly pronounced in challenging scenarios involving visual clutter, obstacles, and ambiguous placement locations—situations where cost of failure is high and safety is critical. The model's ability to generalize to unseen object poses and dynamically avoid obstacles reduces the need for exhaustive training data covering every possible scenario, potentially lowering deployment costs and training time. The computational efficiency gain from dynamic affordance selection makes the approach practical for real-time robotic control at 6Hz compared to 1Hz when using all affordances indiscriminately.

Recommendations and next steps

Deploy CoA-VLA for manipulation tasks where spatial reasoning, obstacle avoidance, and adaptability to visual variation are critical, particularly in unstructured environments with changing object configurations. For new task domains, use the automated affordance annotation pipeline to generate training data efficiently rather than relying on manual labeling. Consider CoA-VLA when model size and inference speed are constraints, as it achieves strong performance with smaller model size than alternatives like OpenVLA. Before broader deployment, evaluate the model on additional long-horizon tasks and test failure modes when objects are positioned in extreme orientations (e.g., horizontally), as current results show limitations in these cases. Future work should explore extending the affordance taxonomy to additional manipulation primitives and testing on mobile manipulation platforms beyond fixed-arm setups.

Limitations and confidence

The real-world evaluation covered seven tasks with 692 training demonstrations plus external data; performance may vary on task types not represented in this set. The model struggles with object grasps when handles are oriented horizontally relative to the robot, indicating limits to pose generalization. Simulation results are based on the LIBERO benchmark's specific task distribution and may not fully predict performance in other simulated or real environments. The automated affordance annotation pipeline depends on the accuracy of underlying models (GPT-4o, SAM, etc.), which may introduce noise in training data. Confidence is high that CoA-VLA provides meaningful improvements over baseline VLA models in the tested scenarios, moderate confidence that similar gains will transfer to related manipulation tasks, and lower confidence for task domains substantially different from those evaluated (e.g., deformable object manipulation, bimanual tasks).

6.1. Video Demo

We provide a video recording in the supplementary material.

Table 3: Summarization for the number of demonstrations and average trajectory length for our real-world tasks.

| # | Task | # of Demonstrations | Average Trajectory Length |
| --- | --- | --- | --- |
| 1 | PlaceCar | 89 | 301.8 |
| 2 | PlaceBread | 102 | 113.2 |
| 3 | NailHammer | 80 | 182.8 |
**Figure 6:** **Visual Generalization.** We evaluate each method on multi-task learning and visual generalization, which includes handling additional distractors and interference from colored light. We also test the ability to grasp objects of the same type but with varying shapes, such as different teapots, as well as teapots in different orientations.

**Figure 7:** More detailed examples of successful Chain-of-Affordance.


6.2. Evaluation Tasks

In this section, we give a detailed description of the evaluated tasks that we discussed in Table 1. We provide the number of demonstrations for each task and the average trajectory length in Table 3.
  • PlaceCar. We randomly place the toy car on the right side of the drawer. The model is asked to pick up the toy car, put it into the drawer, and eventually close the drawer. This is a long-horizon task that requires multiple steps of action.
  • PlaceBread. The bread is randomly placed on the table. The model needs to pick it up and place it on an empty spot on the plate, avoiding the fruit already on the plate.
  • NailHammer. We evaluate the model's proficiency in utilizing tools effectively by assessing its ability to perform a sequence of precise actions with a hammer. The model must first identify the correct grasp point on the hammer, ensuring a stable and ergonomic grip suitable for controlled operation. It must then carefully pick up the hammer without causing it to topple or disturb its surroundings. Once the hammer is securely held, the model is tasked with driving a nail into a designated spot with precision.
  • PourTea. In this task, the robot is required to perform a sequence of actions involving a teacup and a teapot. First, the robot must place the teacup onto the tea tray. Next, it needs to pick up the teapot and pour tea into the teacup. Both the teacup and the teapot are randomly positioned within a defined range on the table. A key aspect of the task is the robot's ability to accurately grasp the teapot by its handle. To ensure consistency during data collection, the teapot's handle is always oriented facing the robot, simplifying the grasping process while still challenging the model's precision and manipulation skills.
  • CleanTrash. In this task, the robot is required to perform a sequence of actions to clean up trash on a table. The task has two distinct scenarios. In the first scenario, with no obstacles, the robot must identify and pick up the randomly placed trash, then deposit it into the trash bin. The trash items are distributed across the table in a random manner. In the second scenario, a flower pot is placed on the table as an obstacle. The robot must avoid colliding with the flower pot while picking up the trash and placing it into the trash bin. The trash's location remains random, and the robot must navigate carefully to avoid knocking over the flower pot during the cleanup process. A key aspect of this task is the robot's ability to accurately avoid the flower pot while maintaining efficiency in picking up and discarding the trash.
  • WiperWater. In this task, the robot is required to clean up water from a table using a sponge. The sponge is placed on the right side of the table, and the robot must pick it up and use it to wipe the water from the surface, moving from right to left. During this process, the robot must avoid any objects placed on the table, such as vases, cups, and boxes. A key challenge is the robot's ability to manipulate the sponge effectively while navigating around the obstacles without causing any collisions, ensuring that the entire table is cleaned efficiently. The robot's precision in both grasping the sponge and avoiding the table items is critical for completing the task successfully.
  • HangCup. In this task, the robot is required to pick up cups that are randomly scattered on the table and hang them on a cup rack. The robot must handle the cups carefully to avoid damaging them and ensure that the rack is not disturbed or knocked over during the process. The task challenges the robot's precision in both grasping the cups and placing them securely on the rack while maintaining stability in the environment. Successful completion relies on careful manipulation and accurate placement.
Setup for visual generalization. In this scenario, we evaluate the model's robustness and its ability to generalize visual perception across diverse and challenging environmental conditions. The robot is tasked with performing manipulation tasks while navigating visual complexities such as randomly placed distractors, varying lighting conditions, and a visually cluttered, colorful background. These challenges are designed to test the model's capability to stay focused on the primary task, effectively filter out irrelevant visual distractions, and adapt to dynamic and unpredictable visual environments. The objective is to ensure the robot can consistently and accurately identify and interact with target objects, even under significant deviations from typical operational settings.

6.3. Details for Real Robot Experiments

We train our method in a multi-task setting without relying on pre-trained weights from DiffusionVLA. Instead, we leverage our constructed dataset for pre-training. Specifically, we initialize the learning rate at 2e-5 and maintain a fixed learning rate throughout the pre-training phase, which spans 5 epochs. During this stage, the parameters of the pre-trained Vision-Language Model (VLM) are frozen, and LoRA is employed to fine-tune the model. For fine-tuning, we adopt a similar approach, starting with an initial learning rate of 2e-6. However, in this phase, we apply a cosine learning rate decay schedule and train the model for an additional 5 epochs. This training strategy ensures both effective adaptation and stability across pre-training and fine-tuning stages, optimizing the model for multi-task performance.
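For concreteness, the snippet below is a minimal sketch of this two-stage schedule: a frozen VLM backbone with LoRA-style trainable parameters, a fixed learning rate of 2e-5 for 5 pre-training epochs, then 2e-6 with cosine decay for 5 fine-tuning epochs. The toy module shapes, the additive adapter, and the optimizer choice (AdamW) are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the two-stage training schedule described above.
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

class ToyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.vlm_backbone = nn.Linear(512, 512)  # stands in for the frozen pre-trained VLM
        self.lora_adapter = nn.Linear(512, 512)  # stands in for the LoRA parameters
        self.action_head = nn.Linear(512, 7)     # e.g. a 7-DoF action output

    def forward(self, x):
        h = self.vlm_backbone(x) + self.lora_adapter(x)
        return self.action_head(h)

model = ToyPolicy()

# Stage 1: pre-training on the constructed dataset, VLM frozen, fixed LR 2e-5, 5 epochs.
for p in model.vlm_backbone.parameters():
    p.requires_grad = False
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
for epoch in range(5):
    ...  # iterate over the pre-training data; the learning rate stays constant

# Stage 2: fine-tuning, initial LR 2e-6 with cosine decay, another 5 epochs.
optimizer = torch.optim.AdamW(trainable, lr=2e-6)
scheduler = CosineAnnealingLR(optimizer, T_max=5)
for epoch in range(5):
    ...  # iterate over the fine-tuning data
    scheduler.step()
```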
For the baselines, we generally adopt a consistent training strategy. The vanilla OpenVLA implementation utilizes only a single camera view; to extend it, we incorporate all three camera views, feeding each view into the same visual encoder and concatenating their outputs for processing. We leverage OpenVLA's pre-trained weights and train for 20 epochs, as we observe that it typically requires longer training to achieve convergence. For the Diffusion Policy, we utilize DistilBERT to process language instructions, following an approach similar to YAY [56]. As for DiffusionVLA, we employ their pre-trained weights and construct a reasoning dataset using their data construction pipeline to maintain consistency with their methodology. To ensure fair evaluation, we use the final checkpoints of all models, including ours, avoiding any form of cherry-picking. This approach allows for a robust comparison and highlights the performance differences across the various models.
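As an illustration of the multi-view extension applied to the OpenVLA baseline, the sketch below feeds each camera view through a single shared encoder and concatenates the resulting features. The tiny encoder and its dimensions are placeholders, not OpenVLA's actual backbone or API.

```python
# Minimal sketch: encode every camera view with the *same* visual encoder,
# then concatenate the per-view features for downstream processing.
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # One shared encoder; in the real setup this would be the baseline's
        # pre-trained vision backbone rather than this tiny stand-in.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, views):
        # Same weights applied to each view; features are concatenated.
        feats = [self.encoder(v) for v in views]
        return torch.cat(feats, dim=-1)

# Example: three camera views at 224x224.
views = [torch.randn(1, 3, 224, 224) for _ in range(3)]
fused = MultiViewEncoder()(views)  # shape: (1, 3 * 256)
```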

Table 4: Ablation study on visual and textual affordances. Our experiments demonstrate that both affordances are important for VLA.

| Method / Task | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average |
|---|---|---|---|---|---|
| CoA-VLA | 85.3 ± 0.9% | 93.1 ± 1.0% | 85.8 ± 0.9% | 55.0 ± 1.2% | 79.8 ± 0.5% |
| w/o visual affordance | 84.3 ± 0.5% | 91.5 ± 0.7% | 83.9 ± 1.0% | 54.6 ± 1.2% | 78.6 ± 0.9% |
| w/o textual affordance | 81.6 ± 0.7% | 89.8 ± 0.9% | 80.1 ± 1.0% | 52.5 ± 0.9% | 76.0 ± 0.9% |

All values are success rates (↑).

Table 5: Ablation study on dynamic affordance selection. Removing dynamic affordance selection introduces redundant affordances into the learning process, causing the model to perform slightly worse than with selective affordances while running six times slower.

| Method / Task | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average | Inference Speed |
|---|---|---|---|---|---|---|
| CoA-VLA | 85.3 ± 0.9% | 93.1 ± 1.0% | 85.8 ± 0.9% | 55.0 ± 1.2% | 79.8 ± 0.5% | 6Hz |
| w/o dynamic affordance selection | 85.1 ± 0.9% | 92.4 ± 1.0% | 85.2 ± 1.0% | 55.2 ± 1.1% | 79.5 ± 1.0% | 1Hz |

Success-rate columns report mean ± std (↑).

6.4. Details for LIBERO Simulation

LIBERO is a robot learning benchmark comprising over 130 language-conditioned manipulation tasks. We follow the setup in OpenVLA's [5] open-sourced code and test on four task suites: LIBERO-Spatial, LIBERO-Goal, LIBERO-Object, and LIBERO-Long.
Each suite includes 10 distinct tasks with 50 demonstrations per task, and each emphasizes a unique challenge in imitation learning: LIBERO-Goal features tasks with similar object categories but different goals, LIBERO-Spatial requires policies to adapt to varying spatial arrangements of the same objects, and LIBERO-Object keeps the layout consistent while changing the objects. During experimentation, our method uses a static camera and a wrist-mounted camera; all methods are evaluated across 1500 trials in total. We filter out the failure data and increase the image resolution to 224 × 224. The affordance data is generated using our proposed pipeline applied to the LIBERO data. In Table 2, we directly cite the results of Diffusion Policy, Octo, and OpenVLA from OpenVLA's paper. To ensure all methods are evaluated fairly, we evaluate our method across 500 trials for each task suite and report the average success rate over three random seeds, using the same test data as OpenVLA. For the baseline ScaleDP, aside from using all camera views, all other implementation details are kept the same.
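A hedged sketch of this evaluation protocol is shown below: each suite is rolled out for a fixed number of trials per task and seed, and the reported number is the mean success rate over three seeds. The `run_episode` function and its fixed success probability are hypothetical placeholders for a real LIBERO rollout.

```python
# Illustrative aggregation of per-suite success rates over three random seeds,
# matching the mean ± std format used in Tables 4-5.
import random
import statistics

def run_episode(task: str, seed: int, trial: int) -> bool:
    # Placeholder "policy": a seeded coin flip standing in for a LIBERO rollout.
    rng = random.Random(hash((task, seed, trial)))
    return rng.random() < 0.8

def evaluate_suite(tasks, trials_per_task=50, seeds=(0, 1, 2)) -> str:
    per_seed = []
    for seed in seeds:
        successes = sum(
            run_episode(task, seed, trial)
            for task in tasks
            for trial in range(trials_per_task)
        )
        per_seed.append(100.0 * successes / (len(tasks) * trials_per_task))
    return f"{statistics.mean(per_seed):.1f} +/- {statistics.stdev(per_seed):.1f} %"

# 10 tasks x 50 trials = 500 trials per suite, averaged over three seeds.
print(evaluate_suite([f"task_{i}" for i in range(10)]))
```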

7. More Experiments

Show me a brief summary.

In this section, the authors investigate the individual contributions of visual and textual affordances, the computational efficiency of dynamic affordance selection, and the model's ability to generalize to unseen object orientations. Ablation studies on LIBERO demonstrate that both visual and textual affordances are critical for performance, with textual affordances showing stronger influence due to their capacity to encode task-specific semantics like "graspable" or "pour-able." Dynamic affordance selection proves essential: using all affordances indiscriminately introduces optimization noise and runs six times slower than the selective approach. Finally, evaluation on novel object poses reveals that CoA-VLA successfully grasps hammers and teapots in unfamiliar orientations not seen during training, significantly outperforming OpenVLA. All models still fail when objects are positioned horizontally, but the results indicate that grasp affordance substantially enhances generalization to varied spatial configurations despite these remaining limitations.

7.1. Ablation Study on Visual-Textual Affordance

Our primary contribution lies in the introduction of textual affordances and visual affordances, paired with a novel visual-textual co-injection module designed to synergistically integrate these modalities into policy learning. To validate their individual and combined efficacy, we conduct a systematic ablation study (Table 4) on the LIBERO robotic task benchmark. Our key finding is that both textual and visual affordances are critical to model performance. Removing either modality leads to significant degradation in task success rates. While both modalities contribute uniquely, textual affordances exhibit stronger influence on policy optimization. We hypothesize that this stems from language’s inherent capacity to encode task-specific semantics (e.g., "pour-able" or "graspable"), which provides clearer optimization signals compared to visual features that require implicit spatial grounding. These results underscore the importance of our co-injection module, which dynamically balances and fuses multimodal affordances to maximize policy robustness in diverse environments.
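To make the idea concrete, here is a minimal sketch of one way a visual-textual co-injection module could fuse the two affordance streams into the policy representation via cross-attention; the dimensions and the residual fusion scheme are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of cross-attention co-injection: policy tokens attend separately to
# textual and visual affordance tokens, and both results are added residually.
import torch
import torch.nn as nn

class CoInjection(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, policy_tokens, text_aff, vis_aff):
        # Policy tokens attend to the textual affordance tokens ...
        t, _ = self.attn_text(policy_tokens, text_aff, text_aff)
        # ... and to the visual affordance tokens; both are injected residually.
        v, _ = self.attn_vis(policy_tokens, vis_aff, vis_aff)
        return self.norm(policy_tokens + t + v)

policy_tokens = torch.randn(1, 16, 256)  # internal policy/action tokens
text_aff = torch.randn(1, 8, 256)        # encoded textual affordance (e.g. "graspable")
vis_aff = torch.randn(1, 64, 256)        # encoded visual affordance overlay features
fused = CoInjection()(policy_tokens, text_aff, vis_aff)  # shape: (1, 16, 256)
```

Keeping the policy tokens as the residual stream means either affordance branch can be dropped, as in the ablation, without changing the module's interface.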
**Figure 8:** **Generalization on object pose.** CoA can pick up objects with unseen poses, benefiting from grasp affordance.


7.2. Ablation Study on Dynamic Affordance Selection

Utilizing all affordances can be computationally expensive and time-consuming. Therefore, we introduce a dynamic affordance selection mechanism, which selectively utilizes only the most relevant affordances at each time step. As demonstrated in Table 5, our method outperforms a baseline model that employs all affordances indiscriminately. Surprisingly, using all affordances results in a lower average success rate compared to our dynamic selection approach. We hypothesize that the irrelevant affordances introduce noise during the optimization process, hindering the model's learning ability. To further analyze the impact of dynamic selection, we measured inference speed on an NVIDIA 3090 GPU, averaging the running time over all tasks with each task measured across 5 trials. Our results show that utilizing all affordances significantly impacts inference speed, causing the model to run 6 times slower than our proposed method. This highlights the substantial efficiency gains achieved through dynamic affordance selection.
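As a sketch of what such a selector might look like, the snippet below keeps only the affordances relevant to the current manipulation phase; the phase names and the phase-based rule are illustrative assumptions rather than the paper's actual selection mechanism.

```python
# Sketch of dynamic affordance selection: condition the policy only on the
# affordances relevant to the current phase instead of all four at every step.
from enum import Enum, auto

class Phase(Enum):
    APPROACH = auto()   # locating and reaching the target object
    GRASP = auto()      # closing in on a grasp point
    TRANSPORT = auto()  # carrying the object to its goal and placing it

def select_affordances(phase: Phase) -> list[str]:
    if phase is Phase.APPROACH:
        return ["object"]              # what to manipulate and where it is
    if phase is Phase.GRASP:
        return ["object", "grasp"]     # where to grip
    return ["spatial", "movement"]     # where to place, how to move safely

for phase in Phase:
    print(phase.name, "->", select_affordances(phase))
```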

7.3. Generalization to Unseen Object Pose

We assessed CoA-VLA’s ability to generalize to previously unseen object orientations, as illustrated in Figure 8. Our evaluation focused on two objects: a hammer and a teapot. In the training phase, both objects were consistently presented with their handles oriented vertically relative to the robot. To test the model's generalization capabilities, we introduced novel poses that were absent from the training data, challenging CoA-VLA to grasp these objects in unfamiliar orientations. We observed that CoA-VLA successfully managed most scenarios, demonstrating a remarkable ability to adapt to new object poses even without explicit training on these orientations. In contrast, OpenVLA succeeded only in the simplest cases, struggling with more complex orientations. However, when the objects were positioned horizontally relative to the robot, all models, including CoA-VLA, were unsuccessful in achieving a stable grasp. Despite this limitation, our grasp affordance approach shows promising results, enabling CoA-VLA to handle a wide range of novel object poses.

References

Show me a brief summary.

In this section, the references catalog the foundational and state-of-the-art works underpinning vision-language-action models and robotic manipulation research. The citations span several key areas: diffusion-based policy learning methods like DiffusionVLA and Diffusion Policy that enable visuomotor control through action diffusion; large-scale vision-language-action models such as OpenVLA, RT-2, and Octo that leverage web-scale data for generalist robotic policies; affordance-based approaches including grasp prediction networks like AnyGrasp and Contact-GraspNet, as well as spatial and relational affordance frameworks; reasoning techniques borrowed from large language models, particularly chain-of-thought and tree-of-thought prompting methods; benchmark datasets and tasks like LIBERO for evaluating imitation learning; and foundational computer vision tools such as Segment Anything, CoTracker, and Grounding DINO for perception. Together, these works establish the technical foundation for integrating affordance reasoning, multimodal perception, and policy learning to advance robotic manipulation capabilities across diverse, real-world scenarios.

[1] Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Chengmeng Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. DiffusionVLA: Scaling robot foundation models via unified diffusion and autoregression. 2024a.
[2] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024a.
[3] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
[4] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.
[5] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model.
[6] Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023.
[7] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020.
[8] Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13438–13444. IEEE, 2021a.
[9] Adithyavairavan Murali, Weiyu Liu, Kenneth Marino, Sonia Chernova, and Abhinav Gupta. Same object, different grasps: Data and semantic knowledge for task-oriented grasping. In Conference on robot learning, pages 1540–1557. PMLR, 2021.
[10] Weiyu Liu, Chris Paxton, Tucker Hermans, and Dieter Fox. Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects. In 2022 International Conference on Robotics and Automation (ICRA), pages 6322–6329. IEEE, 2022.
[11] Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021.
[12] Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. M2t2: Multi-task masked transformer for object-centric pick and place. arXiv preprint arXiv:2311.00926, 2023.
[13] Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542, 2021.
[14] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024b.
[15] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023b.
[16] Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024b.
[17] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024c.
[18] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.
[19] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024.
[20] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024.
[21] Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate representations for robot manipulation, 2024.
[22] Anonymous. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. under review.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[24] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 17682–17690, 2024.
[25] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[26] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
[27] Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474, 2023a.
[28] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
[29] Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models. arXiv preprint arXiv:2403.08248, 2024a.
[30] Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, and Ranjay Krishna. Manipulate-anything: Automating real-world robots using vision-language models. arXiv preprint arXiv:2406.18915, 2024.
[31] Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, and Hongsheng Li. A3vlm: Actionable articulation-aware vision language model. arXiv preprint arXiv:2406.07549, 2024b.
[32] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024.
[33] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Max Li Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2024.
[34] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
[35] Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
[36] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv e-prints, pages arXiv–2501, 2025.
[37] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Max Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2024.
[38] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024b.
[39] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023b.
[40] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0\pi_0π0: A vision-language-action flow model for general robot control, 2024.
[41] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023a.
[42] Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation, 2024.
[43] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation, 2024a.
[44] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[45] Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13438–13444. IEEE, 2021b.
[46] Zhixiang Xu, Gao Huang, Kilian Q Weinberger, and Alice X Zheng. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 522–531, 2014.
[47] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[48] Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024.
[49] OpenAI. ChatGPT.
[50] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[51] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[52] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[53] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
[54] Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint arXiv:2410.11831, 2024.
[55] Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation. arXiv preprint arXiv:2409.14411, 2024.
[56] Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024.