Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
# Executive Summary
Purpose and Context
Vision-Language-Action (VLA) models represent a transformative advancement in artificial intelligence, unifying visual perception, natural language understanding, and physical action within a single computational framework. Before VLA models emerged around 2021–2022, robotic and AI systems operated through fragmented pipelines—vision systems could recognize objects, language models could process text, and control systems could execute movements, but these capabilities functioned in isolation. This separation created brittle systems unable to generalize across tasks or adapt to novel environments. VLA models address this limitation by integrating all three modalities into end-to-end learnable architectures, enabling robots and autonomous agents to perceive their surroundings, interpret complex instructions, and execute appropriate actions dynamically. This review consolidates recent progress in VLA research, covering over 80 models published between 2022 and 2025, to provide researchers, engineers, and decision-makers with a comprehensive understanding of the field's current state, applications, and outstanding challenges.
Approach and Methodology
This review adopts a systematic literature analysis framework organized around five thematic pillars: conceptual foundations, architectural progress, training efficiency, real-world applications, and technical challenges. The conceptual foundation section traces VLA evolution from isolated modalities to unified agents, explaining how vision encoders (such as Vision Transformers), language models (such as LLaMA-2 or GPT variants), and action decoders are integrated through token-based representations. These tokens—prefix tokens encoding scene context and instructions, state tokens representing robot configuration, and action tokens specifying motor commands—are processed autoregressively, similar to text generation but producing physical actions instead of words.
The progress section examines architectural innovations including early fusion models that preserve semantic alignment from pretraining, dual-system architectures separating fast reactive control from slower deliberative planning, and self-correcting frameworks that detect and recover from failures autonomously. Training efficiency strategies are analyzed, including parameter-efficient methods like Low-Rank Adaptation (LoRA), quantization, model pruning, and inference acceleration techniques such as parallel decoding and compressed action tokenization. The applications section reviews deployment across six domains: humanoid robotics, autonomous vehicles, industrial automation, healthcare and medical robotics, precision agriculture, and augmented reality navigation. The challenges section identifies and categorizes limitations across real-time inference constraints, multimodal action representation, safety assurance, dataset bias, system integration complexity, computational demands, robustness to environmental variability, and ethical deployment considerations.
Main Findings
Architectural maturation: VLA models have evolved through three distinct phases. From 2022–2023, foundational models like CLIPort, RT-1, and VIMA established basic visuomotor coordination through multimodal fusion. In 2024, specialization emerged with domain-specific models incorporating 3D perception, tactile sensing, and memory-efficient architectures. By 2025, current systems prioritize generalization and safety-critical deployment, integrating formal verification, whole-body control for humanoids, and cross-embodiment transfer learning. Models like NVIDIA's Groot N1 demonstrate dual-system architectures combining 10ms-latency diffusion policies for low-level control with LLM-based planners for task decomposition, achieving 17% higher success rates than monolithic models on multi-stage household tasks.
Training efficiency breakthroughs: Co-fine-tuning on web-scale vision-language corpora (LAION-5B) and robotic trajectory datasets (Open X-Embodiment) enables strong generalization with fewer parameters—OpenVLA's 7-billion-parameter model outperforms 55-billion-parameter variants by 16.5% through this approach. Parameter-efficient adaptation via LoRA reduces trainable weights by 70% while maintaining performance, cutting GPU compute time from weeks to under 24 hours on commodity hardware. Quantization to 8-bit integers shrinks models by half with only 3–5% accuracy loss, enabling deployment on edge devices like Jetson platforms. Compressed action tokenization (FAST) achieves 15× faster inference by encoding control sequences as frequency-domain tokens, supporting 200 Hz policy rates critical for dexterous manipulation.
Application demonstrations: In humanoid robotics, systems like Helix perform full-body manipulation at 200 Hz, generalizing across unseen objects without task-specific retraining. Autonomous vehicle models (CoVLA, OpenDriveVLA, ORION) process multi-view sensor streams and natural language instructions to generate interpretable driving trajectories, achieving state-of-the-art planning accuracy and visual question-answering performance. Industrial VLAs such as CogACT outperform earlier models by 59% on real-world manipulation tasks through diffusion-based action modeling and rapid cross-embodiment adaptation. In healthcare, RoboNurse-VLA demonstrates real-time surgical instrument handover with robustness to tool novelty and dynamic operating room conditions. Agricultural applications show VLA-equipped robots achieving selective fruit picking with minimal crop damage and adaptive irrigation reducing water usage by 30%.
Persistent limitations: Real-time inference remains constrained—autoregressive decoding typically achieves only 3–5 Hz, far below the 100+ Hz required for precise robotic control. Parallel decoding methods like those in Groot N1 offer 2.5× speedups but introduce trajectory smoothness trade-offs unacceptable in sensitive applications. Multimodal action representation suffers from discrete tokenization imprecision (errors in 256-bin quantization schemes) or continuous MLP mode collapse, while diffusion-based alternatives incur 3× computational overhead. Safety mechanisms exhibit 200–500ms latency and collision prediction accuracy of only 82% in cluttered environments. Dataset bias affects approximately 17% of object associations, causing 23% reference-missing rates in novel settings. Generalization drops 40% on entirely novel tasks due to overfitting narrow training distributions. System integration faces temporal mismatches between 800ms LLM planning and 10ms control loops, and energy demands of 7-billion-parameter models (28+ GB VRAM) exceed edge hardware capacities.
Implications and Significance
These findings reveal that VLA models have transitioned from research prototypes to deployable systems in controlled environments, but significant gaps remain before widespread real-world adoption. The dual-system architecture breakthrough demonstrates that separating strategic reasoning from reactive control can substantially improve performance on complex tasks, suggesting this design pattern should guide future development. Training efficiency advances democratize VLA technology—smaller research groups and companies can now fine-tune billion-parameter models on consumer-grade GPUs, accelerating innovation outside well-resourced labs.
Application demonstrations prove VLAs can handle safety-critical domains like surgery and autonomous driving when properly validated, but the 82% collision prediction accuracy and 200–500ms emergency stop latency indicate current systems cannot yet meet the reliability standards required for unsupervised operation in high-stakes environments. The 40% performance degradation on novel tasks highlights that despite multimodal learning, today's VLAs still require substantial task-specific data, limiting their value proposition as "generalist" agents. The pervasive dataset bias affecting 17% of associations raises concerns about equitable deployment—systems trained predominantly on Western, urban datasets may fail or exhibit biased behavior when deployed in diverse global contexts.
From a cost perspective, training efficiency breakthroughs reduce development expenses from millions of dollars in cloud compute to tens of thousands for academic-scale projects, enabling broader participation. However, inference costs remain high—real-time VLA operation on embedded hardware requires specialized accelerators (Jetson, tensor cores) or model compression trade-offs that sacrifice accuracy. Organizations must weigh deployment context carefully: applications tolerating 3–5 Hz inference (warehouse sorting, agricultural monitoring) can use current VLAs profitably, while applications requiring 100+ Hz (surgical robotics, high-speed manipulation) need further algorithmic and hardware advances.
Recommendations and Next Steps
For immediate deployment (0–12 months): Organizations should deploy VLAs in applications tolerating moderate latency (3–10 Hz) and accepting 15–20% failure rates with human oversight: warehouse sorting, agricultural monitoring, simple pick-and-place tasks, and guided navigation in structured environments. Adopt dual-system architectures separating planning from control, use parameter-efficient fine-tuning (LoRA) to adapt pretrained models like OpenVLA to specific tasks with minimal data (reducing development time by 60–70%), and implement quantization for edge deployment where real-time inference is needed. Establish human-in-the-loop mechanisms for error recovery and continuous learning, and curate domain-specific validation datasets to audit for bias before deployment.
For medium-term development (1–3 years): Research priorities should focus on accelerating inference through model-architecture co-design with hardware manufacturers to achieve 100+ Hz on edge devices, developing hybrid action representations balancing discrete precision with continuous flexibility, and integrating formal verification methods to guarantee safety properties for critical applications. Expand cross-embodiment datasets and training methods to enable zero-shot transfer across robot morphologies, reducing per-platform development costs. Create standardized benchmarks for safety, robustness, and bias evaluation specific to VLA systems, as current metrics inadequately capture real-world deployment requirements. Pilot VLA systems in semi-autonomous modes in healthcare (surgical assistance with surgeon oversight) and transportation (driver-assist rather than full autonomy) to build reliability data and regulatory acceptance.
For long-term vision (3+ years): The field should pursue convergence of VLAs with agentic AI systems capable of self-supervised continual learning—robots generating their own exploration objectives and improving skills autonomously over deployment lifetimes. Develop hierarchical neuro-symbolic planning integrating interpretable task decomposition with learned motor primitives, enabling transparent decision-making required for regulated domains. Advance world models providing real-time predictive simulation to enable model-based corrective actions during unexpected events. Establish comprehensive regulatory frameworks addressing VLA safety certification, liability allocation, and ethical deployment standards before large-scale autonomous operation. Foster cross-disciplinary collaboration between robotics, AI ethics, human factors, and domain experts (surgeons, farmers, logistics operators) to ensure VLA systems augment rather than displace human expertise.
Decision points requiring attention: Organizations must decide whether to build proprietary VLA systems or adopt open-source foundations like OpenVLA—the latter reduces initial investment but may limit competitive differentiation. Policymakers should determine appropriate oversight mechanisms balancing innovation enablement with public safety, particularly for VLAs in healthcare, transportation, and public spaces. Researchers must choose between pursuing incremental improvements to existing transformer-based architectures versus exploring alternative paradigms (state-space models, neuromorphic computing) that might offer step-change efficiency gains.
Limitations and Confidence
This review's primary limitation is its literature-based methodology—findings reflect published research, which may lag proprietary industrial developments by 6–18 months, particularly from well-resourced companies (Google DeepMind, NVIDIA, Tesla). Performance metrics across studies vary in evaluation protocols, making direct comparisons imprecise; stated success rates should be interpreted as indicative trends rather than precise benchmarks. The rapid pace of VLA development (47 major models in three years) means conclusions may be quickly superseded by new architectures or training methods.
Confidence in architectural findings is high—the convergence toward dual-system designs and token-based representations across independent research groups indicates robust design principles. Confidence in application readiness is moderate for structured industrial settings but low for unstructured real-world deployment, as most reported results come from controlled laboratory or simulation environments rather than extended field trials. The gap between simulation and real-world performance typically ranges from 15–35% across robotics applications.
Confidence in the identified challenges is high—real-time inference constraints, safety limitations, and dataset bias are consistently reported across diverse research groups and domains. However, confidence in proposed solutions is moderate to low, as most represent emerging research directions rather than validated approaches. Parameter-efficient training methods (LoRA, quantization) are well-established with reproducible results, but advanced concepts like neuro-symbolic planning and cross-embodiment generalization remain largely aspirational.
Readers should exercise caution when extrapolating laboratory success rates to production environments—expect 20–40% degradation without extensive domain-specific validation. Safety-critical applications (surgery, autonomous driving) should not rely on current VLA systems without redundant safeguards and human oversight, as failure modes remain incompletely characterized. Organizations evaluating VLA adoption should conduct pilot deployments in low-risk scenarios before scaling, allocating 30–50% of project budgets to dataset curation, bias auditing, and failure-mode analysis rather than model development alone.
b The Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong
c University of the Peloponnese, Department of Informatics and Telecommunications, Greece
Email address: [email protected] (Manoj Karkee)
Abstract
1. Introduction
In this section, the fundamental limitation of pre-VLA artificial intelligence is exposed: vision, language, and action systems operated in isolation, unable to collaborate or generalize beyond narrow tasks, leaving robots incapable of flexible real-world behavior. Traditional computer vision models could recognize objects but not understand language or execute actions; language models processed text without perceiving the physical world; and action-based robotics required hand-crafted policies that failed to adapt. Even vision-language models, despite achieving impressive multimodal understanding, lacked the critical ability to generate executable actions, resulting in fragmented pipelines that could not unify perception, reasoning, and control. Vision-Language-Action models emerged around 2021–2022, and were soon exemplified by systems like RT-2, bridging this gap by integrating all three modalities into a single end-to-end framework using action tokens and internet-scale multimodal datasets. This breakthrough enables robots to perceive environments, interpret natural language instructions, and execute adaptive actions, representing a transformative step toward truly generalizable embodied intelligence.
2. Concepts of Vision-Language-Action Models
In this section, VLA models emerge as unified AI frameworks that overcome the fragmentation of isolated vision, language, and action systems by jointly processing visual inputs, natural language instructions, and motor control within a single end-to-end architecture. Unlike traditional pipelines requiring manual interfaces and domain-specific engineering, VLAs leverage multimodal integration through pretrained encoders and transformer-based fusion to enable context-aware reasoning and generalization across novel tasks. Their token-based representation framework unifies perceptual, linguistic, and action spaces using prefix tokens for contextual grounding, state tokens for proprioceptive awareness, and autoregressively generated action tokens for precise control execution. Training combines internet-scale vision-language data with robotic demonstrations through imitation learning, reinforcement learning, and retrieval-augmented methods, while adaptive control mechanisms enable real-time behavioral adjustments via continuous sensor feedback. This paradigm shift transforms robots from brittle, task-specific machines into flexible, semantically grounded agents capable of interpreting complex instructions and executing dynamic manipulations in unstructured environments.
2.1. Evolution and Timeline
- Foundational Integration (2022–2023). Early VLAs established basic visuomotor coordination through multimodal fusion architectures. [22] first combined CLIP embeddings with motion primitives, while [23] demonstrated generalist capabilities across 604 tasks. [24] achieved 97% success rates in manipulation through scaled imitation learning, and [25] introduced temporal reasoning via transformer-based planners. By 2023, [19] enabled visual chain-of-thought reasoning, and [26] advanced stochastic action prediction through diffusion processes. These foundations addressed low-level control but lacked compositional reasoning [27], prompting innovations in affordance grounding [28].
- Specialization and Embodied Reasoning (2024). Second-generation VLAs incorporated domain-specific inductive biases. [29] enhanced few-shot adaptation through retrieval-augmented training, while [30] optimized navigation via 3D scene-graph integration. [31] introduced reversible architectures for memory efficiency, and [32] addressed partial observability with physics-informed attention. Simultaneously, [33] improved compositional understanding through object-centric disentanglement, and [34] extended applications to autonomous driving via multi-modal sensor fusion. These advances required new benchmarking methodologies [21].
- Generalization and Safety-Critical Deployment (2025). Current systems prioritize robustness and human alignment. [35] integrated formal verification for risk-aware decisions, while [36] demonstrated whole-body control through hierarchical VLAs. [37] optimized compute efficiency for embedded deployment, and [38] combined neural-symbolic reasoning for causal inference. Emerging paradigms like [39]'s affordance chaining and [40]'s sim-to-real transfer learning address cross-embodiment challenges, while [41] bridges VLAs with human-in-the-loop interfaces through natural language grounding.
2.2. Multimodal Integration: From Isolated Pipelines to Unified Agents
2.3. Tokenization and Representation: How VLAs Encode the World
- **Prefix Tokens: Encoding Context and Instruction:** Prefix tokens serve as the contextual backbone of VLA models [63, 20]. These tokens encode the environmental scene (via images or video) and the accompanying natural language instruction into compact embeddings that prime the model's internal representations [64]. For instance, as depicted in Figure 7, in a task such as “stack the green blocks on the red tray,” the image of a cluttered tabletop is processed through a vision encoder like ViT or ConvNeXt, while the instruction is embedded by a large language model (e.g., T5 or LLaMA). These are then transformed into a sequence of prefix tokens that establish the model’s initial understanding of the goal and environmental layout. This shared representation enables cross-modal grounding, allowing the system to resolve spatial references (e.g., “on the left,” “next to the blue cup”) and object semantics (“green blocks”) across both modalities.
- **State Tokens: Embedding the Robot's Configuration:** In addition to perceiving external stimuli, VLAs must be aware of their internal physical state [65, 44]. This is achieved through the use of state tokens, which encode real-time information about the agent's configuration—joint positions, force-torque readings, gripper status, end-effector pose, and even the locations of nearby objects [66]. These tokens are crucial for ensuring situational awareness and safety, especially during manipulation or locomotion [67, 68]. Figure 8 illustrates how VLA models utilize state tokens to enable dynamic, context-aware decision-making in both manipulation and navigation settings. In Figure 8a, a robot arm is shown partially extended near a fragile object. In such scenarios, state tokens play a critical role by encoding real-time proprioceptive information, such as joint angles, gripper pose, and end-effector proximity. These tokens are continuously fused with visual and language-based prefix tokens, allowing the transformer to reason about physical constraints. The model can thus infer that a collision is imminent and adjust the motor commands accordingly—e.g., rerouting the arm trajectory or modulating force output. In mobile robotic platforms, as depicted in Figure 8b, state tokens encapsulate spatial features such as odometry, LiDAR scans, and inertial sensor data. These are essential for terrain-aware locomotion and obstacle avoidance. The transformer model integrates this state representation with environmental and instructional context to generate navigation actions that dynamically adapt to changing surroundings. Whether grasping objects in cluttered environments or autonomously navigating uneven terrain, state tokens provide a structured mechanism for situational awareness, enabling the autoregressive decoder to produce precise, context-informed action sequences that reflect both internal robot configuration and external sensory data.
- **Action Tokens: Autoregressive Control Generation:** The final layer of the VLA token pipeline involves action tokens [69, 70], which are autoregressively generated by the model to represent the next step in motor control [65]. Each token corresponds to a low-level control signal, such as joint angle updates, torque values, wheel velocities, or high-level movement primitives [71]. During inference, the model decodes these tokens one step at a time, conditioned on prefix and state tokens, effectively turning VLA models into language-driven policy generators [72, 73]. This formulation allows seamless integration with real-world actuation systems, supports variable-length action sequences [74, 75], and enables model fine-tuning via reinforcement or imitation learning frameworks [76]. Notably, models like RT-2 [19] and PaLM-E [77] exemplify this design, where perception, instruction, and embodiment are merged into a unified token stream. For instance, in the apple-picking task as depicted in Figure 9, the model may receive prefix tokens that include the image of the orchard and the text instruction. The state tokens describe the robot's current arm posture and whether the gripper is open or closed. Action tokens are then predicted step by step to guide the robotic arm toward the apple, adjust the gripper orientation, and execute a grasp with appropriate force. The beauty of this approach is that it allows transformers, which are traditionally used for text generation, to now generate sequences of physical actions in a manner similar to generating a sentence—only here, the sentence is the motion. (A minimal sketch of how these three token streams are fused before decoding follows this list.)
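As a concrete and deliberately simplified illustration, the sketch below shows how the three token streams might be combined into the single sequence consumed by the autoregressive decoder. The function and tensor names are hypothetical rather than taken from any specific model; in practice, fusion typically amounts to concatenation along the sequence axis once each modality has been projected into the shared model dimension.

```python
# Illustrative sketch (hypothetical names): fusing prefix, state, and action-history
# tokens into one sequence for the autoregressive action decoder.
import torch

def fuse_tokens(prefix_tokens: torch.Tensor,    # (N_p, d) scene + instruction embeddings
                state_tokens: torch.Tensor,     # (N_s, d) joint angles, gripper status, ...
                action_history: torch.Tensor    # (N_a, d) previously emitted action tokens
                ) -> torch.Tensor:
    # All streams share the model dimension d, so fusion reduces to concatenation
    # along the sequence axis; the decoder then attends across every modality.
    return torch.cat([prefix_tokens, state_tokens, action_history], dim=0)
```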
The Transformer object is initialized with 12 layers, a model dimension of 512, and 8 attention heads. The fused tokens are passed to the decoder, which autoregressively predicts the next most likely action token conditioned on previous tokens and context. The final motor command sequence is obtained by detokenizing the output. This implementation mirrors how text generation works in large language models, but here the “sentence” is a motion trajectory—a novel repurposing of natural language generation techniques for physical action synthesis.

```python
# Python-like pseudocode
def predict_actions(fused_tokens):
    transformer = Transformer(
        num_layers=12,
        d_model=512,
        nhead=8,
    )
    # Autoregressively predict the next most likely action token,
    # conditioned on previously generated tokens and the fused context.
    action_tokens = transformer.decode(
        fused_tokens,
        memory=fused_tokens,
    )
    # Detokenize the output into the final motor command sequence.
    return detokenize(action_tokens)
```

2.4. Learning Paradigms: Data Sources and Training Strategies
2.5. Adaptive Control and Real-Time Execution
3. Progress in Vision-Language-Action Models
In this section, Vision-Language-Action (VLA) models emerged from the convergence of large language models like ChatGPT and multimodal vision-language systems such as CLIP, which established robust visual-text alignment through contrastive learning on web-scale datasets. The creation of large robotic datasets, notably RT-1's 130,000 demonstrations, enabled action-grounding essential for training models that unify perception, language, and motor control. Architectural innovations followed rapidly: RT-2 pioneered autoregressive action token generation using transformer decoders, while dual-system architectures like NVIDIA's Groot N1 separated fast reactive control from strategic planning. Models evolved from early fusion designs that preserve CLIP alignment to self-correcting frameworks capable of failure recovery through chain-of-thought reasoning. Training paradigms shifted toward co-fine-tuning on internet-scale vision-language corpora and robotic trajectory data, with techniques like LoRA adapters reducing computational costs by 70%. These advances democratized VLA technology, enabling generalization across tasks, embodiments, and domains while balancing real-time execution with high-level cognitive planning.
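To make the dual-system idea concrete, the sketch below pairs a slow deliberative planner with a fast reactive policy in a single control loop. All class and method names are hypothetical placeholders rather than any specific model's API; the point is only the structure, in which the planner refreshes the subgoal at roughly 1 Hz while the low-level policy issues a command at every control tick.

```python
# Illustrative dual-system control loop (hypothetical interfaces).
PLAN_EVERY = 100  # planner runs every 100 control ticks, e.g. ~1 Hz against a ~100 Hz loop

def run_dual_system(planner, policy, robot, instruction, n_steps=1000):
    subgoal = None
    for t in range(n_steps):
        obs = robot.get_observation()                  # images + proprioception
        if t % PLAN_EVERY == 0:                        # slow, deliberative "System 2"
            subgoal = planner.plan(obs, instruction)   # e.g. LLM-based task decomposition
        action = policy.act(obs, subgoal)              # fast, reactive "System 1"
        robot.apply(action)                            # send motor command
```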
In this section, the references catalog a comprehensive body of work spanning foundational vision-language research, neural architectures, and robotics applications that underpin the development of Vision-Language-Action (VLA) models. Early studies established multimodal integration through recurrent convolutional networks for visual recognition and description, while generative pre-training and transformer architectures enabled scalable language understanding. Subsequent advances introduced robot-specific policies like RT-1, RT-2, and VIMA, which unified visual perception, natural language instructions, and action execution through end-to-end learning. The literature also addresses critical challenges including real-time inference optimization via quantization and low-rank adaptation, generalization through cross-embodiment transfer and meta-learning, and ethical concerns such as bias in vision-language datasets and safe human-robot interaction. Collectively, these works trace the evolution from isolated perception-action modules to integrated, instruction-following agents capable of operating in complex, unstructured environments, while highlighting ongoing efforts to enhance efficiency, robustness, and societal alignment in embodied AI systems.
3.1. Architectural Innovations in VLA Models
• Language Encoder: CLIP-GPT
• Action Decoder: LingUNet | Self-collected [SC] | Combines semantic CLIP features with spatial Transporter network for precise SE(2) manipulation. |
• Language Encoder: Universal Sentence Encoder
• Action Decoder: Transformer | RT-1-Kitchen [SC] | Pioneering Transformer architecture with discretized actions for multi-task kitchen manipulation. |
• Language Encoder: PaLI-X/PaLM-E
• Action Decoder: Symbol-tuning | VQA + RT-1-Kitchen | First large VLA co-finetuned on internet VQA data and robot data for emergent capabilities. |
• Language Encoder: SentencePiece
• Action Decoder: Transformer | Self-collected [SC] | Generalist agent handling Atari, captioning, and robotics through unified tokenization. |
• Language Encoder: T5
• Action Decoder: Transformer | VIMA-Data [SC] | Multi-modal prompt handling with 6 types of vision-language grounding tasks. |
• Language Encoder: —
• Action Decoder: CVAE-Transformer | ALOHA [SC] | Temporal ensembling for smooth bimanual manipulation with 0.1mm precision. |
• Language Encoder: T5-base
• Action Decoder: Diffusion Transformer | Open X-Embodiment | First policy trained on 4M+ robot trajectories from 22 robot types. |
• Language Encoder: GPT-4
• Action Decoder: MPC | Zero-shot | LLM+VLM composition for constraint-aware motion planning without training. |
• Language Encoder: —
• Action Decoder: U-Net/Transformer | Self-collected [SC] | Pioneering diffusion-based visuomotor policy handling multimodal action distributions. |
• Language Encoder: Prismatic-7B
• Action Decoder: Symbol-tuning | OXE + DROID | Open-source alternative to RT-2 with efficient LoRA fine-tuning. |
• Vision Encoder: PaliGemma VLM backbone
• Language Encoder: PaliGemma (multimodal)
• Action Decoder: 300M-parameter diffusion model
• Language Encoder: PaliGemma (multimodal)
• Action Decoder: Autoregressive Transformer with FAST (Frequency-space Action Sequence Tokenization) | Pi- Cross-Embodiment Robot dataset | Variant of Pi-0 optimized for high-frequency, real-time control using compressed action tokens; achieves up to 15x faster inference for discrete robot actions and strong generalization. |
• Vision Encoder: SigLIP + DINOv2 (multi-view)
• Language Encoder: Llama-2 7B
• Action Decoder: Parallel decoding with action chunking and L1 regression
• Language Encoder: Transformer-based language module
• Action Decoder: Diffusion Transformer with unified action space | 1M+ multi-robot episodes (46 datasets), fine-tuned on 6K+ bimanual ALOHA episodes | 1.2B-parameter diffusion foundation model for bimanual manipulation; excels at language-conditioned, dexterous control and zero-shot generalization, with strong but task-specific performance in multi-object settings. |
• Language Encoder: Integrated with VLM for broad generalization and semantic comprehension
• Action Decoder: Transformer-based visuomotor policy (System 1) for continuous, full upper-body control at 200 Hz | End-to-end on Figure robot data (pixels and language to actions) | First VLA model for real-time, high-DoF humanoid control; enables zero-shot generalization, fine-grained dexterity, and collaborative multi-robot manipulation in open-world tasks. |
• Language Encoder: Llama-2 (via Prismatic-7B VLM)
• Action Decoder: Diffusion Transformer (DiT-Base, 300M parameters) | Open X-Embodiment (OXE) subset, real-world Realman & Franka tasks | Componentized VLA with specialized diffusion action transformer; outperforms OpenVLA by 59.1% in real-world success, excels at adaptation and generalization to new robots and unseen objects. |
• Language Encoder: Transformer-based language module for sequential reasoning prompts
• Action Decoder: Autoregressive and diffusion policy with affordance-conditioned outputs | LIBERO benchmark, real and simulated manipulation tasks | Incorporates reasoning via sequential affordances (object, grasp, spatial, movement); achieves superior LIBERO performance over OpenVLA, excelling in spatial reasoning and obstacle avoidance for precise task completion. |
• Language Encoder: Qwen2 (0.5B parameters)
• Action Decoder: Joint control prediction (non-autoregressive) | Bridge dataset, OXE, 1.2M text-image pairs | Lightweight VLA model optimized for edge devices (e.g., Jetson Nano) with 30–50 Hz inference; achieves performance comparable to OpenVLA while enabling efficient, real-time deployment on low-power hardware. |
• Language Encoder: Interleaved vision-language-action streaming
• Action Decoder: Transformer for GUI action sequence prediction | 256K high-quality GUI instruction-following dataset | Lightweight 2B-parameter VLA specialized for digital task automation; excels at GUI/web navigation and screenshot grounding with efficient token selection and unified vision-language-action reasoning. |
• Language Encoder: Integrated with VLM for high-level planning and reasoning
• Action Decoder: Diffusion Transformer (DiT) for precise, high-frequency action generation | Multimodal data: human demonstrations, robot trajectories, synthetic simulation, and internet video | Hybrid dual-system architecture for generalist humanoid robots, combining high-level planning with diffusion-based execution; enables dexterous, multi-step control and strong generalization across tasks and embodiments. |
• Language Encoder: Transformer-based language module
• Action Decoder: Autoregressive action prediction head | LIBERO benchmark | VLA model focused on visual perception and action prediction; achieves competitive results with strong visual grounding for manipulation, but is outperformed by OpenVLA-OFT.
• Language Encoder: Autoregressive reasoning module with next-token prediction
• Action Decoder: Diffusion policy head for robust action sequence generation | LIBERO benchmark, factory sorting, zero-shot bin-picking tasks | Leverages diffusion-based action modeling for precise control; demonstrates robustness and interpretability, but is less generalizable than CoA in spatial configurations. |
• Language Encoder: LLaMA-2 (task command + nav goal)
• Action Decoder: Two-level controller: topological graph planner + RL-based locomotion | Real-world legged robot nav demos | Modular hierarchy enables robust terrain generalization and 88% real-world nav success using natural language |
• Language Encoder: LLaMA 2 + voice-to-text encoder
• Action Decoder: Joint pose regression with gripper classifier | Surgical handover videos and voice prompts | Enables accurate, real-time surgical tool handover; strong robustness to tool novelty and dynamic OR scenes |
• Language Encoder: T5-based instruction encoder
• Action Decoder: Graph planner with visual goal localization | MINT dataset: vision-language instruction tours | Robust navigation from multimodal input; generalizes across large unseen spaces via topological mapping |
• Language Encoder: Compact language encoder (128-d)
• Action Decoder: Diffusion policy decoder (50M params) | Mini-ALOHA + SC tasks | Outperforms OpenVLA in speed and precision; does not require pretraining; inference 5x faster with minimal compute |
• Language Encoder: BERT + custom grounding adapter
• Action Decoder: Transformer for full-body command decoding | QUART dataset (locomotion + manipulation) | Quadruped-specific control with strong sim-to-real transfer and fine-grained instruction alignment |
• Language Encoder: Prismatic LLM with Mixture-of-Experts
• Action Decoder: Unified vision-language-action planner | Unified chat-action dataset (web, robot) | Excels at joint VQA and planning; mitigates forgetting; efficient across manipulation and conversational tasks |
• Language Encoder: LLaMA-2
• Action Decoder: Transformer with spatial token fusion | Few-shot spatial tasks (real + sim) | Excels at long-horizon and spatial reasoning tasks; avoids retraining by preserving pretrained 2D knowledge |
• Language Encoder: Prismatic-7B
• Action Decoder: Transformer with dynamic token reuse | ALOHA + real-world sim fusion | 40–50% faster inference with near-zero loss; dynamically reuses static features for real-time robotics |
• Language Encoder: LLaMA-2
• Action Decoder: Hybrid diffusion + autoregressive ensemble | RT-X + synthetic task fusion | Achieves robust control in complex multi-arm settings via dynamic ensemble; strong sim2real generalization |
• Language Encoder: CogKD-enhanced transformer
• Action Decoder: Sparse transformer with dynamic routing | RLBench + real-world manipulation tasks | Brain-inspired efficiency with 5.6x speedup; selective layer activation with high task success (+8%) |
• Language Encoder: GPT for instruction parsing
• Action Decoder: Transformer-based path planner | Satellite + UAV imagery instructions | Zero-shot aerial task planning; intuitive language grounding; scalable to large unmapped environments |
• Language Encoder: Transformer with grasp sequence reasoning
• Action Decoder: Diffusion controller for grasp pose generation | Dexterous grasping benchmark (sim + real) | 90%+ zero-shot success on diverse objects; excels at lighting, background variation, and unseen conditions |
• Language Encoder: VLM predicts bounding boxes and grasp poses
• Action Decoder: Flow-matching based action expert via Progressive Action Generation (PAG) | SynGrasp-1B (1B synthetic frames), GRIT (Internet grounding dataset) | First synthetic-data-pretrained grasping VLA; enables sim-to-real generalization, robust grasp policy via PAG; supports zero-shot and few-shot generalization to long-tail object classes and human-centric preferences |
• Language Encoder: Qwen2.5 for instruction parsing and visual-language verification
• Action Decoder: Continuous action predictor adapted from OpenVLA and π0 with diffusion-policy controller | Open Interleaved X-Embodiment (210k episodes from 11 real-world datasets) | First end-to-end VLA model for interleaved image-text instructions; improves out-of-domain generalization 2–3× and enables zero-shot execution from hand-drawn sketches and novel multimodal prompts
3.2. Training and Efficiency Advancements in Vision–Language–Action Models
- Data-Efficient Learning.
- Co-fine-tuning on massive vision–language corpora (e.g. LAION-5B) and robotic trajectory collections (e.g. Open X-Embodiment) aligns semantic understanding with motor skills. OpenVLA (7B parameters) achieves a 16.5% higher success rate than a 55B-parameter RT-2 variant, demonstrating that co-fine-tuning yields strong generalization with fewer parameters [116, 97, 70].
- Synthetic Data Generation via UniSim produces photorealistic scenes—including occlusions and dynamic lighting—to augment rare edge-case scenarios, improving model robustness in cluttered environments by over 20% [117, 42].
- Self-Supervised Pretraining adopts contrastive objectives (à la CLIP) to learn joint visual–text embeddings before action fine-tuning, reducing reliance on task-specific labels. Qwen2-VL leverages self-supervised alignment to accelerate downstream grasp-and-place convergence by 12% [4, 11].
- Parameter-Efficient Adaptation. Low-Rank Adaptation (LoRA) inserts lightweight adapter matrices into frozen transformer layers, cutting trainable weights by up to 70% while retaining performance [118]. The Pi-0 Fast variant uses merely 10M adapter parameters atop a static backbone to deliver continuous 200 Hz control with negligible accuracy loss [86].
- Inference Acceleration.
- Compressed Action Tokens (FAST) and Parallel Decoding in dual-system frameworks (e.g. Groot N1) yield 2.5× faster policy steps, achieving sub-5 ms latencies at a modest cost to trajectory smoothness [40, 73] (a toy sketch of the frequency-domain tokenization idea follows this list).
- Hardware-Aware Optimizations—including tensor-core quantization and pipelined attention kernels—shrink runtime memory footprints below 8 GB and enable real-time inference on embedded GPUs [69].
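The sketch below illustrates the frequency-domain intuition behind FAST-style compressed action tokens under simplifying assumptions: a chunk of continuous actions is DCT-transformed, truncated to its lowest-frequency coefficients, and quantized into a handful of integer tokens. This is a toy reconstruction, not the published FAST implementation (which further compresses the quantized coefficients), and all constants are illustrative.

```python
# Toy sketch of frequency-domain action tokenization (illustrative constants only).
import numpy as np
from scipy.fft import dct, idct

def encode_action_chunk(actions: np.ndarray, k: int = 16, step: float = 0.05) -> np.ndarray:
    """actions: (T, D) chunk, e.g. T=50 timesteps of D joint targets."""
    coeffs = dct(actions, axis=0, norm="ortho")[:k]           # keep K low-frequency terms
    return np.clip(np.round(coeffs / step), -127, 127).astype(np.int8)

def decode_action_chunk(tokens: np.ndarray, t: int, step: float = 0.05) -> np.ndarray:
    coeffs = np.zeros((t, tokens.shape[1]))
    coeffs[: tokens.shape[0]] = tokens.astype(np.float64) * step
    return idct(coeffs, axis=0, norm="ortho")                 # smooth trajectory back

chunk = np.cumsum(0.01 * np.random.randn(50, 7), axis=0)      # fake 7-DoF action chunk
tokens = encode_action_chunk(chunk)                           # (16, 7) int8 tokens
reconstruction = decode_action_chunk(tokens, t=50)            # close to the original chunk
```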
3.3. Parameter-Efficient Methods and Acceleration Techniques in VLA Models
- Low-Rank Adaptation (LoRA). LoRA injects small trainable rank-decomposition matrices into frozen transformer layers, enabling fine-tuning of billion-parameter VLAs with only a few million additional weights. In OpenVLA, LoRA adapters (20M parameters) tuned a 7B-parameter backbone on commodity GPUs in under 24 hours, cutting GPU compute by 70% compared to full backpropagation [118, 70]. Crucially, LoRA-adapted models retain their high-level language grounding and visual reasoning capabilities while adapting to new robotic manipulation tasks (e.g. novel object shapes), making large VLAs accessible to labs without supercomputing resources. A minimal code sketch of this adapter setup appears after this list.
- Quantization. Reducing weight precision to 8‐bit integers (INT8) shrinks model size by half and doubles on‐chip throughput. OpenVLA experiments show that INT8 quantization on Jetson Orin maintains 97% of full‐precision task success across pick‐and‐place benchmarks, with only a 5% drop in fine‐grained dexterity tasks [116, 70]. Complementary methods such as post‐training quantization with per‐channel calibration further minimize accuracy loss in high‐dynamic‐range sensor inputs [113]. These optimizations allow continuous control loops at 30 Hz on 50 W edge modules.
- Model Pruning. Structured pruning removes entire attention heads or feed‐forward sublayers identified as redundant. While less explored in VLA than in pure vision or language models, early studies on Diffusion Policy demonstrate that pruning up to 20% of ConvNet‐based vision encoders yields negligible performance degradation in grasp stability [26]. Similar schemes applied to transformer‐based VLAs (e.g. RDT‐1B) can reduce memory footprint by 25% with under 2% drop in task success, paving the way for sub‐4 GB deployments [135, 38].
- Compressed Action Tokenization (FAST). FAST reformulates continuous action outputs as frequency‐domain tokens, compressing long control sequences into concise descriptors. The Pi‐0 Fast variant achieved 15× faster inference with a 300 M‐parameter diffusion head by tokenizing 1000 ms action windows into 16 discrete tokens, enabling 200 Hz policy rates on desktop GPUs [86]. This approach trades minimal trajectory granularity for large speedups, suited for high‐frequency control in dynamic tasks like bimanual assembly.
- Parallel Decoding and Action Chunking. Autoregressive VLAs traditionally decode actions token by token, incurring sequential latency. Parallel decoding architectures (e.g. in Groot N1) decode groups of spatial–temporal tokens concurrently, achieving a 2.5× reduction in end‐to‐end latency on 7‐DoF arms at 100 Hz, with less than 3 mm positional error increase [40, 73]. Action chunking further abstracts multi‐step routines into single tokens (e.g. “pick‐and‐place‐cup”), cutting inference steps by up to 40% in long‐horizon tasks like kitchen workflows [25].
- Reinforcement Learning–Supervised Hybrid Training. The iRe‐VLA framework alternates between reinforcement learning (RL) in simulation and supervised fine‐tuning on human demonstrations to stabilize policy updates. By leveraging Direct Preference Optimization (DPO) to shape reward models and Conservative Q‐Learning to avoid extrapolation error, iRe‐VLA reduces sample complexity by 60% versus pure RL, while maintaining the semantic fidelity imparted by language‐conditioned priors [98, 102]. This hybrid approach yields robust policies for tasks with sparse feedback, such as dynamic obstacle avoidance.
- Hardware‐Aware Optimizations. Compiler‐level graph rewrites and kernel fusion (e.g. via NVIDIA TensorRT‐LLM) exploit target hardware features—tensor cores, fused attention, and pipelined memory transfers—to accelerate both transformer inference and diffusion sampling. In OpenVLA‐OFT, such optimizations reduced inference latency by 30% on RTX A2000 GPUs and lowered energy per inference by 25% compared to standard PyTorch execution [69]. This makes real‐time VLAs feasible on mobile robots and drones with strict power budgets.
- LoRA and quantization empower smaller labs to fine‐tune and operate billion‐parameter VLAs on consumer‐grade hardware, unlocking cutting‐edge semantic understanding for robots [118, 70].
- Pruning and FAST tokenization compress model and action representations, enabling sub‐4 GB, sub‐5 ms control loops without sacrificing precision in dexterous tasks [135, 86].
- Parallel decoding and action chunking overcome sequential bottlenecks of autoregressive policies, supporting 100–200 Hz decision rates needed for agile manipulation and legged locomotion [40, 73].
- Hybrid RL‐SL training stabilizes exploration in complex environments, while hardware‐aware compilation ensures real‐time performance on edge accelerators [98, 69].
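To ground the LoRA and quantization points above, here is a minimal sketch using the Hugging Face transformers, bitsandbytes, and peft libraries. The checkpoint name is a placeholder and the target module names assume a LLaMA-style language backbone; the concrete model class and layer names for a given VLA (e.g. OpenVLA) may differ.

```python
# Minimal LoRA + 8-bit loading sketch (placeholder checkpoint, LLaMA-style backbone assumed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-vla-backbone-7b",                            # hypothetical checkpoint name
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # INT8 weights
    torch_dtype=torch.float16,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)   # make the quantized model safe to fine-tune

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # attention projections of the frozen backbone
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)         # base stays frozen; only adapters train
model.print_trainable_parameters()             # typically well under 1% of total weights
```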
3.4. Applications of Vision-Language-Action Models
3.4.1. Humanoid Robotics
3.4.2. Autonomous Vehicle Systems
3.4.3. Industrial Robotics
3.4.4. Healthcare and Medical Robotics
3.4.5. Precision and Automated Agriculture
3.4.6. Interactive AR Navigation with Vision-Language-Action Models
4. Challenges and Limitations of Vision-Language-Action Models
In this section, Vision-Language-Action models confront critical barriers preventing their transition from research prototypes to reliable real-world systems. Real-time inference remains constrained by autoregressive decoding that achieves only 3–5 Hz, far below the 100+ Hz required for precise robotic control, while memory demands exceed embedded hardware capabilities. Multimodal action representation struggles with discrete tokenization imprecision and diffusion-based computational overhead, and safety mechanisms introduce dangerous 200–500 ms latencies in dynamic environments. Dataset bias pervades training corpora, causing 23% object reference failures in novel settings and 40% performance drops on unseen tasks due to overfitting. System integration complexity arises from temporal mismatches between high-level planning (800 ms) and low-level control (10 ms), alongside feature space misalignments that degrade sim-to-real transfer by 32%. Energy demands of 7-billion-parameter models exceed edge device capacity, while environmental variability reduces vision accuracy by 20–30% under poor lighting and occlusion, compounded by ethical concerns regarding privacy and bias propagation.
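A rough back-of-the-envelope calculation shows why token-by-token autoregressive decoding bottoms out in the single-digit-hertz range; the token count and per-token latency below are illustrative assumptions, not measured values from any particular model.

```python
# Illustrative latency arithmetic for autoregressive action decoding (assumed numbers).
tokens_per_chunk = 7 * 8         # e.g. 7 DoF x 8 discretized tokens per action chunk
latency_per_token_s = 0.005      # ~5 ms per decoder forward pass on an edge GPU
chunk_latency_s = tokens_per_chunk * latency_per_token_s
print(f"{chunk_latency_s * 1000:.0f} ms per chunk -> {1 / chunk_latency_s:.1f} Hz control rate")
# ~280 ms per chunk, i.e. roughly 3-4 Hz, versus the 100+ Hz needed for fine manipulation.
```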
4.1. Real-Time Inference Constraints
4.2. Multimodal Action Representation and Safety Assurance
4.3. Dataset Bias, Grounding, and Generalization to Unseen Tasks
4.4. System Integration Complexity and Computational Demands
4.5. Robustness and Ethical Challenges in VLA Deployment
5. Discussion
In this section, VLA models are shown to face six major challenges spanning real-time inference limitations, multimodal action representation and safety vulnerabilities, dataset bias and grounding errors, system integration complexity, computational demands, and robustness alongside ethical concerns, all of which collectively hinder practical deployment in real-world robotics and autonomous systems. To address these barriers, the discussion proposes targeted solutions including hardware accelerators and model compression techniques like LoRA and quantization to achieve sub-50 ms inference, hybrid policy architectures combining diffusion-based sampling with autoregressive planners for safe multimodal action representation, curated debiased datasets with meta-learning and sim-to-real fine-tuning for improved generalization, hardware-software co-design with modular adapters for efficient system integration, and domain randomization coupled with bias auditing and privacy-preserving inference to ensure robustness and ethical deployment. The future roadmap envisions VLAs evolving into generalist robotic intelligence through multimodal foundation models, agentic lifelong learning, hierarchical neuro-symbolic planning, real-time world models, cross-embodiment transfer, and built-in safety alignment, ultimately enabling human-centered, adaptable, and AGI-capable embodied agents.
5.1. Potential Solutions
- Real-Time Inference Constraints. Future research must develop VLA architectures that harmonize latency, throughput, and task-specific accuracy. One promising direction is the integration of specialized hardware accelerators—such as FPGA-based vision processors and tensor cores optimized for sparse matrix operations—to execute convolutional and transformer layers at sub-millisecond scales [70, 39]. Model compression techniques like Low-Rank Adaptation (LoRA) [118] and knowledge distillation can shrink parameter counts by up to 90%, reducing both memory footprint and inference time while retaining over 95% of original performance on benchmark tasks. Progressive quantization strategies that combine mixed-precision arithmetic (e.g., FP16/INT8) with block-wise calibration can further cut computation by 2–4× with minimal accuracy loss [69]. Adaptive inference architectures that dynamically adjust network depth or width based on input complexity—akin to early-exit branches in DeeR-VLA [29]—can reduce average compute by selectively bypassing transformer layers when visual scenes or linguistic commands are simple. Finally, efficient tokenization schemes leveraging subword patch embeddings and dynamic vocabulary allocation can compress visual and linguistic input into compact representations, minimizing token counts without sacrificing semantic richness [86]. Together, these innovations can enable sub-50 ms end-to-end inference on commodity edge GPUs, paving the way for latency-sensitive applications in autonomous drone flight, real-time teleoperation, and collaborative manufacturing.
- Multimodal Action Representation and Safety Assurance. Addressing multimodal action representation and robust safety requires end-to-end frameworks that unify perception, reasoning, and control under stringent safety constraints. Hybrid policy architectures combining diffusion-based sampling for low-level motion primitives [26] with autoregressive high-level planners [65] enable compact stochastic representations of diverse action trajectories, improving adaptability in dynamic environments. Safety can be enforced via real-time risk assessment modules that ingest multi-sensor fusion streams—visual, depth, and proprioceptive data—to predict collision probability and joint stress thresholds, triggering emergency stop circuits when predefined safety envelopes are breached [194, 195] (an illustrative safety-envelope gate of this kind is sketched after this list). Reinforcement learning algorithms augmented with constrained optimization (e.g., Lagrangian methods in SafeVLA [35]) can learn policies that maximize task success while strictly respecting safety constraints. Online model adaptation techniques—such as rule-based RL (GRPO) and Direct Preference Optimization (DPO)—further refine action selection under new environmental conditions, ensuring consistent safety performance across scenarios [196]. Crucially, embedding formal verification layers that symbolically analyze planner outputs before execution can guarantee compliance with safety invariants, even for neural-network–based controllers. Integrating these methodologies will produce VLA systems that not only execute complex, multimodal actions but do so with provable safety in unstructured, real-world settings.
- Dataset Bias, Grounding, and Generalization to Unseen Tasks. Robust generalization demands both broadened data diversity and advanced learning paradigms. Curating large-scale, debiased multimodal datasets—combining web-scale image–text corpora like LAION-5B [116] with robot-centric trajectory archives such as Open X-Embodiment [97]—lays the groundwork for equitable semantic grounding. Hard-negative sampling and contrastive fine-tuning of vision–language backbones (e.g., CLIP variants) can mitigate spurious correlations and enhance semantic fidelity [64, 79]. Meta-learning frameworks enable rapid adaptation to novel tasks by learning shared priors across task families, as demonstrated in vision-language robotic navigation models [46]. Continual learning algorithms—with replay buffers and regularization strategies—preserve old knowledge while integrating new concepts, addressing catastrophic forgetting in VLA models [31]. Transfer learning from 3D perception domains (e.g., point cloud reasoning in 3D-VLA [104]) can imbue models with spatial inductive biases, improving out-of-distribution robustness. Finally, simulation-to-real (sim2real) fine-tuning with domain randomization and real-world calibration—such as dynamic lighting, texture, and physics variations—ensures that policies learned in synthetic environments transfer effectively to physical robots [203, 204]. These combined strategies will empower VLAs to generalize confidently to unseen objects, scenes, and tasks in real-world deployments.
- System Integration Complexity and Computational Demands. To manage the intricate orchestration of multimodal pipelines under tight compute budgets, researchers must embrace model modularization and hardware–software co-design. Low-Rank Adaptation (LoRA) adapters can be injected into pre-trained transformer layers, enabling task-specific fine-tuning without modifying core weights [118]. Knowledge distillation from large “teacher” VLAs into lightweight “student” networks—using student-teacher mutual information objectives—yields compact models with 5–10× fewer parameters while retaining 90–95% of the teacher's task performance [69]. Mixed-precision quantization augmented by quantization-aware training can compress weights to 4–8 bits, cutting memory bandwidth and energy consumption by over 60% [70]. Hardware accelerators tailored for VLA workloads—supporting sparse tensor operations, dynamic token routing, and fused vision–language kernels—can deliver sustained 100+ TOPS throughput within a 20–30 W power envelope, meeting the demands of embedded robotic platforms [86, 65]. Toolchains like TensorRT-LLM [39] and TVM can optimize end-to-end VLA graphs for specific edge devices, fusing layers and precomputing static subgraphs. Emerging architectures such as TinyVLA demonstrate that sub-1B parameter VLAs can achieve near-state-of-the-art performance on manipulation benchmarks with real-time inference, charting a path for widespread deployment in resource-constrained settings.
- Robustness and Ethical Challenges in VLA Deployment. Ensuring VLA robustness and ethical integrity requires both technical and governance measures. Domain randomization and synthetic augmentation pipelines—like UniSim’s closed-loop sensor simulator—generate photorealistic variations in lighting, occlusion, and sensor noise, enhancing model resilience to environmental shifts [117]. Adaptive recalibration modules, which adjust perception thresholds and control gains based on real-time feedback, further mitigate drift and sensor degradation over prolonged operation. On the ethical front, bias auditing tools must scan training datasets for skewed demographic or semantic distributions, followed by corrective fine-tuning using adversarial debiasing and counterfactual augmentation [197, 79]. Privacy-preserving inferencing—via on-device processing, homomorphic encryption for sensitive data streams, and differential privacy during training—safeguards user data in applications like healthcare and smart homes [214, 215]. Socioeconomic impacts can be managed through transparent impact assessments and stakeholder engagement, ensuring that VLA adoption complements human labor through upskilling programs rather than displacing workers en masse. Finally, establishing regulatory frameworks and industry standards for VLA safety and accountability will underpin responsible innovation, balancing technical capabilities with societal values.
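As a concrete illustration of the risk-assessment idea raised in the safety bullet above, the sketch below gates each candidate action against a predefined safety envelope before execution. The thresholds, class names, and risk-model interface are hypothetical; a real system would pair such a gate with formally verified stop logic and redundant hardware interlocks.

```python
# Illustrative safety-envelope gate (hypothetical thresholds and interfaces).
from dataclasses import dataclass

@dataclass
class SafetyEnvelope:
    max_collision_prob: float = 0.05   # reject actions above 5% predicted collision risk
    max_joint_torque: float = 40.0     # N*m per joint, example limit

def gate_action(action, risk_model, state, envelope: SafetyEnvelope = SafetyEnvelope()):
    p_collision = risk_model.collision_probability(state, action)
    torques = risk_model.predicted_torques(state, action)
    if p_collision > envelope.max_collision_prob or max(torques) > envelope.max_joint_torque:
        return None   # signal the caller to trigger an emergency stop or replan
    return action
```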
5.2. Future Roadmap
- Multimodal Foundation Models as the “Cortex.” Today’s VLAs typically couple a vision-language backbone with task-specific policy heads. Tomorrow, we expect a single, massive multimodal foundation model—trained on web-scale image, video, text, and affordance data—to serve as a shared perceptual and conceptual “cortex.” This foundation will encode not only static scenes but also dynamics, physics, and common-sense world knowledge, enabling downstream action learners to tap into a unified representational substrate rather than reinventing basic perceptual skills for every robot or domain.
- Agentic, Self-Supervised Lifelong Learning. Rather than static pretraining, future VLAs will engage in continual, self-supervised interaction with their environments. Agentic frameworks—where the model generates its own exploration objectives, hypothesizes outcomes, and self-corrects via simulated or real rollouts—will drive rapid skill acquisition. By formulating internal sub-goals (“learn to open drawers,” “map furniture affordances”) and integrating reinforcement-style feedback, a VLA-driven humanoid could autonomously expand its capabilities over years of deployment, much like a human apprentice.
- Hierarchical, Neuro-Symbolic Planning. To scale from low-level motor primitives to high-level reasoning, VLAs will adopt hierarchical control architectures. A top-level language-grounded planner (perhaps an LLM variant fine-tuned for affordance reasoning) will decompose complex instructions (“prepare a cup of tea”) into sequences of sub-tasks (“fetch kettle,” “fill water,” “heat water,” “steep tea bag”). Mid-level modules will translate these into parameterized motion plans, and low-level diffusion or transformer-based controllers will generate smooth, compliant trajectories in real time. This neuro-symbolic blend ensures both the interpretability of structured plans and the flexibility of learned policies.
- Real-Time Adaptation via World Models. Robustness in unstructured settings demands that VLAs maintain an internal, predictive world model—an up-to-date simulation of objects, contacts, and agent dynamics. As the robot acts, it will continuously reconcile its predictions with sensor feedback, using model-based corrective actions when discrepancies arise (e.g., slipping grasp). Advances in differentiable physics and video-to-state encoders will make these world models both accurate and efficient enough for on-board, real-time use.
- Cross-Embodiment and Transfer Learning. The era of training separate VLAs for each robot morphology will give way to embodiment-agnostic policies. By encoding actions in an abstract, kinematic-agnostic space (e.g., “apply grasp force at these affordance points”), future VLAs will transfer skills seamlessly between wheeled platforms, quadrupeds, and humanoids. Combined with meta-learning, a new robot can bootstrap prior skills with only a few minutes of calibration data.
- Safety, Ethics, and Human-Centered Alignment. As VLAs gain autonomy, built-in safety and value alignment become non-negotiable. Future systems will integrate real-time risk estimators—assessing potential harm to humans or property before executing high-risk maneuvers—and seek natural language consent for ambiguous situations. Regulatory constraints and socially aware policies will be baked into the VLA stack, ensuring that robots defer to human preferences and legal norms.
6. Conclusion
In this section, the authors synthesize a comprehensive three-year review of Vision-Language-Action (VLA) models, tracing their evolution from isolated perception-action modules to unified, instruction-following robotic agents capable of integrating visual perception, natural language understanding, and physical action generation. The review systematically examines foundational concepts, tokenization strategies, learning paradigms—including supervised, imitation, and reinforcement learning—and architectural innovations across over 50 recent models, while addressing adaptive control, real-time execution, and deployment across six application domains: humanoid robotics, autonomous vehicles, industrial automation, healthcare, agriculture, and augmented reality navigation. Critical challenges are identified in real-time inference, safety assurance, bias mitigation, system integration, and ethical deployment, with proposed solutions encompassing model compression, cross-modal grounding, and agentic learning frameworks. The conclusion envisions VLA advancement as a convergence of vision-language models, adaptive architectures, and agentic AI systems, steering embodied robotics toward artificial general intelligence through intelligent, human-aligned, and contextually aware agents.
Acknowledgement
In this section, the authors acknowledge the financial support that enabled the research presented in the document. The work was funded by two major sources: the National Science Foundation and the United States Department of Agriculture's National Institute of Food and Agriculture. This support came through the Artificial Intelligence Institute for Agriculture Program, specifically under two awards—AWD003473 and AWD004595, with Accession Number 1029004. The funding was designated for a project titled "Robotic Blossom Thinning with Soft Manipulators," which aligns with the broader vision-language-action model research discussed throughout the document, particularly its application to agricultural robotics. This acknowledgment underscores the institutional backing necessary for advancing embodied AI systems capable of performing complex, real-world tasks in agricultural settings, thereby connecting foundational research in VLA models to practical, funded initiatives addressing industry-specific challenges.
Declarations
In this section, the authors provide a formal declaration of conflicts of interest, stating that none exist in relation to the research presented in this comprehensive review of vision-language-action models. This declaration serves as a standard ethical disclosure required in academic publications to ensure transparency and maintain the integrity of the research process. By explicitly confirming the absence of any financial, professional, or personal interests that could influence or appear to influence the work, the authors establish that the review was conducted without bias or external pressures that might compromise the objectivity of their analysis, findings, or recommendations regarding VLA model developments, applications, and future directions in embodied AI and robotic systems.
Statement on AI Writing Assistance
In this section, the authors disclose their use of AI writing tools to enhance the manuscript's linguistic quality and visual presentation. ChatGPT and Perplexity were employed to improve grammatical accuracy and refine sentence structure, with all AI-generated revisions subject to thorough human review and editing to ensure relevance and correctness. Additionally, ChatGPT-4o was utilized to generate realistic visualizations that support the document's technical content. This transparent acknowledgment establishes that while AI tools assisted in polishing language and creating illustrative figures, the substantive intellectual contributions, technical analysis, and scientific integrity of the work remain entirely under human oversight, ensuring that AI-generated content serves only as a refinement layer rather than a replacement for expert judgment and domain knowledge.
References