Core Concepts

What is VLA?

Vision-Language-Action (VLA) models are multimodal foundation models that combine vision, language, and robotic action capabilities. Given a camera image and a text instruction, a VLA directly outputs low-level robot actions.

Architecture

A typical VLA consists of three components; a minimal code sketch follows the list:

  1. Vision Encoder — Processes camera images (e.g., SigLIP, DINOv2, Qwen2ViT)
  2. Language Model — Processes text instructions (e.g., Llama 2, Dream-7B)
  3. Action Head — Outputs robot actions, either as discrete action tokens or as continuous values (e.g., via flow matching)
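
As a rough illustration of how these three stages fit together, here is a minimal PyTorch sketch. It is not any particular model's implementation: the class name, module sizes, vocabulary size, and the single-token image pooling are all placeholders standing in for a real vision encoder, LLM backbone, and action head.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Minimal sketch of the three-stage VLA pipeline (not a real model)."""

    def __init__(self, embed_dim=256, action_dim=7):
        super().__init__()
        # 1. Vision encoder: stand-in for SigLIP/DINOv2 (patchify + pool)
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # 2. Language model: stand-in for an LLM backbone
        self.token_embed = nn.Embedding(32_000, embed_dim)
        self.backbone = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        # 3. Action head: here a regression to a continuous 7-DoF action;
        #    a real model may instead emit discrete tokens or use flow matching
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, image, instruction_ids):
        img_tok = self.vision_encoder(image).unsqueeze(1)        # (B, 1, D)
        txt_tok = self.token_embed(instruction_ids)              # (B, T, D)
        fused = self.backbone(torch.cat([img_tok, txt_tok], 1))  # (B, 1+T, D)
        return self.action_head(fused[:, 0])                     # (B, action_dim)

model = TinyVLA()
image = torch.randn(1, 3, 224, 224)                  # one RGB camera frame
instruction_ids = torch.randint(0, 32_000, (1, 12))  # tokenized instruction
action = model(image, instruction_ids)               # 7-DoF end-effector delta
print(action.shape)  # torch.Size([1, 7])
```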

Action Space

The standard action space is a 7-DoF end-effector delta; an example vector follows the list:

  • x, y, z — Position delta
  • roll, pitch, yaw — Orientation delta
  • gripper — Gripper state (0=closed, 1=open)
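
To make the layout concrete, here is a small illustrative vector. The units (meters and radians per control step) and the 0.5 gripper threshold are assumptions; conventions vary across robots and datasets, and actions are often normalized.

```python
import numpy as np

# Illustrative 7-DoF action in the order listed above (units assumed).
action = np.array([
    0.01, 0.00, -0.02,  # x, y, z position delta (meters, per control step)
    0.00, 0.00, 0.10,   # roll, pitch, yaw orientation delta (radians)
    1.0,                # gripper state (0 = closed, 1 = open)
])

delta_pos = action[:3]
delta_rot = action[3:6]
gripper_open = action[6] > 0.5  # threshold a continuous prediction
```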

Training Methods

  • LoRA — Low-Rank Adaptation. Freezes the base weights and trains only small low-rank adapter matrices injected into selected layers (see the sketch below).
  • QLoRA — Quantized LoRA. Keeps the frozen base model in 4-bit precision while training LoRA adapters, further reducing VRAM.
  • Full — Fine-tunes all parameters. Highest quality ceiling, but the largest VRAM footprint.
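
As a hedged sketch of how LoRA and QLoRA are commonly wired up with the Hugging Face transformers and peft libraries: the checkpoint name, target modules, and hyperparameters below are illustrative choices, not this project's settings.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit (NF4) to cut VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",            # example checkpoint, swap in your own
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# LoRA: train only small rank-r adapters on selected projections.
lora_config = LoraConfig(
    r=16,                            # adapter rank
    lora_alpha=32,                   # scaling factor
    target_modules=["q_proj", "v_proj"],  # illustrative target layers
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% trainable
```

Omitting the quantization_config gives plain LoRA on a full-precision base model; the same training loop works for either.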