Core Concepts
What is VLA?
Vision-Language-Action (VLA) models are multimodal foundation models that combine vision, language, and robotic action capabilities. Given a camera image and a text instruction, a VLA directly outputs low-level robot actions.
Architecture
A typical VLA consists of three components (see the sketch after this list):
- Vision Encoder — Processes camera images (e.g., SigLIP, DINOv2, Qwen2ViT)
- Language Model — Processes text instructions (e.g., Llama 2, Dream-7B)
- Action Head — Outputs robot actions, either as discrete action tokens or via a continuous flow-matching head
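As a minimal sketch of how these pieces compose, the skeleton below wires a stand-in vision encoder, language backbone, and continuous action head together. All module choices and dimensions here are illustrative assumptions, not the architecture of any specific VLA:

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Illustrative VLA skeleton: vision encoder + language model + action head.
    Every module is a toy stand-in for the real backbones named above."""

    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        # Stand-in for a ViT-style vision encoder (e.g., SigLIP, DINOv2)
        self.vision_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 224 * 224, d_model)
        )
        # Stand-in for the language-model backbone
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Continuous action head: regresses the 7-DoF delta directly
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, text_embeds):
        # image: (B, 3, 224, 224); text_embeds: (B, T, d_model) pre-embedded tokens
        vis_tokens = self.vision_encoder(image).unsqueeze(1)  # (B, 1, d_model)
        tokens = torch.cat([vis_tokens, text_embeds], dim=1)  # fuse modalities
        fused = self.language_model(tokens)                   # (B, 1+T, d_model)
        return self.action_head(fused[:, -1])                 # (B, action_dim)
```

A call like `TinyVLA()(image_batch, instruction_embeddings)` returns a `(batch, 7)` action tensor; a real VLA replaces each stand-in with the pretrained backbones listed above and tokenizes the instruction with the language model's own tokenizer.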
Action Space
The standard action space is a 7-DoF end-effector delta (see the worked example after this list):
- x, y, z — Position delta
- roll, pitch, yaw — Orientation delta
- gripper — Gripper state (0=closed, 1=open)
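The snippet below makes the vector layout concrete and applies one delta action to a pose. The index names and the `apply_action` helper are hypothetical conveniences for illustration:

```python
import numpy as np

# Index layout of the 7-DoF delta action, matching the list above
DX, DY, DZ, DROLL, DPITCH, DYAW, GRIPPER = range(7)

def apply_action(pose, action):
    """Apply a delta action to an end-effector pose.
    pose: [x, y, z, roll, pitch, yaw, gripper], angles in radians.
    Note: adding Euler-angle deltas element-wise is a simplification; real
    controllers compose rotations properly (e.g., via quaternions)."""
    new_pose = pose.copy()
    new_pose[:6] += action[:6]                         # position + orientation deltas
    new_pose[GRIPPER] = round(float(action[GRIPPER]))  # snap to 0 (closed) / 1 (open)
    return new_pose

pose = np.zeros(7)
action = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.05, 1.0])  # move and open gripper
print(apply_action(pose, action))
```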
Training Methods
- LoRA — Low-Rank Adaptation. Freezes the base model and trains small low-rank adapter matrices (see the config sketch after this list).
- QLoRA — Quantized LoRA. Loads the frozen base model in 4-bit precision to reduce VRAM while training LoRA adapters.
- Full — Fine-tunes all parameters. Typically the best quality but the highest VRAM requirement.
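The sketch below shows what a LoRA and QLoRA setup could look like with the Hugging Face `peft` and `transformers` libraries. The checkpoint name is a placeholder, and the `target_modules` list and hyperparameters are assumptions that vary by backbone:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "your-org/your-vla-checkpoint"  # placeholder; substitute a real checkpoint

# QLoRA: load the base model in 4-bit NF4 to cut VRAM, then attach LoRA adapters.
# Drop quantization_config to do plain (non-quantized) LoRA instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, quantization_config=bnb_config)

# LoRA: train low-rank adapters on the attention projections only
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names depend on the backbone
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```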