Core Concepts
What is VLA?
Vision-Language-Action (VLA) models are multimodal foundation models that combine vision, language, and robotic action capabilities. Given a camera image and a text instruction, a VLA directly outputs low-level robot actions.
Architecture
A typical VLA consists of three components (see the sketch after this list):
- Vision Encoder — Processes camera images (e.g., SigLIP, DINOv2, Qwen2ViT)
- Language Model — Processes text instructions (e.g., Llama 2, Dream-7B)
- Action Head — Outputs robot actions, either as discrete action tokens or via a continuous flow-matching head
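As a minimal sketch of how these pieces compose, the skeleton below wires a stand-in vision encoder, language backbone, and continuous action head together. All module choices and dimensions here are illustrative assumptions, not the architecture of any specific VLA:

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Illustrative VLA skeleton: vision encoder + language model + action head.
    Every module is a toy stand-in for the real backbones named above."""

    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        # Stand-in for a ViT-style vision encoder (e.g., SigLIP, DINOv2)
        self.vision_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 224 * 224, d_model)
        )
        # Stand-in for the language-model backbone
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Continuous action head: regresses the 7-DoF delta directly
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, text_embeds):
        # image: (B, 3, 224, 224); text_embeds: (B, T, d_model) pre-embedded tokens
        vis_tokens = self.vision_encoder(image).unsqueeze(1)  # (B, 1, d_model)
        tokens = torch.cat([vis_tokens, text_embeds], dim=1)  # fuse modalities
        fused = self.language_model(tokens)                   # (B, 1+T, d_model)
        return self.action_head(fused[:, -1])                 # (B, action_dim)
```

A call like `TinyVLA()(image_batch, instruction_embeddings)` returns a `(batch, 7)` action tensor; a real VLA replaces each stand-in with the pretrained backbones listed above and tokenizes the instruction with the language model's own tokenizer.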
Action Space
The standard action space is a 7-DoF end-effector delta (see the worked example after this list):
- x, y, z — Position delta
- roll, pitch, yaw — Orientation delta
- gripper — Gripper state (0=closed, 1=open)
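The snippet below makes the vector layout concrete and applies one delta action to a pose. The index names and the `apply_action` helper are hypothetical conveniences for illustration:

```python
import numpy as np

# Index layout of the 7-DoF delta action, matching the list above
DX, DY, DZ, DROLL, DPITCH, DYAW, GRIPPER = range(7)

def apply_action(pose, action):
    """Apply a delta action to an end-effector pose.
    pose: [x, y, z, roll, pitch, yaw, gripper], angles in radians.
    Note: adding Euler-angle deltas element-wise is a simplification; real
    controllers compose rotations properly (e.g., via quaternions)."""
    new_pose = pose.copy()
    new_pose[:6] += action[:6]                         # position + orientation deltas
    new_pose[GRIPPER] = round(float(action[GRIPPER]))  # snap to 0 (closed) / 1 (open)
    return new_pose

pose = np.zeros(7)
action = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.05, 1.0])  # move and open gripper
print(apply_action(pose, action))
```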
Training Methods
- LoRA — Low-Rank Adaptation. Freezes the base model and trains small low-rank adapter matrices (see the config sketch after this list).
- QLoRA — Quantized LoRA. Loads the frozen base model in 4-bit precision to reduce VRAM while training LoRA adapters.
- Full — Fine-tunes all parameters. Typically the best quality but the highest VRAM requirement.
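The sketch below shows what a LoRA and QLoRA setup could look like with the Hugging Face `peft` and `transformers` libraries. The checkpoint name is a placeholder, and the `target_modules` list and hyperparameters are assumptions that vary by backbone:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "your-org/your-vla-checkpoint"  # placeholder; substitute a real checkpoint

# QLoRA: load the base model in 4-bit NF4 to cut VRAM, then attach LoRA adapters.
# Drop quantization_config to do plain (non-quantized) LoRA instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, quantization_config=bnb_config)

# LoRA: train low-rank adapters on the attention projections only
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names depend on the backbone
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```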