
Video Backbone

The frozen brain. A video foundation model that already understands how the physical world moves — we just give it eyes and connect it to a body.


Supported Models

Six backbones, two families, one interface:

Model                        | Params | Hidden Size | VRAM (4-bit) | Character
-----------------------------|--------|-------------|--------------|-----------------------------------
Qwen/Qwen2.5-VL-3B-Instruct  | 3B     | 2048        | ~4 GB        | Fast prototyping, Jetson-friendly
Qwen/Qwen2.5-VL-7B-Instruct  | 7B     | 3584        | ~8 GB        | Production quality
Qwen/Qwen2.5-Omni-3B         | 3B     | 2048        | ~4 GB        | Audio-native — hears and speaks
Qwen/Qwen2.5-Omni-7B         | 7B     | 3584        | ~8 GB        | Full multimodal — the flagship
nvidia/Cosmos-Reason2-2B     | 2B     | 1536        | ~3 GB        | Edge physics reasoning
nvidia/Cosmos-Reason2-8B     | 8B     | 4096        | ~10 GB       | Deep physical world modeling
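
All six models go through the same BackboneConfig, so swapping backbones is a one-field change. A minimal sketch, assuming the remaining fields fall back to the defaults shown in the Configuration section below:

from neon.model.video_backbone import BackboneConfig

# Jetson / prototyping: ~4 GB in 4-bit, pooled features of size 2048
edge_config = BackboneConfig(model_id="Qwen/Qwen2.5-VL-3B-Instruct")

# Flagship multimodal: ~8 GB in 4-bit, pooled features of size 3584
flagship_config = BackboneConfig(model_id="Qwen/Qwen2.5-Omni-7B")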

Configuration

from neon.model.video_backbone import BackboneConfig, VideoBackbone

config = BackboneConfig(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype="bfloat16",
    device_map="auto",
    attn_implementation="eager",  # "flash_attention_2" if installed
    fps=4,                         # Video frame rate for temporal input
    max_frames=16,                 # Max frames per video clip
    image_size=224,
    freeze_backbone=True,          # Freeze for action head training
    load_in_4bit=True,             # 4-bit NF4 quantization
    use_double_quant=True,         # Extra savings via double quantization
)

Key Decisions

freeze_backbone = True (default)
Only action heads train. Fast, data-efficient, recommended unless you have >100K episodes and 40+ GB VRAM.
freeze_backbone = False
Backbone also trains via LoRA. More expressive but needs significantly more data and compute.
load_in_4bit = True
Quantizes backbone weights to 4-bit NF4. Reduces VRAM by ~4× with minimal quality loss. Essential for consumer GPUs.
fps = 4
How many frames per second when processing video. Higher = more temporal detail, more compute. 4 FPS is a good default for 50 Hz control.

Loading and Encoding

The backbone loads lazily — no weights downloaded until you need them:

backbone = VideoBackbone(config)
# Nothing downloaded yet — instant creation

backbone.load()
# Now the model is in GPU memory

# Encode a single image + instruction
result = backbone.encode(
    images=[pil_image],
    text="Pick up the red cup",
)
# result["hidden_states"]: (1, seq_len, hidden_size) — full token features
# result["pooled"]:        (1, hidden_size) — mean-pooled representation

Video Input — Temporal Reasoning

For tasks that require understanding motion, pass multiple frames:

result = backbone.encode(
    video_frames=[frame_t3, frame_t2, frame_t1, frame_t0],
    text="Catch the moving ball",
)
# The backbone processes all frames with temporal attention
# Motion, velocity, trajectory — all captured in the features
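
In a live control loop the frames usually come from a rolling buffer sized to max_frames and refreshed at the configured fps. A minimal sketch, where grab_frame() stands in for whatever camera read you use:

import time
from collections import deque

FPS = 4          # matches config.fps
MAX_FRAMES = 16  # matches config.max_frames

frame_buffer = deque(maxlen=MAX_FRAMES)   # oldest frames drop off automatically

while True:
    frame_buffer.append(grab_frame())     # grab_frame() is a placeholder, not part of the library
    if len(frame_buffer) >= 2:            # need at least two frames to see motion
        result = backbone.encode(
            video_frames=list(frame_buffer),
            text="Catch the moving ball",
        )
    time.sleep(1.0 / FPS)                 # sample at the configured frame rate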

How Features Flow to Actions

backbone.encode(images, text)
    pooled: (1, 2048)           ← All visual + language understanding compressed
    concat(pooled, proprio_features, audio_features)
    fusion: Linear → GELU → Dropout
    action_heads: (1, 2048) → (1, 16, 17)   ← 16 future timesteps
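
The fusion and head stages above amount to a few lines of PyTorch. A minimal sketch with illustrative sizes; the proprioception and audio dimensions, dropout rate, and 17-dim action space are assumptions, not library values:

import torch
import torch.nn as nn

HIDDEN = 2048                       # pooled feature size for the 3B backbones
PROPRIO_DIM, AUDIO_DIM = 32, 128    # assumed sizes for the other modalities
HORIZON, ACT_DIM = 16, 17           # 16 future timesteps, 17 action dimensions

fusion = nn.Sequential(
    nn.Linear(HIDDEN + PROPRIO_DIM + AUDIO_DIM, HIDDEN),
    nn.GELU(),
    nn.Dropout(0.1),
)
action_head = nn.Linear(HIDDEN, HORIZON * ACT_DIM)

pooled = torch.randn(1, HIDDEN)             # stands in for backbone.encode(...)["pooled"]
proprio = torch.randn(1, PROPRIO_DIM)
audio = torch.randn(1, AUDIO_DIM)

fused = fusion(torch.cat([pooled, proprio, audio], dim=-1))   # (1, 2048)
actions = action_head(fused).view(1, HORIZON, ACT_DIM)        # (1, 16, 17)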

Why pooled, not full hidden states?

Token-level features are variable-length and rich, but the action heads need a fixed-size input. Mean pooling compresses all understanding into one vector. Future work: cross-attention between action queries and backbone tokens (DETR-style).
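
Both options are easy to express against the tensors encode() returns. Mean pooling is a single reduction; the DETR-style direction would let learned action queries cross-attend over the full token sequence. The second half of this sketch illustrates that future direction and is not the current implementation:

import torch
import torch.nn as nn

hidden_states = result["hidden_states"]       # (1, seq_len, hidden_size)
hidden_size = hidden_states.size(-1)

# Today: mean pooling collapses every token into one fixed-size vector.
pooled = hidden_states.mean(dim=1)            # (1, hidden_size)

# Possible future: learned queries attend over all tokens (DETR-style),
# giving each of the 16 future timesteps its own view of the sequence.
num_queries = 16
action_queries = nn.Parameter(torch.randn(1, num_queries, hidden_size))
cross_attn = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
per_step_features, _ = cross_attn(action_queries, hidden_states, hidden_states)
# per_step_features: (1, 16, hidden_size), one feature vector per future timestep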


LoRA Fine-Tuning (Advanced)

When you need domain-specific visual understanding — underwater, industrial, surgical — unfreeze the backbone with LoRA:

config = NeonConfig(
    backbone=BackboneConfig(
        model_id="Qwen/Qwen2.5-VL-7B-Instruct",
        freeze_backbone=False,  # Unfreeze
    ),
)

train_config = TrainConfig(
    use_lora=True,
    lora_r=32,
    lora_alpha=64,
    lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

LoRA adds roughly 0.5% of the backbone's parameters as new trainable weights. Only use this when:

  • Your visual environment is very different from internet video
  • You have 100K+ episodes of training data
  • You have A100-class GPUs with 40+ GB VRAM

Next: Action Heads — the decoders that turn features into movement