Video Backbone¶
The frozen brain. A video foundation model that already understands how the physical world moves — we just give it eyes and connect it to a body.
Supported Models¶
Six backbones, two families, one interface:
| Model | Params | Hidden Size | VRAM (4-bit) | Character |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-3B-Instruct | 3B | 2048 | ~4 GB | Fast prototyping, Jetson-friendly |
| Qwen/Qwen2.5-VL-7B-Instruct | 7B | 3584 | ~8 GB | Production quality |
| Qwen/Qwen2.5-Omni-3B | 3B | 2048 | ~4 GB | Audio-native — hears and speaks |
| Qwen/Qwen2.5-Omni-7B | 7B | 3584 | ~8 GB | Full multimodal — the flagship |
| nvidia/Cosmos-Reason2-2B | 2B | 1536 | ~3 GB | Edge physics reasoning |
| nvidia/Cosmos-Reason2-8B | 8B | 4096 | ~10 GB | Deep physical world modeling |
Configuration¶
from neon.model.video_backbone import BackboneConfig, VideoBackbone
config = BackboneConfig(
model_id="Qwen/Qwen2.5-VL-3B-Instruct",
torch_dtype="bfloat16",
device_map="auto",
attn_implementation="eager", # "flash_attention_2" if installed
fps=4, # Video frame rate for temporal input
max_frames=16, # Max frames per video clip
image_size=224,
freeze_backbone=True, # Freeze for action head training
load_in_4bit=True, # 4-bit NF4 quantization
use_double_quant=True, # Extra savings via double quantization
)
Key Decisions¶
- freeze_backbone = True (default): only the action heads train. Fast and data-efficient; recommended unless you have >100K episodes and 40+ GB of VRAM.
- freeze_backbone = False: the backbone also trains via LoRA. More expressive, but needs significantly more data and compute.
- load_in_4bit = True: quantizes backbone weights to 4-bit NF4, reducing VRAM by ~4× with minimal quality loss. Essential for consumer GPUs (see the quantization sketch after this list).
- fps = 4: how many frames per second to sample when processing video. Higher means more temporal detail but more compute; 4 FPS is a good default for 50 Hz control.
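For intuition, here is a rough sketch of how the load_in_4bit and use_double_quant flags typically map onto Hugging Face bitsandbytes settings. That VideoBackbone wires them through exactly this path is an assumption:

import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with double quantization, mirroring the flags above
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16
)
# Passed as quantization_config=... to the underlying from_pretrained() call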
Loading and Encoding¶
The backbone loads lazily — no weights downloaded until you need them:
backbone = VideoBackbone(config)
# Nothing downloaded yet — instant creation
backbone.load()
# Now the model is in GPU memory
# Encode a single image + instruction
result = backbone.encode(
images=[pil_image],
text="Pick up the red cup",
)
# result["hidden_states"]: (1, seq_len, hidden_size) — full token features
# result["pooled"]: (1, hidden_size) — mean-pooled representation
Video Input — Temporal Reasoning¶
For tasks that require understanding motion, pass multiple frames:
result = backbone.encode(
video_frames=[frame_t3, frame_t2, frame_t1, frame_t0],
text="Catch the moving ball",
)
# The backbone processes all frames with temporal attention
# Motion, velocity, trajectory — all captured in the features
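If you start from a raw clip rather than hand-picked frames, a small helper can subsample it to match fps and max_frames. This helper is hypothetical, not part of the API, and clip_frames stands for whatever list of PIL frames you have already decoded:

def sample_frames(clip_frames, source_fps=30, target_fps=4, max_frames=16):
    """Uniformly subsample a list of PIL frames from source_fps down to target_fps."""
    step = max(1, round(source_fps / target_fps))
    return clip_frames[::step][:max_frames]

result = backbone.encode(
    video_frames=sample_frames(clip_frames),
    text="Catch the moving ball",
)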
How Features Flow to Actions¶
backbone.encode(images, text)
│
▼
pooled: (1, 2048) ← All visual + language understanding compressed
│
▼
concat(pooled, proprio_features, audio_features)
│
▼
fusion: Linear → GELU → Dropout
│
▼
action_heads: (1, 2048) → (1, 16, 17) ← 16 future timesteps
Why pooled, not full hidden states?
Token-level features are variable-length and rich, but the action heads need a fixed-size input. Mean pooling compresses all understanding into one vector. Future work: cross-attention between action queries and backbone tokens (DETR-style).
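As a concrete illustration of the diagram above, here is a minimal PyTorch sketch of the fusion and action-head path. The proprio and audio dimensions, the dropout rate, and the module layout are assumptions for illustration only:

import torch
import torch.nn as nn

hidden, proprio_dim, audio_dim = 2048, 64, 128   # 2048 matches the 3B backbones
horizon, action_dim = 16, 17                     # 16 future timesteps, 17-D actions

fusion = nn.Sequential(
    nn.Linear(hidden + proprio_dim + audio_dim, hidden),
    nn.GELU(),
    nn.Dropout(0.1),
)
action_head = nn.Linear(hidden, horizon * action_dim)

pooled = torch.randn(1, hidden)         # stand-in for backbone.encode(...)["pooled"]
proprio = torch.randn(1, proprio_dim)   # stand-in for proprio_features
audio = torch.randn(1, audio_dim)       # stand-in for audio_features

fused = fusion(torch.cat([pooled, proprio, audio], dim=-1))  # (1, 2048)
actions = action_head(fused).view(1, horizon, action_dim)    # (1, 16, 17)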
LoRA Fine-Tuning (Advanced)¶
When you need domain-specific visual understanding — underwater, industrial, surgical — unfreeze the backbone with LoRA:
config = NeonConfig(
backbone=BackboneConfig(
model_id="Qwen/Qwen2.5-VL-7B-Instruct",
freeze_backbone=False, # Unfreeze
),
)
train_config = TrainConfig(
use_lora=True,
lora_r=32,
lora_alpha=64,
lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
LoRA adds ~0.5% trainable parameters. Only use this when:
- Your visual environment is very different from internet video
- You have 100K+ episodes of training data
- You have A100-class GPUs with 40+ GB VRAM
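For intuition, this is roughly what LoRA adapter injection looks like with the Hugging Face peft library. The backbone.model attribute and the dropout value are assumptions; TrainConfig handles this wiring for you:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumed value, not taken from TrainConfig
    bias="none",
)
peft_model = get_peft_model(backbone.model, lora_config)  # backbone.model: assumed HF module
peft_model.print_trainable_parameters()  # reports the ~0.5% trainable figure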
→ Next: Action Heads — the decoders that turn features into movement