
Video Backbone

The frozen brain. A video foundation model that already understands how the physical world moves — we just give it eyes and connect it to a body.


Supported Models

Six backbones, two families, one interface:

Model                        | Params | Hidden Size | VRAM (4-bit) | Character
-----------------------------|--------|-------------|--------------|-----------------------------------
Qwen/Qwen2.5-VL-3B-Instruct  | 3B     | 2048        | ~4 GB        | Fast prototyping, Jetson-friendly
Qwen/Qwen2.5-VL-7B-Instruct  | 7B     | 3584        | ~8 GB        | Production quality
Qwen/Qwen2.5-Omni-3B         | 3B     | 2048        | ~4 GB        | Audio-native — hears and speaks
Qwen/Qwen2.5-Omni-7B         | 7B     | 3584        | ~8 GB        | Full multimodal — the flagship
nvidia/Cosmos-Reason2-2B     | 2B     | 1536        | ~3 GB        | Edge physics reasoning
nvidia/Cosmos-Reason2-8B     | 8B     | 4096        | ~10 GB       | Deep physical world modeling
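
All six models go through the same BackboneConfig, so swapping backbones is a one-field change. A minimal sketch, assuming the remaining fields fall back to the defaults shown in the Configuration section below:

from neon.model.video_backbone import BackboneConfig

# Jetson / prototyping: ~4 GB in 4-bit, pooled features of size 2048
edge_config = BackboneConfig(model_id="Qwen/Qwen2.5-VL-3B-Instruct")

# Flagship multimodal: ~8 GB in 4-bit, pooled features of size 3584
flagship_config = BackboneConfig(model_id="Qwen/Qwen2.5-Omni-7B")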

Configuration

from neon.model.video_backbone import BackboneConfig, VideoBackbone

config = BackboneConfig(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype="bfloat16",
    device_map="auto",
    attn_implementation="eager",  # "flash_attention_2" if installed
    fps=4,                         # Video frame rate for temporal input
    max_frames=16,                 # Max frames per video clip
    image_size=224,
    freeze_backbone=True,          # Freeze for action head training
    load_in_4bit=True,             # 4-bit NF4 quantization
    use_double_quant=True,         # Extra savings via double quantization
)

Key Decisions

freeze_backbone = True (default)
Only action heads train. Fast, data-efficient, recommended unless you have >100K episodes and 40+ GB VRAM.
freeze_backbone = False
Backbone also trains via LoRA. More expressive but needs significantly more data and compute.
load_in_4bit = True
Quantizes backbone weights to 4-bit NF4. Reduces VRAM by ~4× with minimal quality loss. Essential for consumer GPUs.
fps = 4
How many frames per second when processing video. Higher = more temporal detail, more compute. 4 FPS is a good default for 50 Hz control.

Loading and Encoding

The backbone loads lazily — no weights downloaded until you need them:

backbone = VideoBackbone(config)
# Nothing downloaded yet — instant creation

backbone.load()
# Now the model is in GPU memory

# Encode a single image + instruction
result = backbone.encode(
    images=[pil_image],
    text="Pick up the red cup",
)
# result["hidden_states"]: (1, seq_len, hidden_size) — full token features
# result["pooled"]:        (1, hidden_size) — mean-pooled representation

Video Input — Temporal Reasoning

For tasks that require understanding motion, pass multiple frames:

result = backbone.encode(
    video_frames=[frame_t3, frame_t2, frame_t1, frame_t0],
    text="Catch the moving ball",
)
# The backbone processes all frames with temporal attention
# Motion, velocity, trajectory — all captured in the features
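
In a live control loop the frames usually come from a rolling buffer sized to max_frames and refreshed at the configured fps. A minimal sketch, where grab_frame() stands in for whatever camera read you use:

import time
from collections import deque

FPS = 4          # matches config.fps
MAX_FRAMES = 16  # matches config.max_frames

frame_buffer = deque(maxlen=MAX_FRAMES)   # oldest frames drop off automatically

while True:
    frame_buffer.append(grab_frame())     # grab_frame() is a placeholder, not part of the library
    if len(frame_buffer) >= 2:            # need at least two frames to see motion
        result = backbone.encode(
            video_frames=list(frame_buffer),
            text="Catch the moving ball",
        )
    time.sleep(1.0 / FPS)                 # sample at the configured frame rate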

How Features Flow to Actions

backbone.encode(images, text)
    pooled: (1, 2048)           ← All visual + language understanding compressed
    concat(pooled, proprio_features, audio_features)
    fusion: Linear → GELU → Dropout
    action_heads: (1, 2048) → (1, 16, 17)   ← 16 future timesteps
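
The fusion and head stages above amount to a few lines of PyTorch. A minimal sketch with illustrative sizes; the proprioception and audio dimensions, dropout rate, and 17-dim action space are assumptions, not library values:

import torch
import torch.nn as nn

HIDDEN = 2048                       # pooled feature size for the 3B backbones
PROPRIO_DIM, AUDIO_DIM = 32, 128    # assumed sizes for the other modalities
HORIZON, ACT_DIM = 16, 17           # 16 future timesteps, 17 action dimensions

fusion = nn.Sequential(
    nn.Linear(HIDDEN + PROPRIO_DIM + AUDIO_DIM, HIDDEN),
    nn.GELU(),
    nn.Dropout(0.1),
)
action_head = nn.Linear(HIDDEN, HORIZON * ACT_DIM)

pooled = torch.randn(1, HIDDEN)             # stands in for backbone.encode(...)["pooled"]
proprio = torch.randn(1, PROPRIO_DIM)
audio = torch.randn(1, AUDIO_DIM)

fused = fusion(torch.cat([pooled, proprio, audio], dim=-1))   # (1, 2048)
actions = action_head(fused).view(1, HORIZON, ACT_DIM)        # (1, 16, 17)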

Why pooled, not full hidden states?

Token-level features are variable-length and rich, but the action heads need a fixed-size input. Mean pooling compresses all understanding into one vector. Future work: cross-attention between action queries and backbone tokens (DETR-style).
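
Both options are easy to express against the tensors encode() returns. Mean pooling is a single reduction; the DETR-style direction would let learned action queries cross-attend over the full token sequence. The second half of this sketch illustrates that future direction and is not the current implementation:

import torch
import torch.nn as nn

hidden_states = result["hidden_states"]       # (1, seq_len, hidden_size)
hidden_size = hidden_states.size(-1)

# Today: mean pooling collapses every token into one fixed-size vector.
pooled = hidden_states.mean(dim=1)            # (1, hidden_size)

# Possible future: learned queries attend over all tokens (DETR-style),
# giving each of the 16 future timesteps its own view of the sequence.
num_queries = 16
action_queries = nn.Parameter(torch.randn(1, num_queries, hidden_size))
cross_attn = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
per_step_features, _ = cross_attn(action_queries, hidden_states, hidden_states)
# per_step_features: (1, 16, hidden_size), one feature vector per future timestep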


LoRA Fine-Tuning (Advanced)

When you need domain-specific visual understanding — underwater, industrial, surgical — unfreeze the backbone with LoRA:

config = NeonConfig(
    backbone=BackboneConfig(
        model_id="Qwen/Qwen2.5-VL-7B-Instruct",
        freeze_backbone=False,  # Unfreeze
    ),
)

train_config = TrainConfig(
    use_lora=True,
    lora_r=32,
    lora_alpha=64,
    lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

LoRA adds roughly 0.5% of the backbone's parameters as new trainable weights. Only use this when:

  • Your visual environment is very different from internet video
  • You have 100K+ episodes of training data
  • You have A100-class GPUs with 40+ GB VRAM

Next: Action Heads — the decoders that turn features into movement