Brain-Inspired Fusion (TRIBEv2)¶
Neon borrows neuroscience techniques from Meta's TRIBEv2 — a foundation model that predicts fMRI brain responses to video, audio, and text.
Why Brain Science for Robots?¶
The human brain is the best multimodal fusion system we know. TRIBEv2 reverse-engineers how the brain combines vision, hearing, and language. We port its key techniques to make Neon's sensor fusion more robust.
Techniques¶
1. Temporal Smoothing¶
import torch
from neon.model.brain_encoder import TemporalSmoothing

smoother = TemporalSmoothing(dim=17, kernel_size=9, sigma=2.0)
raw_actions = torch.randn(4, 16, 17)    # Input: (batch, 16, 17) — 16 timesteps, 17 action dims
smooth_actions = smoother(raw_actions)  # Output: (batch, 16, 17) — smoothed action trajectory
What: Gaussian 1D convolution across the time axis.
Why: Suppresses jitter between consecutive action predictions, so the robot moves smoothly instead of oscillating.
From TRIBEv2: Used to smooth fMRI predictions across TRs (repetition times).
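Under the hood this reduces to a fixed-weight depthwise 1D convolution over the time axis. A minimal sketch of the idea, assuming symmetric padding and a normalized kernel (the class below is illustrative, not Neon's actual TemporalSmoothing):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianSmoothing1D(nn.Module):
    """Fixed Gaussian kernel applied independently to each action dim."""
    def __init__(self, dim: int, kernel_size: int = 9, sigma: float = 2.0):
        super().__init__()
        half = (kernel_size - 1) / 2
        x = torch.arange(kernel_size, dtype=torch.float32) - half
        kernel = torch.exp(-0.5 * (x / sigma) ** 2)
        kernel = kernel / kernel.sum()  # normalize so output magnitude is preserved
        # One copy of the kernel per channel -> depthwise (grouped) convolution
        self.register_buffer("weight", kernel.view(1, 1, -1).repeat(dim, 1, 1))
        self.groups = dim
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); convolve along the time axis
        x = x.transpose(1, 2)  # (batch, dim, time)
        x = F.conv1d(x, self.weight, padding=self.pad, groups=self.groups)
        return x.transpose(1, 2)  # back to (batch, time, dim)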
2. Modality Dropout¶
from neon.model.brain_encoder import ModalityDropout

mod_drop = ModalityDropout(p=0.15)

features = {
    "vision": vision_feat,    # (batch, 2048)
    "proprio": proprio_feat,  # (batch, 256)
    "audio": audio_feat,      # (batch, 512)
    "lidar": lidar_feat,      # (batch, 256)
}

# During training: randomly zeros entire modalities
robust_features = mod_drop(features)
What: Randomly zero out entire modality feature vectors (not individual elements).
Why: Forces the model to learn redundant cross-modal representations. If the camera is occluded, the robot uses proprioception + LiDAR. If audio is noisy, it relies on vision.
From TRIBEv2: modality_dropout parameter in FmriEncoder.
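A minimal sketch of the mechanism, assuming one Bernoulli keep/drop decision per sample per modality, applied only during training (the real ModalityDropout may additionally guarantee that at least one modality survives):

import torch
import torch.nn as nn

class ModalityDropoutSketch(nn.Module):
    """Zero whole modality feature vectors with probability p (training only)."""
    def __init__(self, p: float = 0.15):
        super().__init__()
        self.p = p

    def forward(self, features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        if not self.training or self.p == 0.0:
            return features
        out = {}
        for name, feat in features.items():
            # One keep/drop decision per sample, broadcast over all remaining axes
            shape = (feat.shape[0],) + (1,) * (feat.dim() - 1)
            keep = (torch.rand(shape, device=feat.device) >= self.p).float()
            out[name] = feat * keep
        return out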
3. Embodiment Layers¶
import torch
from neon.model.brain_encoder import EmbodimentLayers

emb_layers = EmbodimentLayers(
    in_channels=2048,
    out_channels=2048,
    num_embodiments=8,
)

# Per-robot adaptation: G1 gets a different scale+shift than Franka
g1_features = emb_layers(features, embodiment_id=torch.tensor([0]))
franka_features = emb_layers(features, embodiment_id=torch.tensor([3]))
What: Per-embodiment affine transform (scale γ + shift β) on features.
Why: One base model works across robot types. Each embodiment gets minimal per-robot adaptation (~4K params) instead of full fine-tuning.
From TRIBEv2: SubjectLayers — each brain (subject) gets its own affine transform.
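Conceptually this is a learned per-embodiment channel-wise affine transform, initialized to the identity. A sketch under that assumption, for the in_channels == out_channels case shown above (illustrative, not Neon's actual EmbodimentLayers):

import torch
import torch.nn as nn

class EmbodimentLayersSketch(nn.Module):
    """Per-embodiment scale (gamma) and shift (beta) on feature channels."""
    def __init__(self, channels: int, num_embodiments: int):
        super().__init__()
        # gamma = 1, beta = 0 at init: every embodiment starts at the identity
        self.gamma = nn.Parameter(torch.ones(num_embodiments, channels))
        self.beta = nn.Parameter(torch.zeros(num_embodiments, channels))

    def forward(self, x: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        g, b = self.gamma[embodiment_id], self.beta[embodiment_id]  # (batch, channels)
        if x.dim() == 3:  # (batch, time, channels): broadcast over time
            g, b = g.unsqueeze(1), b.unsqueeze(1)
        return x * g + b

With 2048 channels, γ and β together come to 2 × 2048 ≈ 4K parameters per embodiment, which matches the figure quoted above.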
4. Temporal Dropout¶
import torch
from neon.model.brain_encoder import TemporalDropout

temp_drop = TemporalDropout(p=0.1)
sequence_features = torch.randn(4, 16, 512)     # (batch, time, dim)

# Randomly zeros entire timesteps — simulates dropped frames
robust_sequence = temp_drop(sequence_features)  # (batch, time, dim)
What: Randomly zero entire timesteps in the input sequence.
Why: Simulates dropped camera frames, network delays, sensor lag. Teaches the action chunking head to interpolate gaps.
From TRIBEv2: temporal_dropout parameter in FmriEncoder.
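A minimal sketch, assuming an independent keep/drop decision per (sample, timestep) during training (illustrative, not Neon's actual TemporalDropout):

import torch
import torch.nn as nn

class TemporalDropoutSketch(nn.Module):
    """Zero whole timesteps of a (batch, time, dim) sequence with probability p."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        batch, time, _ = x.shape
        # One keep/drop decision per (sample, timestep), broadcast over features
        keep = (torch.rand(batch, time, 1, device=x.device) >= self.p).float()
        return x * keep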
5. BrainFusion Module¶
All techniques combined into a drop-in replacement for the simple linear fusion:
import torch
from neon.model.brain_encoder import BrainFusion

fusion = BrainFusion(
    modality_dims={
        "vision": 2048,
        "proprio": 256,
        "audio": 512,
        "lidar": 256,
        "eef": 128,
    },
    output_dim=2048,
    modality_dropout=0.15,
    temporal_dropout=0.05,
    temporal_smoothing=True,
    smoothing_kernel=5,
    smoothing_sigma=1.5,
    num_embodiments=8,
)

fused = fusion(
    modality_features={"vision": v, "proprio": p, "audio": a, "lidar": l, "eef": e},
    embodiment_id=torch.tensor([0]),  # G1
)
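How the pieces compose can be sketched by reusing the illustrative classes from the sections above. Everything here is an assumption about ordering and fusion style: per-modality features are taken as (batch, time, dim), each modality is projected to a shared width, and fusion is a plain sum (the real BrainFusion may concatenate or attend instead):

import torch
import torch.nn as nn

class BrainFusionSketch(nn.Module):
    """Illustrative composition of the techniques above, not Neon's exact module."""
    def __init__(self, modality_dims: dict[str, int], output_dim: int = 2048,
                 modality_dropout: float = 0.15, temporal_dropout: float = 0.05,
                 num_embodiments: int = 8, smoothing_kernel: int = 5,
                 smoothing_sigma: float = 1.5):
        super().__init__()
        # One linear projection per modality into the shared fusion width
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, output_dim) for name, dim in modality_dims.items()}
        )
        self.mod_drop = ModalityDropoutSketch(modality_dropout)
        self.temp_drop = TemporalDropoutSketch(temporal_dropout)
        self.emb = EmbodimentLayersSketch(output_dim, num_embodiments)
        self.smooth = GaussianSmoothing1D(output_dim, smoothing_kernel, smoothing_sigma)

    def forward(self, modality_features: dict[str, torch.Tensor],
                embodiment_id: torch.Tensor) -> torch.Tensor:
        feats = self.mod_drop(modality_features)                 # drop whole modalities
        fused = sum(self.proj[n](f) for n, f in feats.items())  # (batch, time, output_dim)
        fused = self.emb(fused, embodiment_id)                   # per-robot scale + shift
        fused = self.temp_drop(fused)                            # drop whole timesteps
        return self.smooth(fused)                                # smooth along time

Summation keeps the fused width fixed at output_dim no matter how many modalities survive dropout, which is one reason it pairs naturally with modality dropout.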
Data: Brain Codes as Auxiliary Signal¶
TRIBEv2 can predict fMRI responses to video stimuli. We use these predicted responses as an auxiliary training signal:
# In training config
data_soup:
  sources:
    - name: brain_codes
      type: tribev2_fmri
      path: facebook/tribev2
      weight: 0.05        # Light auxiliary signal
      max_samples: 500
The brain activation predictions are compressed via PCA to 64-dimensional "brain codes" and used as an auxiliary prediction target. Even synthetic brain codes (noise convolved with a hemodynamic response function, HRF) act as a temporal regularizer.
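A sketch of that pipeline is below. The 64-dimensional codes and the 0.05 weight come from the text and config above; the fMRI array shapes, the brain_head projection, and the brain_code_loss helper are our illustrative assumptions:

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Offline: compress predicted fMRI responses into 64-dim "brain codes"
fmri_preds = np.random.randn(500, 8192).astype(np.float32)    # stand-in for TRIBEv2 outputs
brain_codes = PCA(n_components=64).fit_transform(fmri_preds)  # (500, 64)

# During training: a small head predicts the codes from the fused features,
# adding a lightly weighted MSE term on top of the main action loss
brain_head = nn.Linear(2048, 64)

def brain_code_loss(fused: torch.Tensor, codes: torch.Tensor, weight: float = 0.05):
    return weight * nn.functional.mse_loss(brain_head(fused), codes)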
Architecture Diagram¶
Camera ──→ VideoBackbone ──→ vision_feat  (2048d) ─┐
Audio  ──→ AudioEncoder  ──→ audio_feat   (512d)  ─┤
Joints ──→ ProprioEnc    ──→ proprio_feat (256d)  ─┼─→ BrainFusion ──→ ActionHead
LiDAR  ──→ PointCloudEnc ──→ lidar_feat   (256d)  ─┤      │ modality_dropout
EEF    ──→ EEFEncoder    ──→ eef_feat     (128d)  ─┘      │ embodiment_layers
                                                          │ temporal_dropout
                                                          │ temporal_smoothing
                                                          ↓
                                                    fused (2048d)