Brain-Inspired Fusion (TRIBEv2)

Neon borrows neuroscience techniques from Meta's TRIBEv2 — a foundation model that predicts fMRI brain responses to video, audio, and text.

Why Brain Science for Robots?

The human brain is the best multimodal fusion system we know. TRIBEv2 reverse-engineers how the brain combines vision, hearing, and language. We port their key techniques to make Neon's sensor fusion more robust.

Techniques

1. Temporal Smoothing

from neon.model.brain_encoder import TemporalSmoothing

smoother = TemporalSmoothing(dim=17, kernel_size=9, sigma=2.0)
# Input:  (batch, 16, 17) — 16 timesteps, 17 action dims
# Output: (batch, 16, 17) — smoothed action trajectory
smooth_actions = smoother(raw_actions)

What: Gaussian 1D convolution across the time axis.
Why: Eliminates jitter between consecutive action predictions. The robot moves smoothly instead of oscillating.
From TRIBEv2: Used to smooth fMRI predictions across TRs (repetition times).
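
For intuition, here is a minimal sketch of the technique, assuming a fixed (non-learned) Gaussian kernel applied as a depthwise 1D convolution over the time axis. The class name and details are illustrative, not the neon.model.brain_encoder implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTemporalSmoothingSketch(nn.Module):
    """Sketch: fixed Gaussian 1D convolution along the time axis, one kernel per channel."""

    def __init__(self, dim: int, kernel_size: int = 9, sigma: float = 2.0):
        super().__init__()
        # Normalized Gaussian kernel centered on the window.
        offsets = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        kernel = torch.exp(-0.5 * (offsets / sigma) ** 2)
        kernel = kernel / kernel.sum()
        # Depthwise weights: (dim, 1, kernel_size), one copy of the kernel per channel.
        self.register_buffer("weight", kernel.view(1, 1, -1).repeat(dim, 1, 1))
        self.dim = dim
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> conv1d expects (batch, dim, time)
        x = x.transpose(1, 2)
        x = F.conv1d(x, self.weight, padding=self.pad, groups=self.dim)
        return x.transpose(1, 2)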

2. Modality Dropout

from neon.model.brain_encoder import ModalityDropout

mod_drop = ModalityDropout(p=0.15)
features = {
    "vision": vision_feat,      # (batch, 2048)
    "proprio": proprio_feat,    # (batch, 256)
    "audio": audio_feat,        # (batch, 512)
    "lidar": lidar_feat,        # (batch, 256)
}
# During training: randomly zeros entire modalities
robust_features = mod_drop(features)

What: Randomly zero out entire modality feature vectors (not individual elements).
Why: Forces the model to learn redundant cross-modal representations. If the camera is occluded, the robot uses proprioception + LiDAR. If audio is noisy, it relies on vision.
From TRIBEv2: modality_dropout parameter in FmriEncoder.
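
A sketch of the mechanism (illustrative, not the library code), assuming (batch, dim) feature tensors and one Bernoulli draw per sample per modality:

import torch
import torch.nn as nn

class ModalityDropoutSketch(nn.Module):
    """Sketch: zero out whole modality feature vectors with probability p (training only)."""

    def __init__(self, p: float = 0.15):
        super().__init__()
        self.p = p

    def forward(self, features: dict) -> dict:
        if not self.training or self.p == 0.0:
            return features
        out = {}
        for name, feat in features.items():
            # One Bernoulli draw per sample per modality, broadcast over the feature dim.
            # Assumes (batch, dim) tensors, as in the example above.
            keep = (torch.rand(feat.shape[0], 1, device=feat.device) >= self.p).to(feat.dtype)
            out[name] = feat * keep
        return out

A production implementation would typically guarantee that at least one modality survives for every sample; the sketch omits that guard for brevity.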

3. Embodiment Layers

from neon.model.brain_encoder import EmbodimentLayers

emb_layers = EmbodimentLayers(
    in_channels=2048,
    out_channels=2048,
    num_embodiments=8,
)
# Per-robot adaptation: G1 gets different scale+shift than Franka
g1_features = emb_layers(features, embodiment_id=torch.tensor([0]))
franka_features = emb_layers(features, embodiment_id=torch.tensor([3]))

What: Per-embodiment affine transform (scale γ + shift β) on features.
Why: One base model works across robot types. Each embodiment gets minimal per-robot adaptation (~4K params) instead of full fine-tuning.
From TRIBEv2: SubjectLayers — each brain (subject) gets its own affine transform.
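
A sketch of the idea, assuming one learned (γ, β) pair per embodiment stored in embedding tables; the class and attribute names are illustrative:

import torch
import torch.nn as nn

class EmbodimentAffineSketch(nn.Module):
    """Sketch: per-embodiment scale (gamma) and shift (beta), looked up by embodiment id."""

    def __init__(self, channels: int, num_embodiments: int = 8):
        super().__init__()
        self.gamma = nn.Embedding(num_embodiments, channels)
        self.beta = nn.Embedding(num_embodiments, channels)
        nn.init.ones_(self.gamma.weight)   # start as identity: x * 1 + 0
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels); embodiment_id: (batch,) integer ids
        return self.gamma(embodiment_id) * x + self.beta(embodiment_id)

With 2048 channels, γ and β together are 2 × 2048 = 4,096 parameters per embodiment, which is where the ~4K figure above comes from.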

4. Temporal Dropout

from neon.model.brain_encoder import TemporalDropout

temp_drop = TemporalDropout(p=0.1)
# Randomly zeros entire timesteps — simulates dropped frames
robust_sequence = temp_drop(sequence_features)  # (batch, time, dim)

What: Randomly zero entire timesteps in the input sequence.
Why: Simulates dropped camera frames, network delays, sensor lag. Teaches the action chunking head to interpolate gaps.
From TRIBEv2: temporal_dropout parameter in FmriEncoder.
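
A sketch, assuming one Bernoulli draw per (sample, timestep) broadcast across the feature dimension. The sketch deliberately skips standard dropout's 1/(1−p) rescaling so that a dropped timestep looks like a genuinely missing frame:

import torch
import torch.nn as nn

class TemporalDropoutSketch(nn.Module):
    """Sketch: zero out entire timesteps with probability p (training only)."""

    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        if not self.training or self.p == 0.0:
            return x
        # One Bernoulli draw per (sample, timestep), broadcast across the feature dim.
        keep = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) >= self.p).to(x.dtype)
        return x * keep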

5. BrainFusion Module

All of the above techniques are combined into a drop-in replacement for the simple linear fusion:

from neon.model.brain_encoder import BrainFusion

fusion = BrainFusion(
    modality_dims={
        "vision": 2048,
        "proprio": 256,
        "audio": 512,
        "lidar": 256,
        "eef": 128,
    },
    output_dim=2048,
    modality_dropout=0.15,
    temporal_dropout=0.05,
    temporal_smoothing=True,
    smoothing_kernel=5,
    smoothing_sigma=1.5,
    num_embodiments=8,
)

fused = fusion(
    modality_features={"vision": v, "proprio": p, "audio": a, "lidar": l, "eef": e},
    embodiment_id=torch.tensor([0]),  # G1
)
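
As a rough mental model of how the pieces could compose (not the actual BrainFusion internals): project each modality to the shared width, apply modality dropout, sum, then apply the per-embodiment affine. Temporal dropout and smoothing act along the time axis when the features carry one and are omitted here. The sketch reuses the sketch classes from the sections above.

import torch
import torch.nn as nn

class BrainFusionSketch(nn.Module):
    """Rough mental model only: project, drop modalities, sum, then per-robot affine."""

    def __init__(self, modality_dims: dict, output_dim: int, num_embodiments: int = 8):
        super().__init__()
        # One linear projection per modality into the shared width.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, output_dim) for name, dim in modality_dims.items()}
        )
        # ModalityDropoutSketch and EmbodimentAffineSketch are defined in the sketches above.
        self.mod_drop = ModalityDropoutSketch(p=0.15)
        self.embodiment = EmbodimentAffineSketch(output_dim, num_embodiments)

    def forward(self, modality_features: dict, embodiment_id: torch.Tensor) -> torch.Tensor:
        feats = self.mod_drop(modality_features)               # drop whole modalities in training
        projected = [self.proj[name](f) for name, f in feats.items()]
        fused = torch.stack(projected, dim=0).sum(dim=0)       # simple sum over modalities
        return self.embodiment(fused, embodiment_id)           # per-robot scale + shift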

Data: Brain Codes as Auxiliary Signal

TRIBEv2 can predict fMRI responses to video stimuli. We use these predicted responses as an auxiliary training signal:

# In training config
data_soup:
  sources:
    - name: brain_codes
      type: tribev2_fmri
      path: facebook/tribev2
      weight: 0.05  # Light auxiliary signal
      max_samples: 500

The brain activation predictions are compressed via PCA to 64-dimensional "brain codes" and used as an auxiliary prediction target. Even synthetic brain codes (HRF-convolved noise) act as a temporal regularizer.
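
One way this auxiliary target could be wired in, as a sketch: a small linear head on the fused features predicts the 64-dimensional brain code, and a down-weighted MSE term is added to the action loss. The head, the function name, and the loss weight shown are illustrative assumptions, not the training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical auxiliary head: predict the 64-d brain code from the fused features.
brain_code_head = nn.Linear(2048, 64)

def brain_code_loss(fused: torch.Tensor, brain_codes: torch.Tensor, weight: float = 0.05) -> torch.Tensor:
    """Down-weighted MSE between predicted and target brain codes."""
    pred = brain_code_head(fused)                 # (batch, 64)
    return weight * F.mse_loss(pred, brain_codes)

# total_loss = action_loss + brain_code_loss(fused, brain_codes)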

Architecture Diagram

Camera ──→ VideoBackbone ──→ vision_feat  (2048d) ──┐
Audio  ──→ AudioEncoder  ──→ audio_feat   (512d)  ──┤
Joints ──→ ProprioEnc    ──→ proprio_feat (256d)  ──┼──→ BrainFusion ──→ fused (2048d) ──→ ActionHead
LiDAR  ──→ PointCloudEnc ──→ lidar_feat   (256d)  ──┤     (modality_dropout, embodiment_layers,
EEF    ──→ EEFEncoder    ──→ eef_feat     (128d)  ──┘      temporal_dropout, temporal_smoothing)

References

  • d'Ascoli et al. "A foundation model of vision, audition, and language for in-silico neuroscience." Meta AI, 2026. Paper | Code | Weights
  • Key files ported: tribev2/model.py (TemporalSmoothing, FmriEncoder, SubjectLayers, modality_dropout, temporal_dropout)