Self-Learning (Online Adaptation)

Neon includes an online self-supervised adaptation engine that continuously improves action heads during deployment — no human labels needed.

Key Insight

A frozen video backbone that understands physics can serve as both the perception system and the reward model.

How It Works

```mermaid
graph LR
    A[Snapshot Before] -->|f_before| B[Predict Outcome]
    B -->|f_predicted| C[Execute Actions]
    C --> D[Observe After]
    D -->|f_after| E[Compute Losses]
    E --> F[Update Action Heads]
    F -->|backbone frozen| A
```

The self-learning loop runs once per action chunk (16 steps):

  1. SNAPSHOT — Encode observation before action → f_before
  2. PREDICT — Forward model predicts expected outcome features → f_predicted
  3. ACT — Execute action chunk (16 steps)
  4. OBSERVE — Encode observation after action → f_after
  5. LEARN — Compute self-supervised losses
  6. UPDATE — Backprop through action heads only (backbone stays frozen)
  7. PROTECT — EWC regularization prevents catastrophic forgetting
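
Condensed into code, one pass of the loop is small. A minimal sketch assuming PyTorch; `backbone`, `forward_model`, `action_head`, and the confidence proxy here are illustrative stand-ins, not the actual Neon API (the Quick Start below shows the real `pre_action`/`post_action` entry points):

```python
import torch
import torch.nn.functional as F

def adaptation_step(backbone, forward_model, action_head, optimizer,
                    obs_before, obs_after, actions, confidence_threshold=0.6):
    # SNAPSHOT / OBSERVE: encode both observations with the frozen backbone
    with torch.no_grad():
        f_before = backbone(obs_before)
        f_after = backbone(obs_after)

    # PREDICT: expected outcome features for the executed action chunk
    f_predicted = forward_model(f_before, actions)

    # LEARN: the outcome term trains the forward model; the consistency term
    # (re-predicting the executed actions) reaches the action head
    a_repredicted = action_head(f_before)
    loss = F.mse_loss(f_predicted, f_after) + 0.3 * F.smooth_l1_loss(a_repredicted, actions)

    # PROTECT: hypothetical confidence proxy; only learn when the outcome
    # assessment looks trustworthy
    confidence = 1.0 / (1.0 + loss.detach().item())
    if confidence < confidence_threshold:
        return None

    # UPDATE: gradients reach the action head and forward model, never the backbone
    optimizer.zero_grad()
    loss.backward()
    params = list(action_head.parameters()) + list(forward_model.parameters())
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    return loss.item()
```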

Three Self-Supervised Losses

1. Outcome Prediction Loss

$$\mathcal{L}_\text{outcome} = \| f_\text{predicted} - f_\text{after} \|^2$$

"How well did we predict what would happen?" The primary learning signal — high error means surprising outcome with high learning value.

2. Temporal Coherence Loss

$$\mathcal{L}_\text{coherence} = 1 - \cos(\Delta f, \, e_\text{instruction})$$

"Did the world change in the direction the instruction implied?" Measures alignment between the feature change vector and the instruction embedding.

3. Action Consistency Loss

$$\mathcal{L}_\text{consistency} = \text{SmoothL1}(\hat{a}_\text{re-predicted}, a_\text{original})$$

"Would we take the same action again given the outcome?" Temporal consistency regularization — the policy should be smooth across similar states.

Quick Start

```python
from neon.model.neon_vla import NeonVLA, NeonConfig
from neon.training.self_learner import NeonSelfLearner, SelfLearnerConfig

# Create model
model = NeonVLA(NeonConfig(action_head_type="flow"))
model.load_backbone()

# Attach self-learner
learner = NeonSelfLearner(model, SelfLearnerConfig(
    strategy="balanced",       # conservative | balanced | aggressive
    learning_rate=1e-5,
    w_ewc=1000.0,              # EWC regularization strength
))

# In the control loop
while running:
    # 1. Snapshot before action
    learner.pre_action(image, instruction, joint_state)

    # 2. Get and execute actions
    output = model.predict(image, instruction=instruction, proprioception=joint_state)
    robot.execute(output.raw_actions)

    # 3. Observe outcome and learn
    metrics = learner.post_action(new_image)
    print(f"Loss: {metrics['total_loss']:.4f}, Confidence: {metrics['confidence']:.2f}")

# Save adapted model
learner.save("adapted_model/")
```

Adaptation Strategies

| Strategy     | Confidence Threshold | Best For                               |
|--------------|----------------------|----------------------------------------|
| conservative | 0.8                  | Production deployment, safety-critical |
| balanced     | 0.6                  | General use (default)                  |
| aggressive   | 0.3                  | Rapid adaptation, lab environments     |

Safety Mechanisms

Elastic Weight Consolidation (EWC)

Prevents catastrophic forgetting by penalizing deviations from pre-trained weights:

$$\mathcal{L}_\text{EWC} = \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_i)^2$$

where $F_i$ is the diagonal of the Fisher information and $\theta^*_i$ are the pre-trained weights; parameters that the pre-trained behavior depends on most are protected most strongly.
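
A minimal sketch of the penalty, assuming the Fisher diagonal and pre-trained weights were captured as per-parameter dictionaries when adaptation began (`fisher_diag` and `theta_star` are illustrative names, not the Neon internals):

```python
def ewc_penalty(model, fisher_diag, theta_star, lam=1000.0):
    # (λ/2) Σ_i F_i (θ_i - θ*_i)^2 over the adaptable parameters
    penalty = 0.0
    for name, theta in model.named_parameters():
        if name in fisher_diag:  # only parameters with a stored Fisher estimate
            penalty = penalty + (fisher_diag[name] * (theta - theta_star[name]) ** 2).sum()
    return 0.5 * lam * penalty
```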

Additional Safety

  • Confidence gating — Only learn from high-confidence assessments
  • Learning rate warmup — Start conservative, increase as confidence grows
  • Gradient clipping — Max grad norm of 1.0
  • Checkpoint & rollback — Revert if loss degrades beyond a threshold (see the sketch after this list)
  • Prioritized replay — Novel situations get more learning attention
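
A sketch of the checkpoint-and-rollback guard, assuming `rollback_threshold` means an allowed absolute loss increase over the last checkpoint (the actual semantics in Neon may differ):

```python
import copy

class RollbackGuard:
    """Snapshot the adaptable weights periodically; revert if the loss regresses."""

    def __init__(self, module, checkpoint_every=50, rollback_threshold=0.2):
        self.module = module
        self.checkpoint_every = checkpoint_every
        self.rollback_threshold = rollback_threshold
        self.step = 0
        self.loss_at_checkpoint = None
        self.snapshot = copy.deepcopy(module.state_dict())

    def after_update(self, loss):
        self.step += 1
        regressed = (self.loss_at_checkpoint is not None
                     and loss > self.loss_at_checkpoint + self.rollback_threshold)
        if regressed:
            self.module.load_state_dict(self.snapshot)  # revert the bad drift
        elif self.step % self.checkpoint_every == 0:
            self.snapshot = copy.deepcopy(self.module.state_dict())
            self.loss_at_checkpoint = loss
```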

Prioritized Experience Replay

The replay buffer stores experiences and prioritizes surprising ones:

```python
# Experiences with higher prediction error are replayed more often
buffer = PrioritizedReplayBuffer(capacity=1000, alpha=0.6)

# After learning, update priorities based on TD error
buffer.update_priorities(indices, td_errors)
```
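
Under the hood, proportional prioritization samples experience i with probability proportional to priority_i ** alpha. A self-contained sketch of that scheme (an illustrative toy, not the actual Neon buffer API):

```python
import numpy as np

class TinyPrioritizedBuffer:
    """Proportional prioritized replay: P(i) ∝ priority_i ** alpha."""

    def __init__(self, capacity=1000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.items, self.priorities = [], []

    def add(self, experience, priority=1.0):
        if len(self.items) >= self.capacity:  # evict the oldest when full
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(experience)
        self.priorities.append(priority)

    def sample(self, batch_size=8):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.items), size=batch_size, p=p)
        return [self.items[i] for i in idx], idx

    def update_priorities(self, indices, errors):
        for i, e in zip(indices, errors):  # surprising samples replay more often
            self.priorities[i] = abs(float(e)) + 1e-6
```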

Configuration Reference

```python
from dataclasses import dataclass

@dataclass
class SelfLearnerConfig:
    enabled: bool = True
    learning_rate: float = 1e-5          # Base learning rate
    lr_warmup_steps: int = 50            # Warmup period
    max_lr: float = 5e-5                 # Peak LR after warmup
    strategy: str = "balanced"           # Adaptation aggressiveness

    # Confidence gating
    confidence_threshold: float = 0.6
    min_feature_change: float = 0.01     # Skip if the robot didn't move
    max_feature_change: float = 5.0      # Skip if the scene glitched

    # Loss weights
    w_outcome: float = 1.0
    w_coherence: float = 0.5
    w_consistency: float = 0.3
    w_ewc: float = 1000.0                # EWC strength

    # Replay
    buffer_size: int = 1000
    batch_size: int = 8
    replay_every: int = 10

    # Safety
    max_grad_norm: float = 1.0
    checkpoint_every: int = 50
    rollback_threshold: float = 0.2
```

What Makes This Novel

| Approach           | Reward Source              | Human-in-Loop | Separate Model |
|--------------------|----------------------------|---------------|----------------|
| RLHF/PPO           | Learned reward model       | Yes           | Yes            |
| DAgger             | Expert demonstrations      | Yes           | No             |
| Test-Time Training | Self-supervised on input   | No            | No             |
| Neon Self-Learning | Self-supervised on outcome | No            | No             |

Neon uses the frozen video backbone as a zero-shot critic — measuring whether actions produced the intended physical effect in the world. No separate reward model, no human expert.

References

  • Kirkpatrick et al. "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, PNAS 2017)
  • Sun et al. "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts" (ICML 2020)
  • Chebotar et al. "Autonomous Improvement of Robot Policies" (CoRL 2021)
  • Sun et al. "Learning to (Learn at Test Time)" (ICML 2024)
  • Hansen et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control" (ICLR 2024)