Self-Learning (Online Adaptation)¶
Neon includes an online self-supervised adaptation engine that continuously improves action heads during deployment — no human labels needed.
**Key Insight:** A frozen video backbone that understands physics can serve as both the perception system and the reward model.
How It Works¶
```mermaid
graph LR
    A[Snapshot Before] -->|f_before| B[Predict Outcome]
    B -->|f_predicted| C[Execute Actions]
    C --> D[Observe After]
    D -->|f_after| E[Compute Losses]
    E --> F[Update Action Heads]
    F -->|backbone frozen| A
```
The self-learning loop runs every action chunk (16 steps); a minimal code sketch of one iteration follows this list:

- SNAPSHOT — Encode observation before action → `f_before`
- PREDICT — Forward model predicts expected outcome features → `f_predicted`
- ACT — Execute action chunk (16 steps)
- OBSERVE — Encode observation after action → `f_after`
- LEARN — Compute self-supervised losses
- UPDATE — Backprop through action heads only (backbone stays frozen)
- PROTECT — EWC regularization prevents catastrophic forgetting
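The sketch below shows how these phases could fit together in one iteration. Every name here (`backbone`, `forward_model`, `action_head`, `robot`) is a placeholder, not Neon's actual internals, and only the outcome and consistency terms are included; the coherence and EWC terms described later are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def self_learning_step(backbone, forward_model, action_head, optimizer,
                       obs_before, instruction, robot):
    """Illustrative only: every component here is a placeholder, not Neon's API."""
    with torch.no_grad():                                    # backbone stays frozen
        f_before = backbone.encode(obs_before, instruction)  # SNAPSHOT

    f_predicted = forward_model(f_before)                    # PREDICT expected outcome features
    actions = action_head(f_before)                          # ACT: one 16-step action chunk
    obs_after = robot.execute(actions.detach())

    with torch.no_grad():
        f_after = backbone.encode(obs_after, instruction)    # OBSERVE

    # LEARN: outcome + action-consistency terms (coherence and EWC omitted here)
    loss = (F.mse_loss(f_predicted, f_after)
            + 0.3 * F.smooth_l1_loss(action_head(f_after), actions.detach()))

    optimizer.zero_grad()
    loss.backward()                                          # UPDATE: gradients reach the heads only
    torch.nn.utils.clip_grad_norm_(
        list(forward_model.parameters()) + list(action_head.parameters()), max_norm=1.0)
    optimizer.step()
    return loss.item()
```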
Three Self-Supervised Losses¶
1. Outcome Prediction Loss¶
$$\mathcal{L}_\text{outcome} = \| f_\text{predicted} - f_\text{after} \|^2$$
"How well did we predict what would happen?" The primary learning signal — high error means surprising outcome with high learning value.
2. Temporal Coherence Loss¶
$$\mathcal{L}_\text{coherence} = 1 - \cos(\Delta f, \, e_\text{instruction})$$
"Did the world change in the direction the instruction implied?" Measures alignment between the feature change vector and the instruction embedding.
3. Action Consistency Loss¶
$$\mathcal{L}_\text{consistency} = \text{SmoothL1}(\hat{a}_\text{re-predicted}, a_\text{original})$$
"Would we take the same action again given the outcome?" Temporal consistency regularization — the policy should be smooth across similar states.
Quick Start¶
```python
from neon.model.neon_vla import NeonVLA, NeonConfig
from neon.training.self_learner import NeonSelfLearner, SelfLearnerConfig

# Create model
model = NeonVLA(NeonConfig(action_head_type="flow"))
model.load_backbone()

# Attach self-learner
learner = NeonSelfLearner(model, SelfLearnerConfig(
    strategy="balanced",   # conservative | balanced | aggressive
    learning_rate=1e-5,
    w_ewc=1000.0,          # EWC regularization strength
))

# In the control loop
while running:
    # 1. Snapshot before action
    learner.pre_action(image, instruction, joint_state)

    # 2. Get and execute actions
    output = model.predict(image, instruction=instruction, proprioception=joint_state)
    robot.execute(output.raw_actions)

    # 3. Observe outcome and learn
    metrics = learner.post_action(new_image)
    print(f"Loss: {metrics['total_loss']:.4f}, Confidence: {metrics['confidence']:.2f}")

# Save adapted model
learner.save("adapted_model/")
```
Adaptation Strategies¶
| Strategy | Confidence Threshold | Best For |
|---|---|---|
| `conservative` | 0.8 | Production deployment, safety-critical |
| `balanced` | 0.6 | General use (default) |
| `aggressive` | 0.3 | Rapid adaptation, lab environments |
Safety Mechanisms¶
Elastic Weight Consolidation (EWC)¶
Prevents catastrophic forgetting by penalizing deviations from pre-trained weights:
$$\mathcal{L}_\text{EWC} = \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_i)^2$$
where $F_i$ is the diagonal of the Fisher information and $\theta^*_i$ are the pre-trained weights; parameters with high Fisher values (those the pre-trained behavior depends on most) are protected most strongly.
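A minimal sketch of this penalty, assuming `ref_params` and `fisher_diag` are dicts keyed by parameter name (these names are illustrative, not Neon's API); the result would be added to the total self-supervised loss before backprop:

```python
import torch

def ewc_penalty(named_params, ref_params, fisher_diag, lam=1000.0):
    # (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2 over the trainable parameters.
    penalty = torch.zeros(())
    for name, p in named_params:
        penalty = penalty + (fisher_diag[name] * (p - ref_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```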
Additional Safety¶
- Confidence gating — Only learn from high-confidence assessments
- Learning rate warmup — Start conservative, increase as confidence grows
- Gradient clipping — Max grad norm of 1.0
- Checkpoint & rollback — Revert if loss degrades beyond threshold
- Prioritized replay — Novel situations get more learning attention
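As a sketch of how confidence gating, gradient clipping, checkpointing, and rollback could fit around a single update (the helper and its `state` dict are hypothetical, not Neon's API):

```python
import copy
import torch

def guarded_update(heads, optimizer, loss, confidence, state,
                   confidence_threshold=0.6, rollback_threshold=0.2, max_grad_norm=1.0):
    # `heads` is the trainable action-head module; `state` caches the last good weights/loss.
    if confidence < confidence_threshold:
        return "skipped"                                   # confidence gating

    if loss.item() > state.get("best_loss", float("inf")) + rollback_threshold:
        heads.load_state_dict(state["best_weights"])       # rollback to last checkpoint
        return "rolled_back"

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(heads.parameters(), max_grad_norm)
    optimizer.step()

    if loss.item() < state.get("best_loss", float("inf")):
        state["best_loss"] = loss.item()                   # checkpoint the improved weights
        state["best_weights"] = copy.deepcopy(heads.state_dict())
    return "updated"
```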
Prioritized Experience Replay¶
The replay buffer stores experiences and prioritizes surprising ones:
```python
# Experiences with higher prediction error are replayed more often
buffer = PrioritizedReplayBuffer(capacity=1000, alpha=0.6)

# After learning, update priorities based on TD error
buffer.update_priorities(indices, td_errors)
```
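Assuming the standard prioritized-replay formulation, an experience $i$ with priority $p_i$ (here, its prediction error) is sampled with probability

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

so `alpha=0.6` interpolates between uniform replay ($\alpha = 0$) and strictly greedy prioritization ($\alpha = 1$).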
Configuration Reference¶
```python
from dataclasses import dataclass

@dataclass
class SelfLearnerConfig:
    enabled: bool = True
    learning_rate: float = 1e-5       # Base learning rate
    lr_warmup_steps: int = 50         # Warmup period
    max_lr: float = 5e-5              # Peak LR after warmup
    strategy: str = "balanced"        # Adaptation aggressiveness

    # Confidence gating
    confidence_threshold: float = 0.6
    min_feature_change: float = 0.01  # Skip if robot didn't move
    max_feature_change: float = 5.0   # Skip if scene glitch

    # Loss weights
    w_outcome: float = 1.0
    w_coherence: float = 0.5
    w_consistency: float = 0.3
    w_ewc: float = 1000.0             # EWC strength

    # Replay
    buffer_size: int = 1000
    batch_size: int = 8
    replay_every: int = 10

    # Safety
    max_grad_norm: float = 1.0
    checkpoint_every: int = 50
    rollback_threshold: float = 0.2
```
What Makes This Novel¶
| Approach | Reward Source | Human-in-Loop | Separate Model |
|---|---|---|---|
| RLHF/PPO | Learned reward model | ✅ | ✅ |
| DAgger | Expert demonstrations | ✅ | ❌ |
| Test-Time Training | Self-supervised on input | ❌ | ❌ |
| Neon Self-Learning | Self-supervised on outcome | ❌ | ❌ |
Neon uses the frozen video backbone as a zero-shot critic — measuring whether actions produced the intended physical effect in the world. No separate reward model, no human expert.
References¶
- Kirkpatrick et al. "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, PNAS 2017)
- Sun et al. "Test-Time Training with Self-Supervision" (NeurIPS 2020)
- Chebotar et al. "Autonomous Improvement of Robot Policies" (CoRL 2021)
- Sun et al. "Learning to (Learn at Test Time)" (ICML 2024)
- Hansen et al. "TD-MPC2: Scalable, Robust World Models" (ICLR 2024)