Self-Learning (Online Adaptation)¶
Neon includes an online self-supervised adaptation engine that continuously improves action heads during deployment — no human labels needed.
**Key Insight:** A frozen video backbone that understands physics can serve as both the perception system and the reward model.
How It Works¶
```mermaid
graph LR
    A[Snapshot Before] -->|f_before| B[Predict Outcome]
    B -->|f_predicted| C[Execute Actions]
    C --> D[Observe After]
    D -->|f_after| E[Compute Losses]
    E --> F[Update Action Heads]
    F -->|backbone frozen| A
```
The self-learning loop runs every action chunk (16 steps); a minimal code sketch of one iteration follows this list:

- SNAPSHOT — Encode observation before action → `f_before`
- PREDICT — Forward model predicts expected outcome features → `f_predicted`
- ACT — Execute action chunk (16 steps)
- OBSERVE — Encode observation after action → `f_after`
- LEARN — Compute self-supervised losses
- UPDATE — Backprop through action heads only (backbone stays frozen)
- PROTECT — EWC regularization prevents catastrophic forgetting
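The sketch below shows how these phases could fit together in one iteration. Every name here (`backbone`, `forward_model`, `action_head`, `robot`) is a placeholder, not Neon's actual internals, and only the outcome and consistency terms are included; the coherence and EWC terms described later are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def self_learning_step(backbone, forward_model, action_head, optimizer,
                       obs_before, instruction, robot):
    """Illustrative only: every component here is a placeholder, not Neon's API."""
    with torch.no_grad():                                    # backbone stays frozen
        f_before = backbone.encode(obs_before, instruction)  # SNAPSHOT

    f_predicted = forward_model(f_before)                    # PREDICT expected outcome features
    actions = action_head(f_before)                          # ACT: one 16-step action chunk
    obs_after = robot.execute(actions.detach())

    with torch.no_grad():
        f_after = backbone.encode(obs_after, instruction)    # OBSERVE

    # LEARN: outcome + action-consistency terms (coherence and EWC omitted here)
    loss = (F.mse_loss(f_predicted, f_after)
            + 0.3 * F.smooth_l1_loss(action_head(f_after), actions.detach()))

    optimizer.zero_grad()
    loss.backward()                                          # UPDATE: gradients reach the heads only
    torch.nn.utils.clip_grad_norm_(
        list(forward_model.parameters()) + list(action_head.parameters()), max_norm=1.0)
    optimizer.step()
    return loss.item()
```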
Three Self-Supervised Losses¶
1. Outcome Prediction Loss¶
$$\mathcal{L}_\text{outcome} = \| f_\text{predicted} - f_\text{after} \|^2$$
"How well did we predict what would happen?" The primary learning signal — high error means surprising outcome with high learning value.
2. Temporal Coherence Loss¶
$$\mathcal{L}_\text{coherence} = 1 - \cos(\Delta f, \, e_\text{instruction})$$
"Did the world change in the direction the instruction implied?" Measures alignment between the feature change vector and the instruction embedding.
3. Action Consistency Loss¶
$$\mathcal{L}_\text{consistency} = \text{SmoothL1}(\hat{a}_\text{re-predicted}, a_\text{original})$$
"Would we take the same action again given the outcome?" Temporal consistency regularization — the policy should be smooth across similar states.
Quick Start¶
```python
from neon.model.neon_vla import NeonVLA, NeonConfig
from neon.training.self_learner import NeonSelfLearner, SelfLearnerConfig

# Create model
model = NeonVLA(NeonConfig(action_head_type="flow"))
model.load_backbone()

# Attach self-learner
learner = NeonSelfLearner(model, SelfLearnerConfig(
    strategy="balanced",   # conservative | balanced | aggressive
    learning_rate=1e-5,
    w_ewc=1000.0,          # EWC regularization strength
))

# In the control loop
while running:
    # 1. Snapshot before action
    learner.pre_action(image, instruction, joint_state)

    # 2. Get and execute actions
    output = model.predict(image, instruction=instruction, proprioception=joint_state)
    robot.execute(output.raw_actions)

    # 3. Observe outcome and learn
    metrics = learner.post_action(new_image)
    print(f"Loss: {metrics['total_loss']:.4f}, Confidence: {metrics['confidence']:.2f}")

# Save adapted model
learner.save("adapted_model/")
```
Adaptation Strategies¶
| Strategy | Confidence Threshold | Best For |
|---|---|---|
| `conservative` | 0.8 | Production deployment, safety-critical |
| `balanced` | 0.6 | General use (default) |
| `aggressive` | 0.3 | Rapid adaptation, lab environments |
Safety Mechanisms¶
Elastic Weight Consolidation (EWC)¶
Prevents catastrophic forgetting by penalizing deviations from pre-trained weights:
$$\mathcal{L}_\text{EWC} = \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_i)^2$$
where $F_i$ is the diagonal of the Fisher information and $\theta^*_i$ are the pre-trained weights; parameters with high Fisher values (those the pre-trained behavior depends on most) are protected most strongly.
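A minimal sketch of this penalty, assuming `ref_params` and `fisher_diag` are dicts keyed by parameter name (these names are illustrative, not Neon's API); the result would be added to the total self-supervised loss before backprop:

```python
import torch

def ewc_penalty(named_params, ref_params, fisher_diag, lam=1000.0):
    # (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2 over the trainable parameters.
    penalty = torch.zeros(())
    for name, p in named_params:
        penalty = penalty + (fisher_diag[name] * (p - ref_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```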
Additional Safety¶
- Confidence gating — Only learn from high-confidence assessments
- Learning rate warmup — Start conservative, increase as confidence grows
- Gradient clipping — Max grad norm of 1.0
- Checkpoint & rollback — Revert if loss degrades beyond threshold
- Prioritized replay — Novel situations get more learning attention
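As a sketch of how confidence gating, gradient clipping, checkpointing, and rollback could fit around a single update (the helper and its `state` dict are hypothetical, not Neon's API):

```python
import copy
import torch

def guarded_update(heads, optimizer, loss, confidence, state,
                   confidence_threshold=0.6, rollback_threshold=0.2, max_grad_norm=1.0):
    # `heads` is the trainable action-head module; `state` caches the last good weights/loss.
    if confidence < confidence_threshold:
        return "skipped"                                   # confidence gating

    if loss.item() > state.get("best_loss", float("inf")) + rollback_threshold:
        heads.load_state_dict(state["best_weights"])       # rollback to last checkpoint
        return "rolled_back"

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(heads.parameters(), max_grad_norm)
    optimizer.step()

    if loss.item() < state.get("best_loss", float("inf")):
        state["best_loss"] = loss.item()                   # checkpoint the improved weights
        state["best_weights"] = copy.deepcopy(heads.state_dict())
    return "updated"
```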
Prioritized Experience Replay¶
The replay buffer stores experiences and prioritizes surprising ones:
```python
# Experiences with higher prediction error are replayed more often
buffer = PrioritizedReplayBuffer(capacity=1000, alpha=0.6)

# After learning, update priorities based on TD error
buffer.update_priorities(indices, td_errors)
```
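Assuming the standard prioritized-replay formulation, an experience $i$ with priority $p_i$ (here, its prediction error) is sampled with probability

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

so `alpha=0.6` interpolates between uniform replay ($\alpha = 0$) and strictly greedy prioritization ($\alpha = 1$).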
Configuration Reference¶
```python
from dataclasses import dataclass

@dataclass
class SelfLearnerConfig:
    enabled: bool = True
    learning_rate: float = 1e-5       # Base learning rate
    lr_warmup_steps: int = 50         # Warmup period
    max_lr: float = 5e-5              # Peak LR after warmup
    strategy: str = "balanced"        # Adaptation aggressiveness

    # Confidence gating
    confidence_threshold: float = 0.6
    min_feature_change: float = 0.01  # Skip if robot didn't move
    max_feature_change: float = 5.0   # Skip if scene glitch

    # Loss weights
    w_outcome: float = 1.0
    w_coherence: float = 0.5
    w_consistency: float = 0.3
    w_ewc: float = 1000.0             # EWC strength

    # Replay
    buffer_size: int = 1000
    batch_size: int = 8
    replay_every: int = 10

    # Safety
    max_grad_norm: float = 1.0
    checkpoint_every: int = 50
    rollback_threshold: float = 0.2
```
What Makes This Novel¶
| Approach | Reward Source | Human-in-Loop | Separate Model |
|---|---|---|---|
| RLHF/PPO | Learned reward model | ✅ | ✅ |
| DAgger | Expert demonstrations | ✅ | ❌ |
| Test-Time Training | Self-supervised on input | ❌ | ❌ |
| Neon Self-Learning | Self-supervised on outcome | ❌ | ❌ |
Neon uses the frozen video backbone as a zero-shot critic — measuring whether actions produced the intended physical effect in the world. No separate reward model, no human expert.
References¶
- Kirkpatrick et al. "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, PNAS 2017)
- Sun et al. "Test-Time Training with Self-Supervision" (NeurIPS 2020)
- Chebotar et al. "Autonomous Improvement of Robot Policies" (CoRL 2021)
- Sun et al. "Learning to (Learn at Test Time)" (ICML 2024)
- Hansen et al. "TD-MPC2: Scalable, Robust World Models" (ICLR 2024)