# Benchmarks
How Neon compares against GR00T N1, OmniVLA, and Cosmos-Predict2.5 — and how to reproduce every number.
## The Comparison Table
This is the table we're filling. TBD cells are measured by running `benchmarks/run_benchmarks.py`.
| Metric | GR00T N1 | OmniVLA | Cosmos 2.5 | Neon (7M) | Neon (44M) |
|---|---|---|---|---|---|
| Checkpoint size | ~2 GB | ~14 GB | N/A | ~25 MB | ~100 MB |
| Trainable params | 42M | ~7B | N/A | 7M | 44M |
| Training time | Days | Days | N/A | 18h | 45h |
| Training cost | $$$ | $$$ | N/A | $45 | $111 |
| Inference (Jetson) | ~80ms | ❌ OOM | ❌ | TBD | TBD |
| Inference (4090) | ~30ms | ~200ms | ~500ms | TBD | TBD |
| SIMPLER avg | TBD | TBD | N/A | TBD | TBD |
| Agibot MSE | TBD | TBD | N/A | TBD | TBD |
| Unseen zero-shot | ❌ needs FT | Limited | ✅ video only | TBD | TBD |
| Audio commands | ❌ | ❌ | ❌ | ✅ | ✅ |
| RTC blending | ❌ | ❌ | N/A | ✅ | ✅ |
| Open source | Partial | ✅ | Partial | ✅ | ✅ |
## Five Benchmark Suites
### 1. SIMPLER — Simulated Manipulation
Cross-embodiment evaluation from Open X-Embodiment. Tests whether data soup training generalizes to different robot bodies.
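As a rough sketch of how the "SIMPLER avg" cell in the table above is derived, per-task success rates are averaged; the task names and result layout below are placeholders, not the schema actually written by `benchmarks/run_benchmarks.py`.

```python
# Hypothetical per-task success rates; the real suite writes its numbers
# to benchmarks/results/ via run_benchmarks.py.
simpler_success = {
    "put_spoon_on_towel": 0.42,
    "put_carrot_on_plate": 0.35,
    "stack_green_block_on_yellow_block": 0.28,
}

simpler_avg = sum(simpler_success.values()) / len(simpler_success)
print(f"SIMPLER avg: {simpler_avg:.3f}")
```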
### 2. LIBERO — 130-Task Evaluation
Compositional reasoning, spatial understanding, long-horizon planning.
### 3. Agibot-World — Bimanual Manipulation
In-distribution sanity check on held-out data. Per-group MSE for left arm, right arm, locomotion.
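A minimal sketch of the per-group MSE, assuming the action vector is split into named index ranges; the slices below (7 dims per arm, 3 for locomotion) are illustrative, not the actual Agibot-World action layout.

```python
import numpy as np

# Illustrative action-dimension groups; the real layout may differ.
GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "locomotion": slice(14, 17),
}

def per_group_mse(pred, target):
    """MSE over each named slice of the action vector.

    pred, target: arrays of shape (num_samples, action_dim).
    """
    return {
        name: float(np.mean((pred[:, idx] - target[:, idx]) ** 2))
        for name, idx in GROUPS.items()
    }

# Random stand-ins for predicted and ground-truth actions.
pred = np.random.randn(64, 17)
target = np.random.randn(64, 17)
print(per_group_mse(pred, target))
```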
### 4. Unseen Zero-Shot — The Killer Benchmark
12 tasks deliberately excluded from training data. Tests whether the video foundation model backbone transfers general world knowledge to action prediction (the exclusion check is sketched after the table below).
| Category | Example Task | Why It's Hard |
|---|---|---|
| Novel objects | "Pick up the rubber duck" | Not in bridge/agibot data |
| Novel instructions | "Stack blocks by color" | Compositional reasoning |
| Novel physics | "Pour water into the cup" | Fluid dynamics |
| Novel environments | "Open the microwave" | Unseen affordances |
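A minimal sketch of the exclusion check, assuming instructions are compared as plain strings; the task list and training instructions below are placeholders, not the actual 12 held-out tasks or the real training set.

```python
# Held-out zero-shot instructions (placeholders drawn from the table above).
ZERO_SHOT_TASKS = [
    "pick up the rubber duck",
    "stack blocks by color",
    "pour water into the cup",
    "open the microwave",
]

# Placeholder stand-in for the instructions seen during training.
training_instructions = {
    "pick up the spoon",
    "put the carrot on the plate",
}

leaked = [t for t in ZERO_SHOT_TASKS if t in training_instructions]
assert not leaked, f"zero-shot tasks found in training data: {leaked}"
```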
### 5. Efficiency — Latency, VRAM, Size
Practical deployment metrics that differentiate Neon from larger VLAs.
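A minimal sketch of how latency and peak VRAM can be measured, using a tiny stand-in module where the real benchmark would load the Neon policy; checkpoint size is just the file size on disk.

```python
import time
import torch

# Stand-ins for the policy and one observation; Neon's real model and
# input format differ.
model = torch.nn.Linear(512, 32).cuda().eval()
obs = torch.randn(1, 512, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for _ in range(10):          # warm-up passes
        model(obs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):         # timed passes
        model(obs)
    torch.cuda.synchronize()

latency_ms = (time.perf_counter() - start) / 100 * 1000
vram_mb = torch.cuda.max_memory_allocated() / 1024**2
# Checkpoint size would be os.path.getsize() on the saved weights file.
print(f"{latency_ms:.2f} ms/step, peak {vram_mb:.0f} MB VRAM")
```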
## Run Everything
```bash
# All benchmarks, all metrics
python benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all \
    --output benchmarks/results/

# On a HuggingFace GPU
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 6h \
    benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all
```
Results are saved as JSON in `benchmarks/results/` and automatically pushed to the HuggingFace model card.
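The push happens automatically inside `run_benchmarks.py`; if you need to upload a results file by hand, a minimal sketch with `huggingface_hub` looks like this (the file name is a placeholder).

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="benchmarks/results/simpler.json",  # placeholder file
    path_in_repo="benchmarks/simpler.json",
    repo_id="cagataydev/neon-g1-v1",
    repo_type="model",
)
```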
## Training Tiers
Three training configs from fast iteration to maximum scale:
| Tier | Config | Data | Heads | Time | Cost |
|---|---|---|---|---|---|
| v1-dev | `benchmarks/configs/v1_dev.yaml` | 171K samples | 7M | 18h | $45 |
| v1-release | `benchmarks/configs/v1_release.yaml` | 171K samples | 44M | 45h | $111 |
| v1-full | `benchmarks/configs/v1_full.yaml` | 2.15M samples | 44M | 336h | $1,680 |
Train v1-dev first and evaluate. If the benchmarks pass, train v1-release (a minimal gating sketch follows). Details in `benchmarks/PLAN.md`.
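A minimal gating sketch for that workflow: read the v1-dev results JSON and decide whether to launch v1-release. The file name, result key, and 0.30 threshold are illustrative assumptions, not targets from `benchmarks/PLAN.md`.

```python
import json

# Placeholder results file written by run_benchmarks.py for the v1-dev model.
with open("benchmarks/results/v1_dev.json") as f:
    results = json.load(f)

# Hypothetical key and threshold; adjust to the actual metrics and targets.
if results.get("simpler_avg", 0.0) >= 0.30:
    print("v1-dev passes; launch v1-release with benchmarks/configs/v1_release.yaml")
else:
    print("v1-dev below threshold; iterate before scaling up")
```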
## Neon's Thesis
The benchmarks are designed to test five claims:
- Smallest usable VLA — 25 MB checkpoint, runs on Jetson → Efficiency suite
- Video backbone = zero-shot generalization — trained on bridge, works on unseen tasks → Unseen Zero-Shot suite
- Audio-native — speak to your robot → Qualitative demo
- Train in a day, not a week — frozen backbone, only heads train → Training time measurement
- RTC-enabled — smooth real-time control → Latency + trajectory smoothness
The benchmarks prove or disprove these claims. We write the paper second.
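For claim 4, a minimal sketch of the frozen-backbone setup using stand-in modules; Neon's actual backbone and head classes differ, but the pattern is the same: freeze the backbone, give the optimizer only head parameters.

```python
import torch

# Stand-ins: the real model is a video foundation backbone plus action heads.
backbone = torch.nn.Linear(512, 512)
action_head = torch.nn.Linear(512, 32)

for p in backbone.parameters():
    p.requires_grad_(False)  # frozen: no gradients, no optimizer state

optimizer = torch.optim.AdamW(action_head.parameters(), lr=1e-4)
trainable = sum(p.numel() for p in action_head.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```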
→ Back to Evaluation Guide for detailed eval metrics and visualization