Neon Takes Flight¶
Teaching a ground robot's brain to fly — without retraining a single weight.
The Paper That Changed Everything¶
Last week, Stanford and Physical Intelligence dropped AirVLA — a paper that made us sit up straight:
"π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation" — Tucker et al., arXiv:2603.25038
Their core claim: take a VLA trained entirely on ground robot data, and deploy it on a drone — by injecting physics constraints at inference time, not at training time.
No fine-tuning. No new data. Just math.
We read it three times. Then we started adapting.
What We Took (and Why)¶
1. Aerial Action Space¶
Neon's action space was built for the Unitree G1 — 29 joints, all revolute, all firmly on the ground. AirVLA showed us that the same framework can encode:
| Mode | DoF | What It Controls |
|---|---|---|
| `AERIAL_NAV` | 4 | x, y, z, yaw — pure navigation |
| `AERIAL_MANIP` | 5 | + gripper — pick things mid-air |
| `AERIAL_FULL` | 7 | + roll, pitch — full 6-DoF + gripper |
```python
from neon.data.action_space import ControlMode, G1ActionSpace

# Same robot class, different control mode
action_space = G1ActionSpace(mode=ControlMode.AERIAL_MANIP)
```
The key insight: action spaces are just structured vectors with limits. A drone's (x, y, z, yaw, gripper) is no different from a humanoid's (shoulder_pitch, elbow, wrist_roll) — both are bounded, both need normalization, both feed into the same flow-matching decoder.
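That claim is easy to make concrete. A minimal sketch of such a bounded action space, with made-up `AERIAL_MANIP`-style limits (not values from the Neon repo):

```python
import numpy as np

# Illustrative sketch of "action spaces are just structured vectors with
# limits". The bounds below are hypothetical, not from the repo.
LOWS = np.array([-5.0, -5.0, 0.0, -np.pi, 0.0])   # x, y, z, yaw, gripper
HIGHS = np.array([5.0, 5.0, 3.0, np.pi, 1.0])

def normalize(action):
    """Map a physical action into [-1, 1] per dimension."""
    return 2.0 * (action - LOWS) / (HIGHS - LOWS) - 1.0

def denormalize(normed):
    """Map a normalized action back to physical units."""
    return (normed + 1.0) / 2.0 * (HIGHS - LOWS) + LOWS

raw = np.array([1.0, -2.5, 1.5, 0.0, 0.5])
roundtrip = denormalize(normalize(raw))
```

Everything downstream of the limits — normalization, clipping, decoding — is platform-agnostic, which is why the same flow-matching head can serve both bodies.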
2. Physics-Aware Guidance (The Real Magic)¶
This is the headline feature. AirVLA's §3.4 introduces a beautiful idea:
Instead of retraining for new physics, inject gradient corrections into the flow-matching sampling process.
The math is elegant:

$$a^{k+1} = a^k + v_\theta(a^k, t_k)\,\Delta t \;-\; \lambda\,\nabla_{a^k}\,\Phi(a^k)$$

where Φ is any differentiable loss over the denoised action chunk and λ is the guidance scale. The flow-matching sampler already iterates — you just add a gradient nudge at each step.
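To make the nudge concrete, here is a toy guided Euler step. Both `v_theta` and the guidance loss are stand-ins (a trivial velocity field and a quadratic pull toward zero), not AirVLA's or Neon's actual implementations:

```python
import numpy as np

# Toy illustration of guidance-augmented flow-matching sampling.
# v_theta is a placeholder; in practice it is a trained network, and
# the guidance loss is one of the functions described in this post.

def v_theta(a, t):
    """Placeholder velocity field (a learned model in practice)."""
    return -a

def grad_phi(a):
    """Gradient of a toy quadratic guidance loss phi(a) = 0.5 * ||a||^2."""
    return a

def guided_step(a, t, dt=0.1, scale=0.5):
    """One Euler step of the sampler, plus the guidance gradient nudge."""
    return a + v_theta(a, t) * dt - scale * grad_phi(a)

a = np.array([1.0, -2.0, 0.5])
a_next = guided_step(a, t=0.0)
```

With `scale=0.0` this reduces to the vanilla sampler, which is exactly why the technique needs no retraining.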
We implemented four guidance functions:
PayloadAwareGuidance¶
When a drone picks something up, it sags. AirVLA's solution: detect gripper closure, then bias altitude upward.
```python
guidance = PayloadAwareGuidance(
    altitude_index=2,             # z dimension
    gripper_index=4,              # gripper aperture
    target_altitude_offset=0.05,  # 5 cm up when carrying
)

# At inference time — NO retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)
```
The loss is dead simple:

$$\Phi_\text{payload}(a) = \alpha \,\big(a_z - (z_\text{ref} + \delta)\big)^2$$

where α is payload confidence (0 when the gripper is open, 1 when closed), z_ref is the reference altitude, and δ is the target offset. When α = 0, the term vanishes — pure vanilla flow matching. When α = 1, the sampler prefers altitude-biased chunks.
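A sketch of what such a loss could look like on an action chunk, with hypothetical names (the actual `PayloadAwareGuidance` internals may differ):

```python
import numpy as np

def payload_loss(chunk, z_ref, alpha, offset=0.05, z_index=2):
    """Quadratic pull of the altitude channel toward z_ref + offset,
    gated by payload confidence alpha (0 = gripper open, 1 = closed)."""
    z = chunk[:, z_index]
    return alpha * np.sum((z - (z_ref + offset)) ** 2)

chunk = np.zeros((8, 5))   # 8-step chunk over (x, y, z, yaw, gripper)
chunk[:, 2] = 1.0          # hovering at 1 m
```

The gating by α is what makes the correction free when nothing is carried: the gradient is identically zero, so the sampler is untouched.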
JointTorqueSafetyGuidance¶
This one's a Neon original — we generalized AirVLA's aerial guidance to ground robots. Penalizes actions that exceed joint velocity or torque limits:
```python
guidance = JointTorqueSafetyGuidance.from_action_space(g1_action_space)
actions = flow_head.sample(features, guidance_fn=guidance, guidance_scale=1.0)
```
Deploy a sim-trained model on real hardware with tighter safety margins. No retraining.
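A minimal version of such a penalty: a hinge on finite-difference joint velocities that is zero inside the safe envelope (names and the exact form are our sketch, not the repo's):

```python
import numpy as np

def velocity_limit_penalty(chunk, vel_limits, dt=0.02):
    """Hinge-squared penalty on joint velocities exceeding their limits.
    chunk: (T, n_joints) action chunk; vel_limits: (n_joints,) in rad/s."""
    vel = np.diff(chunk, axis=0) / dt                # finite-difference velocities
    excess = np.maximum(np.abs(vel) - vel_limits, 0.0)
    return np.sum(excess ** 2)

limits = np.full(3, 2.0)  # hypothetical 2 rad/s limit per joint
slow = np.linspace(0.0, 0.1, 6).reshape(6, 1) * np.ones((1, 3))
fast = np.linspace(0.0, 0.5, 6).reshape(6, 1) * np.ones((1, 3))
```

Because the penalty is zero inside the limits, well-behaved chunks pass through the sampler unchanged; only violating chunks get nudged back.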
CollisionAvoidanceGuidance¶
Distance-field guidance for obstacle avoidance:
```python
guidance = CollisionAvoidanceGuidance(
    obstacle_positions=lidar_points,  # (N, 3) from perception
    safe_distance=0.15,               # 15 cm safety bubble
)
```
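Under the hood, distance-field guidance can be as simple as a hinge on clearance. A sketch (the class's real internals may differ):

```python
import numpy as np

def clearance_loss(waypoints, obstacles, safe_distance=0.15):
    """Hinge-squared penalty on waypoints closer than safe_distance
    to any obstacle point; zero once every clearance is respected."""
    # Pairwise distances: (T, N) for T waypoints and N obstacle points
    d = np.linalg.norm(waypoints[:, None, :] - obstacles[None, :, :], axis=-1)
    return np.sum(np.maximum(safe_distance - d, 0.0) ** 2)

obstacles = np.array([[0.0, 0.0, 1.0]])
far = np.array([[1.0, 0.0, 1.0]])    # 1 m away: safe
near = np.array([[0.05, 0.0, 1.0]])  # 5 cm away: inside the bubble
```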
CompositeGuidance¶
The real power: compose them all. Different weights, different concerns, one sampling loop:
```python
guidance = CompositeGuidance([
    (PayloadAwareGuidance(...), 2.0),         # Strong altitude bias
    (CollisionAvoidanceGuidance(...), 1.0),   # Medium obstacle avoidance
    (SmoothTrajectoryGuidance(dt=0.1), 0.5),  # Light jerk reduction
])
```
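The composition itself is just a weighted sum of gradients. A sketch, assuming each guidance term exposes a gradient function:

```python
import numpy as np

def composite_grad(action, terms):
    """Weighted sum of per-term guidance gradients.
    terms: list of (grad_fn, weight) pairs."""
    total = np.zeros_like(action)
    for grad_fn, weight in terms:
        total += weight * grad_fn(action)
    return total

a = np.array([1.0, 2.0])
terms = [(lambda x: x, 2.0), (lambda x: -x, 0.5)]
g = composite_grad(a, terms)
```

Because gradients add, each concern stays an independent module; tuning a weight never requires touching the others, let alone the model.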
3. Real-Time Chunking (RTC) for Drones¶
We already had RTC for the G1 — overlapping action chunks with smooth blending. AirVLA confirmed that the same technique dramatically reduces oscillations at chunk boundaries for aerial platforms too.
The key difference: drones have faster dynamics. A humanoid can tolerate 50ms of jitter; a quadrotor in hover cannot. AirVLA uses 2-step RTC overlap where we use 4-step for the G1.
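The blending step itself is a simple crossfade over the overlap region. A sketch with a 2-step overlap as in the aerial setting (the linear weighting schedule is our assumption):

```python
import numpy as np

def blend_overlap(prev_tail, new_head):
    """Linearly crossfade from the old chunk's tail to the new chunk's
    head over the overlap region. Both inputs: (overlap, action_dim)."""
    n = prev_tail.shape[0]
    w = np.linspace(0.0, 1.0, n)[:, None]  # 0 -> old chunk, 1 -> new chunk
    return (1.0 - w) * prev_tail + w * new_head

prev_tail = np.ones((2, 4))   # last 2 steps of the old chunk
new_head = np.zeros((2, 4))   # first 2 steps of the new chunk
blended = blend_overlap(prev_tail, new_head)
```

The first blended step matches the outgoing chunk and the last matches the incoming one, so there is no discontinuity at either boundary.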
Why This Matters for Neon¶
The Frozen Backbone Thesis, Validated¶
Neon's architecture is built on a bet: freeze a video foundation model, train only a tiny decoder. AirVLA validates this from a completely different angle:
| | AirVLA | Neon |
|---|---|---|
| Base model | π₀ (flow-matching VLA) | Qwen2.5-Omni (video FM) |
| Frozen? | No (but guidance is post-hoc) | Yes (100% frozen) |
| Physics injection | Inference-time guidance | Same — guidance.py |
| Action decoder | Flow-matching head | Flow-matching head |
| Key insight | Don't retrain for new physics | Don't retrain for new modalities |
Both projects converge on the same truth: the best robot brains are pre-trained on non-robot data, and we should stop trying to bake everything into the weights.
One VLA, Every Platform¶
With aerial action spaces + physics guidance, Neon's roadmap becomes:
```
Humanoid (G1)       ─── neon.model ─── FlowMatchingHead ─── 29 DoF
Drone (nav)         ─── neon.model ─── FlowMatchingHead ─── 4 DoF + guidance
Drone (manip)       ─── neon.model ─── FlowMatchingHead ─── 5 DoF + guidance
Quadruped           ─── neon.model ─── FlowMatchingHead ─── 12 DoF + guidance
Mobile manipulator  ─── neon.model ─── FlowMatchingHead ─── 10 DoF + guidance
```
Same frozen backbone. Same video understanding. Different decoders, different guidance functions. One brain, many bodies.
The Code¶
Everything is in the repo:
| File | What |
|---|---|
| `neon/model/guidance.py` | All guidance functions (5 classes, ~350 lines) |
| `neon/data/action_space.py` | Aerial control modes added |
| `tests/test_airvla_integration.py` | Full test coverage |
Install and try:
```python
from neon.model.guidance import (
    PayloadAwareGuidance,
    JointTorqueSafetyGuidance,
    CollisionAvoidanceGuidance,
    CompositeGuidance,
)

# Your drone picks up a package — guidance compensates for sag
guidance = PayloadAwareGuidance(
    altitude_index=2,
    gripper_index=4,
    target_altitude_offset=0.05,
)
guidance.update_state(gripper_aperture=0.1)  # Gripper mostly closed

# Inject into flow-matching sampling — zero retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)
```
What's Next¶
- Real drone testing — We need to validate on actual hardware (Crazyflie or DJI platform)
- SDF guidance — Replace point obstacles with signed distance fields from NeRF/3DGS
- Force feedback — Use F/T sensor readings as additional guidance signal
- Multi-robot guidance — Composite guidance across a humanoid-drone team
References¶
- Tucker et al., "π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation", arXiv:2603.25038, 2026. airvla.github.io
- Black et al., "π₀: A Vision-Language-Action Flow Model for General Robot Control", 2024.
- Lipman et al., "Flow Matching for Generative Modeling", ICLR 2023.
The difference between a ground robot and a flying robot is just a guidance function. The brain is the same.