
Neon Takes Flight

Teaching a ground robot's brain to fly — without retraining a single weight.


The Paper That Changed Everything

Last week, Stanford and Physical Intelligence dropped AirVLA — a paper that made us sit up straight:

"π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation" — Tucker et al., arXiv:2603.25038

Their core claim: take a VLA trained entirely on ground robot data, and deploy it on a drone — by injecting physics constraints at inference time, not at training time.

No fine-tuning. No new data. Just math.

We read it three times. Then we started adapting.


What We Took (and Why)

1. Aerial Action Space

Neon's action space was built for the Unitree G1 — 29 joints, all revolute, all firmly on the ground. AirVLA showed us that the same framework can encode:

Mode           DoF   What It Controls
AERIAL_NAV       4   x, y, z, yaw — pure navigation
AERIAL_MANIP     5   + gripper — pick things mid-air
AERIAL_FULL      7   + roll, pitch — full 6-DoF + gripper

from neon.data.action_space import ControlMode, G1ActionSpace

# Same Robot class, different mode
action_space = G1ActionSpace(mode=ControlMode.AERIAL_MANIP)

The key insight: action spaces are just structured vectors with limits. A drone's (x, y, z, yaw, gripper) is no different from a humanoid's (shoulder_pitch, elbow, wrist_roll) — both are bounded, both need normalization, both feed into the same flow-matching decoder.
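
To make "structured vectors with limits" concrete, here is a minimal normalization sketch. The bounds and dimension ordering below are illustrative placeholders, not the actual values in neon/data/action_space.py:

import numpy as np

# Hypothetical bounds for an AERIAL_MANIP action: x, y, z, yaw, gripper
low  = np.array([-2.0, -2.0,  0.0, -np.pi, 0.0])
high = np.array([ 2.0,  2.0,  3.0,  np.pi, 1.0])

def normalize(action):
    # Map a bounded action into [-1, 1], the range the flow-matching decoder works in
    return 2.0 * (action - low) / (high - low) - 1.0

def denormalize(action_norm):
    # Inverse map from decoder output back to physical units
    return low + (action_norm + 1.0) * (high - low) / 2.0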

2. Physics-Aware Guidance (The Real Magic)

This is the headline feature. AirVLA's §3.4 introduces a beautiful idea:

Instead of retraining for new physics, inject gradient corrections into the flow-matching sampling process.

The math is elegant:

v_guided = v_base - λ · ∇_x Φ(denoise(x_t))

Where Φ is any differentiable loss over the denoised action chunk. The flow-matching sampler already iterates — you just add a gradient nudge at each step.
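
To see how the nudge slots into sampling, here is a rough Euler-integration sketch of guided flow matching. The names v_base_fn and denoise_fn and the step count are assumptions for illustration, not the repo's API:

import torch

def sample_with_guidance(v_base_fn, denoise_fn, guidance_fn, x, n_steps=10, guidance_scale=1.0):
    # Integrate the flow ODE from noise (t = 0) to actions (t = 1),
    # adding a gradient correction from the guidance loss at every step.
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x.detach().requires_grad_(True)
        v = v_base_fn(x, t)                        # base flow-matching velocity
        phi = guidance_fn(denoise_fn(x, t))        # differentiable loss on the denoised chunk
        grad = torch.autograd.grad(phi, x)[0]      # ∇_x Φ(denoise(x_t))
        x = (x + dt * (v - guidance_scale * grad)).detach()
    return x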

We implemented four guidance functions:

PayloadAwareGuidance

When a drone picks something up, it sags. AirVLA's solution: detect gripper closure, then bias altitude upward.

guidance = PayloadAwareGuidance(
    altitude_index=2,          # z dimension
    gripper_index=4,           # gripper aperture
    target_altitude_offset=0.05,  # 5cm up when carrying
)

# At inference time — NO retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)

The loss is dead simple:

Φ(A) = α · Σ_t (A[t, z] - z_target)²

Where α is payload confidence (0 when gripper open, 1 when closed). When α = 0, it vanishes — pure vanilla flow matching. When α = 1, the sampler prefers altitude-biased chunks.
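
Translated literally into code, the loss is a few lines. In this sketch α and the reference altitude are supplied from outside (the real class tracks them via update_state), and the argument names are illustrative:

import torch

def payload_loss(actions, alpha, current_altitude, altitude_index=2, offset=0.05):
    # actions: (horizon, action_dim) denoised chunk; alpha: payload confidence in [0, 1]
    z_target = current_altitude + offset        # bias altitude up by 5 cm when carrying
    z = actions[:, altitude_index]
    return alpha * ((z - z_target) ** 2).sum()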

JointTorqueSafetyGuidance

This one's a Neon original — we generalized AirVLA's aerial guidance to ground robots. Penalizes actions that exceed joint velocity or torque limits:

guidance = JointTorqueSafetyGuidance.from_action_space(g1_action_space)
actions = flow_head.sample(features, guidance_fn=guidance, guidance_scale=1.0)

Deploy a sim-trained model on real hardware with tighter safety margins. No retraining.
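
A minimal sketch of the kind of penalty involved, a hinge on finite-difference joint velocities; the names, shapes, and timestep here are assumed rather than taken from guidance.py:

import torch

def joint_velocity_loss(actions, vel_limits, dt=0.02):
    # actions: (horizon, n_joints) chunk; vel_limits: (n_joints,) max |velocity| per joint
    vel = (actions[1:] - actions[:-1]) / dt            # finite-difference joint velocities
    excess = (vel.abs() - vel_limits).clamp(min=0.0)   # only penalize beyond the limit
    return (excess ** 2).sum()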

CollisionAvoidanceGuidance

Distance-field guidance for obstacle avoidance:

guidance = CollisionAvoidanceGuidance(
    obstacle_positions=lidar_points,  # (N, 3) from perception
    safe_distance=0.15,               # 15cm safety bubble
)
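
The underlying loss is a hinge on the distance field. The sketch below assumes the first three action dimensions are x, y, z position (as in the aerial modes above); it illustrates the idea rather than the class's exact implementation:

import torch

def collision_loss(actions, obstacle_positions, safe_distance=0.15):
    # actions: (horizon, action_dim) aerial chunk; obstacle_positions: (N, 3) from perception
    positions = actions[:, :3]
    dists = torch.cdist(positions, obstacle_positions)   # (horizon, N) pairwise distances
    violation = (safe_distance - dists).clamp(min=0.0)   # positive only inside the safety bubble
    return (violation ** 2).sum()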

CompositeGuidance

The real power: compose them all. Different weights, different concerns, one sampling loop:

guidance = CompositeGuidance([
    (PayloadAwareGuidance(...), 2.0),   # Strong altitude bias
    (CollisionAvoidanceGuidance(...), 1.0),  # Medium obstacle avoidance
    (SmoothTrajectoryGuidance(dt=0.1), 0.5),  # Light jerk reduction
])
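
Under the hood, composition is just a weighted sum of the individual losses, so one gradient pass covers every concern. A minimal sketch (not the repo's class):

class CompositeLoss:
    # Weighted sum of guidance losses over the same action chunk.
    def __init__(self, terms):
        self.terms = terms    # list of (loss_fn, weight) pairs

    def __call__(self, actions):
        return sum(weight * loss_fn(actions) for loss_fn, weight in self.terms)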

3. Real-Time Chunking (RTC) for Drones

We already had RTC for the G1 — overlapping action chunks with smooth blending. AirVLA confirmed that the same technique dramatically reduces oscillations at chunk boundaries for aerial platforms too.

The key difference: drones have faster dynamics. A humanoid can tolerate 50ms of jitter; a quadrotor in hover cannot. AirVLA uses a 2-step RTC overlap, whereas we use 4 steps for the G1.
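
For reference, a toy version of the blending step, assuming a simple linear cross-fade over the overlap window (our actual RTC code may blend differently):

import numpy as np

def blend_chunks(prev_tail, new_head):
    # prev_tail, new_head: (overlap, action_dim) overlapping steps of consecutive chunks
    w = np.linspace(0.0, 1.0, len(prev_tail))[:, None]   # 0 keeps the old chunk, 1 trusts the new
    return (1.0 - w) * prev_tail + w * new_head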


Why This Matters for Neon

The Frozen Backbone Thesis, Validated

Neon's architecture is built on a bet: freeze a video foundation model, train only a tiny decoder. AirVLA validates this from a completely different angle:

                    AirVLA                           Neon
Base model          π₀ (flow-matching VLA)           Qwen2.5-Omni (video FM)
Frozen?             No (but guidance is post-hoc)    Yes (100% frozen)
Physics injection   Inference-time guidance          Same — guidance.py
Action decoder      Flow-matching head               Flow-matching head
Key insight         Don't retrain for new physics    Don't retrain for new modalities

Both projects converge on the same truth: the best robot brains are pre-trained on non-robot data, and we should stop trying to bake everything into the weights.

One VLA, Every Platform

With aerial action spaces + physics guidance, Neon's roadmap becomes:

Humanoid (G1)        ─── neon.model ─── FlowMatchingHead ─── 29 DoF
Drone (nav)          ─── neon.model ─── FlowMatchingHead ───  4 DoF + guidance
Drone (manip)        ─── neon.model ─── FlowMatchingHead ───  5 DoF + guidance
Quadruped            ─── neon.model ─── FlowMatchingHead ─── 12 DoF + guidance
Mobile manipulator   ─── neon.model ─── FlowMatchingHead ─── 10 DoF + guidance

Same frozen backbone. Same video understanding. Different decoders, different guidance functions. One brain, many bodies.


The Code

Everything is in the repo:

File                               What
neon/model/guidance.py             All guidance functions (5 classes, ~350 lines)
neon/data/action_space.py          Aerial control modes added
tests/test_airvla_integration.py   Full test coverage

Install and try:

pip install neon-vla

from neon.model.guidance import (
    PayloadAwareGuidance,
    JointTorqueSafetyGuidance,
    CollisionAvoidanceGuidance,
    CompositeGuidance,
)

# Your drone picks up a package — guidance compensates for sag
guidance = PayloadAwareGuidance(
    altitude_index=2,
    gripper_index=4,
    target_altitude_offset=0.05,
)
guidance.update_state(gripper_aperture=0.1)  # Gripper mostly closed

# Inject into flow-matching sampling — zero retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)

What's Next

  • Real drone testing — We need to validate on actual hardware (Crazyflie or DJI platform)
  • SDF guidance — Replace point obstacles with signed distance fields from NeRF/3DGS
  • Force feedback — Use F/T sensor readings as additional guidance signal
  • Multi-robot guidance — Composite guidance across a humanoid-drone team

References

  1. Tucker et al., "π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation", arXiv:2603.25038, 2026. airvla.github.io
  2. Black et al., "π₀: A Vision-Language-Action Flow Model for General Robot Control", 2024.
  3. Lipman et al., "Flow Matching for Generative Modeling", ICLR 2023.

The difference between a ground robot and a flying robot is just a guidance function. The brain is the same.