
Neon Takes Flight

Teaching a ground robot's brain to fly — without retraining a single weight.


The Paper That Changed Everything

Last week, Stanford and Physical Intelligence dropped AirVLA — a paper that made us sit up straight:

"π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation" — Tucker et al., arXiv:2603.25038

Their core claim: take a VLA trained entirely on ground robot data, and deploy it on a drone — by injecting physics constraints at inference time, not at training time.

No fine-tuning. No new data. Just math.

We read it three times. Then we started adapting.


What We Took (and Why)

1. Aerial Action Space

Neon's action space was built for the Unitree G1 — 29 joints, all revolute, all firmly on the ground. AirVLA showed us that the same framework can encode:

Mode           DoF   What It Controls
AERIAL_NAV       4   x, y, z, yaw — pure navigation
AERIAL_MANIP     5   + gripper — pick things mid-air
AERIAL_FULL      7   + roll, pitch — full 6-DoF + gripper

from neon.data.action_space import ControlMode, G1ActionSpace

# Same Robot class, different mode
action_space = G1ActionSpace(mode=ControlMode.AERIAL_MANIP)

The key insight: action spaces are just structured vectors with limits. A drone's (x, y, z, yaw, gripper) is no different from a humanoid's (shoulder_pitch, elbow, wrist_roll) — both are bounded, both need normalization, both feed into the same flow-matching decoder.
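
To make "structured vectors with limits" concrete, here is a minimal normalization sketch. The bounds and dimension ordering below are illustrative placeholders, not the actual values in neon/data/action_space.py:

import numpy as np

# Hypothetical bounds for an AERIAL_MANIP action: x, y, z, yaw, gripper
low  = np.array([-2.0, -2.0,  0.0, -np.pi, 0.0])
high = np.array([ 2.0,  2.0,  3.0,  np.pi, 1.0])

def normalize(action):
    # Map a bounded action into [-1, 1], the range the flow-matching decoder works in
    return 2.0 * (action - low) / (high - low) - 1.0

def denormalize(action_norm):
    # Inverse map from decoder output back to physical units
    return low + (action_norm + 1.0) * (high - low) / 2.0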

2. Physics-Aware Guidance (The Real Magic)

This is the headline feature. AirVLA's §3.4 introduces a beautiful idea:

Instead of retraining for new physics, inject gradient corrections into the flow-matching sampling process.

The math is elegant:

v_guided = v_base - λ · ∇_x Φ(denoise(x_t))

Where Φ is any differentiable loss over the denoised action chunk. The flow-matching sampler already iterates — you just add a gradient nudge at each step.
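
To see how the nudge slots into sampling, here is a rough Euler-integration sketch of guided flow matching. The names v_base_fn and denoise_fn and the step count are assumptions for illustration, not the repo's API:

import torch

def sample_with_guidance(v_base_fn, denoise_fn, guidance_fn, x, n_steps=10, guidance_scale=1.0):
    # Integrate the flow ODE from noise (t = 0) to actions (t = 1),
    # adding a gradient correction from the guidance loss at every step.
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x.detach().requires_grad_(True)
        v = v_base_fn(x, t)                        # base flow-matching velocity
        phi = guidance_fn(denoise_fn(x, t))        # differentiable loss on the denoised chunk
        grad = torch.autograd.grad(phi, x)[0]      # ∇_x Φ(denoise(x_t))
        x = (x + dt * (v - guidance_scale * grad)).detach()
    return x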

We implemented four guidance functions:

PayloadAwareGuidance

When a drone picks something up, it sags. AirVLA's solution: detect gripper closure, then bias altitude upward.

guidance = PayloadAwareGuidance(
    altitude_index=2,          # z dimension
    gripper_index=4,           # gripper aperture
    target_altitude_offset=0.05,  # 5cm up when carrying
)

# At inference time — NO retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)

The loss is dead simple:

Φ(A) = α · Σ_t (A[t, z] - z_target)²

Where α is payload confidence (0 when gripper open, 1 when closed). When α = 0, it vanishes — pure vanilla flow matching. When α = 1, the sampler prefers altitude-biased chunks.
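
Translated literally into code, the loss is a few lines. In this sketch α and the reference altitude are supplied from outside (the real class tracks them via update_state), and the argument names are illustrative:

import torch

def payload_loss(actions, alpha, current_altitude, altitude_index=2, offset=0.05):
    # actions: (horizon, action_dim) denoised chunk; alpha: payload confidence in [0, 1]
    z_target = current_altitude + offset        # bias altitude up by 5 cm when carrying
    z = actions[:, altitude_index]
    return alpha * ((z - z_target) ** 2).sum()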

JointTorqueSafetyGuidance

This one's a Neon original — we generalized AirVLA's aerial guidance to ground robots. Penalizes actions that exceed joint velocity or torque limits:

guidance = JointTorqueSafetyGuidance.from_action_space(g1_action_space)
actions = flow_head.sample(features, guidance_fn=guidance, guidance_scale=1.0)

Deploy a sim-trained model on real hardware with tighter safety margins. No retraining.
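
A minimal sketch of the kind of penalty involved, a hinge on finite-difference joint velocities; the names, shapes, and timestep here are assumed rather than taken from guidance.py:

import torch

def joint_velocity_loss(actions, vel_limits, dt=0.02):
    # actions: (horizon, n_joints) chunk; vel_limits: (n_joints,) max |velocity| per joint
    vel = (actions[1:] - actions[:-1]) / dt            # finite-difference joint velocities
    excess = (vel.abs() - vel_limits).clamp(min=0.0)   # only penalize beyond the limit
    return (excess ** 2).sum()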

CollisionAvoidanceGuidance

Distance-field guidance for obstacle avoidance:

guidance = CollisionAvoidanceGuidance(
    obstacle_positions=lidar_points,  # (N, 3) from perception
    safe_distance=0.15,               # 15cm safety bubble
)
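
The underlying loss is a hinge on the distance field. The sketch below assumes the first three action dimensions are x, y, z position (as in the aerial modes above); it illustrates the idea rather than the class's exact implementation:

import torch

def collision_loss(actions, obstacle_positions, safe_distance=0.15):
    # actions: (horizon, action_dim) aerial chunk; obstacle_positions: (N, 3) from perception
    positions = actions[:, :3]
    dists = torch.cdist(positions, obstacle_positions)   # (horizon, N) pairwise distances
    violation = (safe_distance - dists).clamp(min=0.0)   # positive only inside the safety bubble
    return (violation ** 2).sum()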

CompositeGuidance

The real power: compose them all. Different weights, different concerns, one sampling loop:

guidance = CompositeGuidance([
    (PayloadAwareGuidance(...), 2.0),   # Strong altitude bias
    (CollisionAvoidanceGuidance(...), 1.0),  # Medium obstacle avoidance
    (SmoothTrajectoryGuidance(dt=0.1), 0.5),  # Light jerk reduction
])
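
Under the hood, composition is just a weighted sum of the individual losses, so one gradient pass covers every concern. A minimal sketch (not the repo's class):

class CompositeLoss:
    # Weighted sum of guidance losses over the same action chunk.
    def __init__(self, terms):
        self.terms = terms    # list of (loss_fn, weight) pairs

    def __call__(self, actions):
        return sum(weight * loss_fn(actions) for loss_fn, weight in self.terms)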

3. Real-Time Chunking (RTC) for Drones

We already had RTC for the G1 — overlapping action chunks with smooth blending. AirVLA confirmed that the same technique dramatically reduces oscillations at chunk boundaries for aerial platforms too.

The key difference: drones have faster dynamics. A humanoid can tolerate 50ms of jitter; a quadrotor in hover cannot. AirVLA uses a 2-step RTC overlap, whereas we use 4 steps for the G1.
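
For reference, a toy version of the blending step, assuming a simple linear cross-fade over the overlap window (our actual RTC code may blend differently):

import numpy as np

def blend_chunks(prev_tail, new_head):
    # prev_tail, new_head: (overlap, action_dim) overlapping steps of consecutive chunks
    w = np.linspace(0.0, 1.0, len(prev_tail))[:, None]   # 0 keeps the old chunk, 1 trusts the new
    return (1.0 - w) * prev_tail + w * new_head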


Why This Matters for Neon

The Frozen Backbone Thesis, Validated

Neon's architecture is built on a bet: freeze a video foundation model, train only a tiny decoder. AirVLA validates this from a completely different angle:

                    AirVLA                           Neon
Base model          π₀ (flow-matching VLA)           Qwen2.5-Omni (video FM)
Frozen?             No (but guidance is post-hoc)    Yes (100% frozen)
Physics injection   Inference-time guidance          Same — guidance.py
Action decoder      Flow-matching head               Flow-matching head
Key insight         Don't retrain for new physics    Don't retrain for new modalities

Both projects converge on the same truth: the best robot brains are pre-trained on non-robot data, and we should stop trying to bake everything into the weights.

One VLA, Every Platform

With aerial action spaces + physics guidance, Neon's roadmap becomes:

Humanoid (G1)        ─── neon.model ─── FlowMatchingHead ─── 29 DoF
Drone (nav)          ─── neon.model ─── FlowMatchingHead ───  4 DoF + guidance
Drone (manip)        ─── neon.model ─── FlowMatchingHead ───  5 DoF + guidance
Quadruped            ─── neon.model ─── FlowMatchingHead ─── 12 DoF + guidance
Mobile manipulator   ─── neon.model ─── FlowMatchingHead ─── 10 DoF + guidance

Same frozen backbone. Same video understanding. Different decoders, different guidance functions. One brain, many bodies.


The Code

Everything is in the repo:

File                               What
neon/model/guidance.py             All guidance functions (5 classes, ~350 lines)
neon/data/action_space.py          Aerial control modes added
tests/test_airvla_integration.py   Full test coverage

Install and try:

pip install neon-vla

from neon.model.guidance import (
    PayloadAwareGuidance,
    JointTorqueSafetyGuidance,
    CollisionAvoidanceGuidance,
    CompositeGuidance,
)

# Your drone picks up a package — guidance compensates for sag
guidance = PayloadAwareGuidance(
    altitude_index=2,
    gripper_index=4,
    target_altitude_offset=0.05,
)
guidance.update_state(gripper_aperture=0.1)  # Gripper mostly closed

# Inject into flow-matching sampling — zero retraining
actions = flow_head.sample(
    features,
    guidance_fn=guidance,
    guidance_scale=2.0,
)

What's Next

  • Real drone testing — We need to validate on actual hardware (Crazyflie or DJI platform)
  • SDF guidance — Replace point obstacles with signed distance fields from NeRF/3DGS
  • Force feedback — Use F/T sensor readings as additional guidance signal
  • Multi-robot guidance — Composite guidance across a humanoid-drone team

References

  1. Tucker et al., "π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation", arXiv:2603.25038, 2026. airvla.github.io
  2. Black et al., "π₀: A Vision-Language-Action Flow Model for General Robot Control", 2024.
  3. Lipman et al., "Flow Matching for Generative Modeling", ICLR 2023.

The difference between a ground robot and a flying robot is just a guidance function. The brain is the same.