intbot_edge_vlm · Thor Real-Time VLM¶

The flagship recipe. Deploys Cosmos-Reason2 to Jetson AGX Thor for real-time robot perception with <200 ms/frame latency.

Adapted from cookbook/intbot_edge_vlm.

Architecture¶

sequenceDiagram
    participant Camera
    participant Thor as Jetson Thor
    participant Server as TRT-EdgeLLM
    participant NATS as NATS bus
    participant Robot as Robot controller

    Camera->>Thor: H.264 RTP stream (5600/udp)
    Thor->>Thor: gst: RTP→JPEG (HW decode)
    Thor->>Server: POST /v1/chat/completions {jpeg_b64, prompt}
    Server->>Server: Visual encoder (TRT) + LLM (TRT, FP8)
    Server-->>Thor: {text, usage}
    Thor->>NATS: publish perception.vlm {text, ts}
    NATS-->>Robot: subscribe perception.vlm

Prerequisites¶

x86 GPU host with tensorrt-edgellm-* CLI
Jetson AGX Thor with TensorRT-LLM built
RTP camera source (can be a laptop during testing)
NATS server (optional, docker run -p 4222:4222 nats)

Step 1 — Prep the model on x86¶

just prep-edge-model reason2-2b ./models/R2-fp8
# → ./models/R2-fp8/hf/        (HF weights, ~4 GB)
# → ./models/R2-fp8/quantized/ (FP8 weights, ~1.5 GB)
# → ./models/R2-fp8/onnx/      (LLM ONNX + visual encoder ONNX)

Copy ONNX to Thor:

scp -r ./models/R2-fp8/onnx cagatay@thor.local:~/R2-fp8-onnx

Step 2 — Build engines on Thor¶

ssh cagatay@thor.local
cd ~/thor-cosmos
just build-engines ~/R2-fp8-onnx ~/R2-fp8-engines
# → ~/R2-fp8-engines/llm/     (TRT engine, ~1.8 GB)
# → ~/R2-fp8-engines/visual/  (TRT engine, ~800 MB)

Engine build takes ~20-30 minutes on Thor — this is one-time.

Step 3 — Serve¶

just serve-start ~/R2-fp8-engines/llm ~/R2-fp8-engines/visual
just serve-status
# 🟢 running pid=12345  http://127.0.0.1:8080

Tail logs:

just serve-logs 40

Step 4 — Smoke test¶

# With a static image:
just infer assets/test.jpg "count people"
# "There are 3 people in the scene."

# With a live frame:
just rtp-capture 5600 /tmp/live.jpg 800 600 5
just infer /tmp/live.jpg "describe the scene in JSON"

Step 5 — Real-time loop¶

tmux new -s perception
just perception-loop perception.vlm \
  "Describe the scene; count people; report clothing colors."

Expected throughput: 3-5 FPS on Thor at 800×600 with 128-token output.

Step 6 — Consume events¶

# In another terminal:
nats sub perception.vlm
# [#1] Received on "perception.vlm"
# {"text": "Two workers in high-vis vests...", "ts": 1715000000}

Or subscribe from the robot controller:

import nats, json

async def on_msg(msg):
    data = json.loads(msg.data)
    print(data["text"])

async def main():
    nc = await nats.connect("nats://thor.local:4222")
    await nc.subscribe("perception.vlm", cb=on_msg)

Tuning for latency¶

Knob	Default	Latency-optimized
`max_tokens`	256	64-128
`temperature`	0.2	0.0
Prompt length	—	keep <300 chars
`max_image_tokens` (engine)	10240	4096 if image is small

Troubleshooting¶

cannot reach VLM server → just serve-status. If 🔴, check just serve-logs.

no frame captured → verify RTP_BIND matches the interface the camera sends to. Try hw_decode=False indirectly by rebuilding with avdec_h264 path (automatic fallback in recipe).

High latency → check nvpmodel -q is MAXN_SUPER; check thermals with just sysinfo.

OOM building engine → lower max_image_tokens or use a larger swap file.