Cosmos-Xenna (Data Curation)
Xenna is NVIDIA's seven-stage video data curation pipeline, the same one used to prepare Cosmos training corpora.
Stages
- split — shot detection & clipping
- transcode — normalize resolution/fps/codec
- crop — detect and remove borders/letterboxing
- filter — drop low-quality / duplicate / unsafe clips
- caption — auto-label with VLM captions
- dedup — embedding-based near-dupe removal
- shard — pack into WebDataset/TAR shards
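As a rough illustration of how a stage selection such as "all" or "split,transcode,filter" maps onto the list above, here is a minimal sketch. The `resolve_stages` helper and the `STAGES` tuple are hypothetical and not part of Xenna. That a subset runs in pipeline order, and that unknown names are rejected, are assumptions made for this example only.

```python
from __future__ import annotations

# Stage names in the order listed above. The helper below is hypothetical.
STAGES = ("split", "transcode", "crop", "filter", "caption", "dedup", "shard")


def resolve_stages(spec: str) -> list[str]:
    """Expand "all" or a comma-separated subset into pipeline order."""
    if spec.strip() == "all":
        return list(STAGES)
    requested = [name.strip() for name in spec.split(",") if name.strip()]
    unknown = [name for name in requested if name not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    # Assumption: stages run in pipeline order, whatever order was typed.
    return [name for name in STAGES if name in requested]


print(resolve_stages("split,transcode,filter"))
```

Checking the stage string up front like this catches a misspelled stage name before any video is read.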
Agent tool
cosmos_curate(
    input_dir="./raw_videos",
    output_dir="./outputs/curated",
    stages="all",  # or "split,transcode,filter"
    num_workers=8,
)
CLI
just curate ./raw_videos ./outputs/curated all 8
just curate ./raw_videos ./outputs/curated "split,caption" 4
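When driving the CLI from a script, the command can be assembled as an argument list. The sketch below assumes the positional order is input directory, output directory, stages, then worker count, as inferred from the two invocations above. The `curate_command` helper is hypothetical.

```python
import shlex


def curate_command(input_dir, output_dir, stages="all", num_workers=8):
    """Return the argv list for the `just curate` recipe (assumed argument order)."""
    return ["just", "curate", input_dir, output_dir, stages, str(num_workers)]


cmd = curate_command("./raw_videos", "./outputs/curated", "split,caption", 4)
# Print a shell-safe rendering of the command without running it.
print(shlex.join(cmd))
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the comma-separated stage string.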
Typical pipelines
Full pipeline
Captioning only (already cleaned)
Test run (subset of stages)
Ray cluster
Xenna uses Ray for distributed work. The num_workers argument sets the parallelism on each node. For multi-node runs, configure Ray in your Cosmos-Xenna clone (see COSMOS_XENNA_REPO).
Output layout
outputs/curated/
├── shards/
│ ├── train-000000.tar
│ ├── train-000001.tar
│ └── ...
├── captions/
│ └── captions.parquet
└── stats.json
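A quick way to sanity-check a finished run is to walk this layout. The sketch below uses only the Python standard library and assumes the directory names shown above. The `summarize_output` helper is hypothetical, and because the schema of stats.json is not documented here, it is loaded as an opaque dictionary.

```python
import json
import tarfile
from pathlib import Path


def summarize_output(root):
    """Collect shard names, per-shard member counts, and stats from an output dir."""
    root = Path(root)
    shards = sorted((root / "shards").glob("*.tar"))
    members = {}
    for shard in shards:
        # Count the entries in each TAR shard without extracting anything.
        with tarfile.open(shard) as tar:
            members[shard.name] = len(tar.getnames())
    stats_path = root / "stats.json"
    # stats.json is treated as free-form JSON; no particular keys are assumed.
    stats = json.loads(stats_path.read_text()) if stats_path.exists() else {}
    return {
        "shards": [shard.name for shard in shards],
        "members": members,
        "has_captions": (root / "captions" / "captions.parquet").exists(),
        "stats": stats,
    }
```

Calling `summarize_output("./outputs/curated")` after a run returns a small dictionary that can be printed or compared between runs. Reading captions.parquet itself would need a Parquet reader such as pandas or pyarrow, which this sketch deliberately leaves out.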