Evaluation Metrics
Twelve metrics, organized by what they measure.
cosmos_evaluate(
    metric="fvd",
    pred_path="./outputs/predict2_5",
    gt_path="./ground-truth",
    output_dir="./outputs/eval",
)
Metric catalogue
Video quality (Predict 2.5)
| Metric | What | Script |
|---|---|---|
| fid | Fréchet Inception Distance (image-level) | scripts/metrics/qualitative/compute_fid.py |
| fvd | Fréchet Video Distance (video-level) | scripts/metrics/qualitative/compute_fvd.py |
| dover | Disentangled video quality | scripts/metrics/control/compute_dover.py |
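For orientation, fid and fvd are both Fréchet distances between two Gaussians fitted to feature embeddings, one for the generated set and one for the ground-truth set. The standard closed form is given below. The symbols are notation introduced here, not taken from the scripts: \mu_p, \Sigma_p are the mean and covariance of the predicted features and \mu_g, \Sigma_g those of the ground truth. Which feature extractor each script uses is not stated in this document.

```latex
d^2(p, g) = \lVert \mu_p - \mu_g \rVert_2^2
          + \operatorname{Tr}\!\left( \Sigma_p + \Sigma_g - 2\,(\Sigma_p \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer. The covariance estimate is only as good as the sample count, so scores from small evaluation sets should be compared with care.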
Geometrical consistency (Predict 2.5)
| Metric | What |
|---|---|
| tse | Triangulation / Scene Evaluation: stereo consistency |
| cse | Camera / Scene Evaluation: pose-aware |
| sampson | Sampson distance: epipolar-geometry fit |
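As a reminder of what the sampson metric measures, the textbook Sampson distance for a point correspondence x ↔ x' (homogeneous image coordinates) under a fundamental matrix F is shown below. This is the general definition, not a description of the repository's script: how correspondences are found and how per-pair distances are aggregated over frames are not specified here.

```latex
d_S(x, x') = \frac{\left( x'^{\top} F x \right)^2}
                  {(F x)_1^2 + (F x)_2^2 + (F^{\top} x')_1^2 + (F^{\top} x')_2^2}
```

The subscripts select the first and second components of each vector. The expression is a first-order approximation of the geometric reprojection error, so values near zero indicate that the matched points agree with the epipolar geometry.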
Control fidelity (Transfer 2.5)
| Metric | Matches |
|---|---|
| canny_f1 | Edge-control output vs. Canny of GT |
| depth_rmse | Depth-control output vs. GT depth |
| seg_miou | Seg-control output vs. GT segmentation |
| blur_ssim | Visibility / blur preservation |
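To make the score names concrete, here is a minimal pure-Python sketch of the three simpler quantities behind canny_f1, depth_rmse, and seg_miou, computed on flat per-pixel sequences. The function names are hypothetical and this is not the repository's implementation: the actual scripts may differ in edge extraction, depth scaling, class handling, and averaging. SSIM is left out because it needs windowed statistics.

```python
import math
from typing import Sequence


def _check_same_size(pred: Sequence, gt: Sequence) -> None:
    """Reject empty inputs and inputs of different length."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")


def binary_f1(pred: Sequence[int], gt: Sequence[int]) -> float:
    """F1 score between two binary edge masks given as flat 0/1 sequences."""
    _check_same_size(pred, gt)
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if g and not p)
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: two all-zero masks count as a match.
    return 1.0 if denom == 0 else 2 * tp / denom


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root-mean-square error between two depth maps of equal size."""
    _check_same_size(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))


def mean_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Mean intersection-over-union over the labels present in either map."""
    _check_same_size(pred, gt)
    ious = []
    for label in set(pred) | set(gt):
        inter = sum(1 for p, g in zip(pred, gt) if p == label and g == label)
        union = sum(1 for p, g in zip(pred, gt) if p == label or g == label)
        ious.append(inter / union)  # union >= 1 because the label occurs somewhere
    return sum(ious) / len(ious)
```

Note the direction of each score: F1 and mIoU are higher-is-better and bounded by 1, while RMSE is lower-is-better and carries the units of the depth map.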
VLM reasoning
| Metric | What |
|---|---|
| reason_critic | Cosmos-Reason2 judges output quality |
| reason_reward | Cosmos-Reason1-7B-Reward scalar |
CLI
just evaluate fvd ./outputs/predict ./gt
just evaluate canny_f1 ./outputs/transfer ./gt
just evaluate reason_critic ./outputs/predict "" # no GT needed
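When several metrics need to run in sequence, the same commands can be assembled from Python. The sketch below only builds and prints the argument lists. The helper name build_evaluate_command is hypothetical, and the argument order (metric, predictions, ground truth) is inferred from the three examples above rather than from the recipe's definition.

```python
import shlex
from typing import List, Optional


def build_evaluate_command(
    metric: str, pred_path: str, gt_path: Optional[str] = None
) -> List[str]:
    """Return the argv list for `just evaluate <metric> <pred> <gt>`."""
    # An empty string fills the ground-truth slot when no GT is given,
    # mirroring the reason_critic example.
    return ["just", "evaluate", metric, pred_path, gt_path or ""]


commands = [
    build_evaluate_command("fvd", "./outputs/predict", "./gt"),
    build_evaluate_command("canny_f1", "./outputs/transfer", "./gt"),
    build_evaluate_command("reason_critic", "./outputs/predict"),
]

for argv in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

To execute a command rather than print it, each list can be passed to subprocess.run(argv, check=True) from the repository root, which assumes just is installed and the evaluate recipe is available there.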
Pick the right metric
| You're evaluating | Use |
|---|---|
| Unconditional Predict 2.5 generation | fid, fvd |
| Action-conditioned Predict 2.5 | fvd, tse, reason_critic |
| Transfer 2.5 with edge control | canny_f1, fid |
| Transfer 2.5 with depth control | depth_rmse, fid |
| Transfer 2.5 with seg control | seg_miou, fid |
| VLM perception quality | reason_critic, reason_reward |
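The recommendations above can be encoded as a small lookup so an evaluation script picks its metrics from one place. This is a sketch only: the scenario keys and the metrics_for helper are names invented here, while the metric lists are copied from the table.

```python
from typing import Dict, List

# Scenario keys are hypothetical labels; the metric lists mirror the table.
RECOMMENDED_METRICS: Dict[str, List[str]] = {
    "predict_unconditional": ["fid", "fvd"],
    "predict_action_conditioned": ["fvd", "tse", "reason_critic"],
    "transfer_edge": ["canny_f1", "fid"],
    "transfer_depth": ["depth_rmse", "fid"],
    "transfer_seg": ["seg_miou", "fid"],
    "vlm_perception": ["reason_critic", "reason_reward"],
}


def metrics_for(scenario: str) -> List[str]:
    """Return a copy of the recommended metric list for a scenario."""
    try:
        return list(RECOMMENDED_METRICS[scenario])
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_METRICS))
        raise ValueError(
            f"unknown scenario {scenario!r}; expected one of: {known}"
        ) from None
```

Returning a copy keeps callers from mutating the shared table by accident, and the explicit error lists the valid keys instead of failing silently on a typo.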
Each metric writes a JSON report under output_dir:
{
  "metric": "fvd",
  "score": 118.4,
  "n_samples": 32,
  "pred_path": "./outputs/predict",
  "gt_path": "./ground-truth",
  "timestamp": "2026-05-07T12:34:56Z"
}
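Because every report shares the fields shown above, the results of several runs can be gathered with a few lines of Python. The sketch below assumes each report is a .json file directly inside output_dir. The file naming is not specified in this document, so the glob pattern and the load_scores helper are assumptions.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_scores(output_dir: Union[str, Path]) -> Dict[str, float]:
    """Collect {metric: score} from every *.json report in output_dir."""
    scores: Dict[str, float] = {}
    for report_path in sorted(Path(output_dir).glob("*.json")):
        with report_path.open(encoding="utf-8") as fh:
            report = json.load(fh)
        # Skip JSON files that do not look like a metric report.
        if isinstance(report, dict) and "metric" in report and "score" in report:
            scores[report["metric"]] = float(report["score"])
    return scores


if __name__ == "__main__":
    for metric, score in load_scores("./outputs/eval").items():
        print(f"{metric}: {score}")
```

If the same metric is reported more than once, the file that sorts last wins. Keying on (metric, timestamp) instead would preserve every run.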
References