Skip to content

Failure Recovery

OSMO workflow failure modes and what to do about them.

Status vocabulary

  • COMPLETED
  • 🔄 RUNNING
  • PENDING - waiting for quota / scheduler
  • FAILED / FAILED_CANCELED / FAILED_EXEC_TIMEOUT / FAILED_SERVER_ERROR

Common patterns

Quota exhausted

agent("""
Submit train.yaml. If it fails because quota is full,
re-submit with priority=LOW.
""")

OOM

agent("""
If the workflow fails with an OOM error, halve the memory request
in the YAML and resubmit.
""")

Validation error at submit time

The OSMO server rejects bad sizing with detailed assertions. The agent can read the error, adjust the resource fields, and retry without losing context.

Image pull failure

Almost always a registry-credential or image-tag issue. Use osmo_workflow_logs to fetch the failure detail; the message will name the missing image.