Failure Recovery¶
OSMO workflow failure modes and what to do about them.
Status vocabulary¶
- ✅
COMPLETED - 🔄
RUNNING - ⏳
PENDING- waiting for quota / scheduler - ❌
FAILED/FAILED_CANCELED/FAILED_EXEC_TIMEOUT/FAILED_SERVER_ERROR
Common patterns¶
Quota exhausted¶
OOM¶
agent("""
If the workflow fails with an OOM error, halve the memory request
in the YAML and resubmit.
""")
Validation error at submit time¶
The OSMO server rejects bad sizing with detailed assertions. The agent can read the error, adjust the resource fields, and retry without losing context.
Image pull failure¶
Almost always a registry-credential or image-tag issue. Use
osmo_workflow_logs to fetch the failure detail; the message will name the
missing image.