Error Handling in Agent Crews
When a deployed crew’s code raises an exception, you want to see it on the call record, in the dashboard, and in webhooks, not buried in pod stdout where only kubectl logs can find it.
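The Agent Crew SDK closes this gap with four cooperating layers, each catching errors earlier than the next:
- Layer 1: Pre-deploy env var validation
- Layer 2: Pod startup health check
- Layer 3: Auto-emit on user-code exceptions
- Layer 4: Emitting errors explicitly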
You don’t need to opt into any of these — they’re on by default. The sections below cover when to interact with each one.
Layer 1 — Pre-deploy env var validation
smallestai agent-crew deploy AST-walks every .py file in your build directory and warns about every environment variable your code reads:
SMALLEST_API_KEY is suppressed from this list — the platform always sets it. Anything else is your responsibility to provide.
The warning is purely informational — the deploy proceeds either way. We list the env vars so you can decide whether to bake them into your Docker image, switch to a credential-free alternative, or accept that the first call will fail.
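To make the AST walk concrete, here is a minimal sketch of how such a scan could work using only the standard library. It is not the CLI's actual implementation: the function names (find_env_vars and the helpers) are hypothetical, and it assumes Python 3.9+ and that variables are read through os.environ[...], os.environ.get(...), or os.getenv(...) with a literal string name.

```python
import ast
from typing import Optional, Set

# Per the docs, the platform always sets this one, so it is not reported.
PLATFORM_PROVIDED = {"SMALLEST_API_KEY"}


def _is_os_environ(node: ast.AST) -> bool:
    """True when the expression is exactly `os.environ`."""
    return (
        isinstance(node, ast.Attribute)
        and node.attr == "environ"
        and isinstance(node.value, ast.Name)
        and node.value.id == "os"
    )


def _first_str_arg(call: ast.Call) -> Optional[str]:
    """Return the first positional argument if it is a string literal."""
    if call.args:
        arg = call.args[0]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            return arg.value
    return None


def find_env_vars(source: str) -> Set[str]:
    """Collect env var names that the given source code reads (hypothetical helper)."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        # os.environ["NAME"] used as a read (Load context), not an assignment
        if isinstance(node, ast.Subscript) and _is_os_environ(node.value):
            key = node.slice
            if (
                isinstance(node.ctx, ast.Load)
                and isinstance(key, ast.Constant)
                and isinstance(key.value, str)
            ):
                found.add(key.value)
        # os.environ.get("NAME") and os.getenv("NAME")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            func = node.func
            is_environ_get = func.attr == "get" and _is_os_environ(func.value)
            is_getenv = (
                func.attr == "getenv"
                and isinstance(func.value, ast.Name)
                and func.value.id == "os"
            )
            if is_environ_get or is_getenv:
                name = _first_str_arg(node)
                if name is not None:
                    found.add(name)
    return found - PLATFORM_PROVIDED
```

A scan like this only sees names written as string literals. Variables read through a computed name or a settings library would not show up, which is one reason a warning of this kind can only be advisory.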
Layer 2 — Pod startup health check
AtomsCrewApp exposes two HTTP endpoints alongside its WebSocket:
The platform’s readiness probe polls /ready. If your setup handler raises during startup (because, say, OpenAIClient.__init__ validates the API key and yours is missing), the pod stays not-ready and the build is auto-marked broken. No traffic ever reaches a crew that can’t initialize.
What “ready” means
When the FastAPI server starts, it runs a dry-run of your setup_handler against a no-op session. This exercises every node’s __init__, add_node(), and add_edge() paths — but doesn’t run any session logic. If everything constructs cleanly, the pod flips to ready. If any node raises, the pod stays at 503 and the reason is in the response body:
This means that when you run pip install -r requirements.txt && python app.py locally, the app exercises the same startup path the platform uses. If /ready returns 200 locally with the same env vars, it’ll return 200 in the cloud.
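The dry-run idea can be sketched in a few lines. This is an illustration under stated assumptions, not the SDK's code: NoOpSession, readiness_response, and the JSON body shape ("ready", "reason") are all hypothetical names chosen for the example.

```python
import json
from typing import Callable, Tuple


class NoOpSession:
    """Hypothetical stand-in session: accepts graph construction, runs no session logic."""

    def add_node(self, node) -> None:
        pass

    def add_edge(self, src, dst) -> None:
        pass


def readiness_response(setup_handler: Callable[[NoOpSession], None]) -> Tuple[int, str]:
    """Dry-run the setup handler and return (HTTP status, JSON body)."""
    try:
        # Exercises node constructors plus add_node()/add_edge() calls only.
        setup_handler(NoOpSession())
    except Exception as exc:
        # Assumed body shape: the failure reason is surfaced to whoever polls.
        body = {"ready": False, "reason": f"{type(exc).__name__}: {exc}"}
        return 503, json.dumps(body)
    return 200, json.dumps({"ready": True})
```

Because the check only constructs the graph, it catches missing credentials and bad wiring at startup but says nothing about errors that appear once real events flow. Those are covered by the later layers.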
Layer 3 — Auto-emit on user-code exceptions
Whenever your process_event() or generate_response() raises an uncaught exception, the SDK automatically:
- Catches the exception inside the event-handler loop (so one bad event doesn’t kill the node).
- Emits an SDKAgentErrorEvent upstream with the message, full traceback, and the exception’s class name.
- Logs the exception locally.
- Continues processing the next event.
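The four steps above can be sketched as a small loop. This is a simplified model rather than the SDK's internals: ErrorEvent is a hypothetical stand-in for SDKAgentErrorEvent, and run_event_loop, emit, and the default severity are assumptions made for the example.

```python
import logging
import traceback
from dataclasses import dataclass, field

logger = logging.getLogger("crew-node")


@dataclass
class ErrorEvent:
    """Hypothetical stand-in for SDKAgentErrorEvent."""
    message: str
    severity: str
    payload: dict = field(default_factory=dict)


def run_event_loop(node_name, events, process_event, emit, severity="warning"):
    """Process each event; on failure, emit an error event, log, and keep going."""
    for event in events:
        try:
            process_event(event)
        except Exception as exc:
            # Step 2: send the error upstream with the conventional payload keys.
            emit(ErrorEvent(
                message=str(exc),
                severity=severity,
                payload={
                    "node_name": node_name,
                    "error_class": type(exc).__name__,
                    "traceback": traceback.format_exc(),
                },
            ))
            # Step 3: log locally, including the stack trace.
            logger.exception("process_event failed in node %s", node_name)
            # Step 4: fall through to the next event instead of re-raising.
```

The key design point is that the try/except sits inside the loop body, so a single failing event cannot stop the node from handling the ones that follow.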
The orchestrator records the error on the call:
- calllog.errors[] gets a new entry with {source, severity, message, payload, timestamp}, where payload carries node_name, error_class, traceback, and any custom keys you added.
- If severity is "fatal", calllog.failureReason is populated and the call is terminated cleanly.
- The Events tab in the dashboard renders each entry inline on the call timeline.
- Webhooks (post-conversation) include the errors in their payload.
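As a rough model of the recording step, the sketch below appends an entry to a call log and flags fatal errors. The orchestrator's real logic is not shown in this guide, so record_error is hypothetical, the call log is modelled as a plain dict, and copying the message into failureReason is an assumption.

```python
from datetime import datetime, timezone


def record_error(calllog: dict, event: dict) -> bool:
    """Append the error to calllog["errors"]; return True if the call should end."""
    entry = {key: event[key] for key in ("source", "severity", "message", "payload")}
    entry["timestamp"] = datetime.now(timezone.utc).isoformat()
    calllog.setdefault("errors", []).append(entry)

    if entry["severity"] == "fatal":
        # Assumption: the failure reason mirrors the error message.
        calllog["failureReason"] = entry["message"]
        return True
    return False
```

Non-fatal errors accumulate in errors[] without affecting the call, so a single call record can carry several warnings followed by at most one terminating fatal entry.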
Severity semantics
SDKAgentErrorEvent.severity controls whether the call is failed:
You can override the default by setting severity= on an event you emit manually (see below), or by overriding _error_severity() on your node class.
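A per-node override might look like the sketch below. The real signature of _error_severity() is not given in this guide, so the exception argument, the BaseNode stand-in, and its "warning" default are all assumptions made for illustration.

```python
class BaseNode:
    """Hypothetical stand-in for an SDK node base class."""

    def _error_severity(self, exc: Exception) -> str:
        # Assumed default for a background node: record the error, keep the call alive.
        return "warning"


class PaymentNode(BaseNode):
    """Example node that treats authorization failures as call-ending."""

    def _error_severity(self, exc: Exception) -> str:
        if isinstance(exc, PermissionError):
            return "fatal"
        # Everything else keeps the inherited default.
        return super()._error_severity(exc)
```

Deciding severity from the exception type keeps the policy in one place, rather than scattering severity= arguments across every manual emit.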
Layer 4 — Emitting errors explicitly
The auto-emission only fires when your code lets an exception bubble up out of process_event / generate_response. If you catch your own exception for a graceful fallback, the SDK won’t see it — you need to emit the error event manually so it’s still recorded.
This is the recommended pattern for background nodes that have a sensible fallback:
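A minimal sketch of the catch, fall back, and emit pattern follows. The SDK's actual emit call is not shown in this guide, so the emit callable and the plain-dict event are hypothetical stand-ins. The node name, _analyze_sentiment, and the "neutral" fallback follow the example used later on this page.

```python
import traceback


class SentimentNode:
    """Hypothetical background node; `emit` stands in for the SDK's send method."""

    def __init__(self, analyze, emit):
        self._analyze_sentiment = analyze
        self._emit = emit
        self.current_sentiment = "neutral"

    def process_event(self, text: str) -> None:
        try:
            self.current_sentiment = self._analyze_sentiment(text)
        except Exception as exc:
            # Graceful fallback: the call carries on with a safe default.
            self.current_sentiment = "neutral"
            # The exception was caught here, so it must be reported explicitly.
            self._emit({
                "severity": "warning",
                "message": f"sentiment analysis failed: {exc}",
                "payload": {
                    "node_name": "sentiment-analyzer",
                    "error_class": type(exc).__name__,
                    "traceback": traceback.format_exc(),
                },
            })
```

Calling traceback.format_exc() inside the except block captures the stack while the exception is still active, which is what makes the manual record as useful as an auto-emitted one.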
For output nodes, the existing wrapper in OutputCrewNode._handle_llm_request already auto-emits with severity="fatal" — you only need manual emission if you’re catching errors yourself.
Customer-application errors (not exceptions)
You can also emit SDKAgentErrorEvent for business-logic errors that aren’t actually Python exceptions:
This lets you reuse the same error-record / call-failure pipeline for application-level validation failures without inventing a separate event type.
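For instance, a validation rule could report a failure without raising anything. In this sketch the refund rule, validate_refund, the "refund-validator" node name, and the emit callable are all hypothetical, and the event is modelled as a plain dict rather than the SDK class.

```python
def validate_refund(amount: float, limit: float, emit) -> bool:
    """Return True if the refund is allowed; otherwise emit a fatal error event."""
    if amount > limit:
        emit({
            "severity": "fatal",
            "message": f"refund of {amount} exceeds limit of {limit}",
            "payload": {
                "node_name": "refund-validator",
                # Custom keys: no exception exists, so there is no traceback to attach.
                "requested_amount": amount,
                "limit": limit,
            },
        })
        return False
    return True
```

Whether a business-rule failure should be "fatal" or "warning" depends on whether the call can usefully continue. The sketch picks "fatal" only to show the call-ending path.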
Complete example
Putting all four layers together — what happens when a customer deploys a crew that references OPENAI_API_KEY but the env var isn’t set in the pod:
At deploy
CLI prints the env-var warning. Customer acknowledges and proceeds anyway (to test the platform).
At pod startup
OpenAIClient.__init__ raises (it validates the key shape). /ready returns 503. The build is auto-marked broken; no calls are routed.
If the platform somehow routes a call
The first _analyze_sentiment call raises. The customer’s try/except catches the exception, so the node emits SDKAgentErrorEvent(severity="warning", payload={"node_name": "sentiment-analyzer", "error_class": "OpenAIError", "traceback": "..."}) manually and the call continues with current_sentiment = "neutral".
Reference
SDKAgentErrorEvent schema
Conventional payload keys (auto-populated by the SDK when it emits on your behalf):
When emitting manually you can include any other keys (customer_id, request_id, attempt_number, etc.) — they’re stored alongside the conventional keys and queryable end-to-end.

