Error Handling in Agent Crews

When a deployed crew’s code raises an exception, you want to know about it — on the call record, in the dashboard, and in webhooks. Not buried in pod stdout where only kubectl logs can find it.

The Agent Crew SDK closes this gap with four cooperating layers, each catching errors earlier than the next:

| Layer | Catches | When |
|---|---|---|
| 1. Pre-deploy validation | Missing env var refs in your code | smallestai agent-crew deploy |
| 2. Pod startup health check | __init__ exceptions (missing API keys, bad imports) | Pod boot, before accepting calls |
| 3. Auto-emit on user-code exceptions | Runtime errors in process_event / generate_response | First time the code path runs |
| 4. Orchestrator routing | All emitted errors | Real-time during the call |

You don’t need to opt into any of these — they’re on by default. The sections below cover when to interact with each one.


Layer 1 — Pre-deploy env var validation

smallestai agent-crew deploy AST-walks every .py file in your build directory and warns about every environment variable your code reads:

    $ smallestai agent-crew deploy --entry-point app.py
    Deploying agent from: /Users/me/projects/my-crew
    Entry point: app.py
    Agent ID: 6931c89118641b32aa6b5818

    ⚠ Your code references these environment variables:
     • OPENAI_API_KEY
     • POSTGRES_URL

    The deploy pipeline does not propagate `.env` values into the running
    pod yet. If these aren't set in the runtime image, your code will
    fail at first use (e.g. `openai.OpenAIError: Missing credentials`).

    Packaging agent code...
    ✓ Package created (0.31 MB)
    Uploading to Atoms platform...
    ✓ Deployment successful!

SMALLEST_API_KEY is suppressed from this list — the platform always sets it. Anything else is your responsibility to provide.

The warning is purely informational — the deploy proceeds either way. We list the env vars so you can decide whether to bake them into your Docker image, switch to a credential-free alternative, or accept that the first call will fail.
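
To make the AST walk concrete, here is a minimal sketch of how such a scan could work. It is not the CLI's actual implementation: the function names are hypothetical, and it assumes Python 3.9+ and that only literal names read through os.environ[...], os.environ.get(...), or os.getenv(...) are of interest.

```python
import ast
from typing import Optional, Set

# Assumption: SMALLEST_API_KEY is the only platform-provided variable to suppress.
PLATFORM_PROVIDED = {"SMALLEST_API_KEY"}


def _is_os_environ(node: ast.AST) -> bool:
    """True when the node is the expression `os.environ`."""
    return (
        isinstance(node, ast.Attribute)
        and node.attr == "environ"
        and isinstance(node.value, ast.Name)
        and node.value.id == "os"
    )


def _literal_str(node: ast.AST) -> Optional[str]:
    """Return the value of a string literal node, or None for anything else."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    return None


def find_env_vars(source: str) -> Set[str]:
    """Collect env var names that the given source code reads."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        name = None
        if isinstance(node, ast.Subscript) and _is_os_environ(node.value):
            # Only count reads (os.environ["X"]), not assignments or deletes.
            if isinstance(node.ctx, ast.Load):
                name = _literal_str(node.slice)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute) and node.args:
            func = node.func
            is_environ_get = func.attr == "get" and _is_os_environ(func.value)
            is_getenv = (
                func.attr == "getenv"
                and isinstance(func.value, ast.Name)
                and func.value.id == "os"
            )
            if is_environ_get or is_getenv:
                name = _literal_str(node.args[0])
        if name is not None:
            found.add(name)
    return found - PLATFORM_PROVIDED


sample = (
    "import os\n"
    "key = os.environ['OPENAI_API_KEY']\n"
    "db = os.getenv('POSTGRES_URL')\n"
    "token = os.environ.get('SMALLEST_API_KEY')\n"
)
print(sorted(find_env_vars(sample)))  # ['OPENAI_API_KEY', 'POSTGRES_URL']
```

A scan like this only sees literal names, so a variable read through a computed key or an aliased import would go unreported. Running something similar locally before deploying is one way to double-check which variables the runtime image must provide.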


Layer 2 — Pod startup health check

AtomsCrewApp exposes two health-probe HTTP endpoints alongside its WebSocket:

| Endpoint | Purpose | Returns |
|---|---|---|
| GET /health | Liveness: the process is up | Always 200 if the pod is running |
| GET /ready | Readiness: the crew is initialized and ready to accept calls | 503 with structured reason until _validate_startup() passes; 200 after |

The platform’s readiness probe polls /ready. If your setup handler raises during startup (because, say, OpenAIClient.__init__ validates the API key and yours is missing), the pod stays not-ready and the build is auto-marked broken. No traffic ever reaches a crew that can’t initialize.

What “ready” means

When the FastAPI server starts, it runs a dry-run of your setup_handler against a no-op session. This exercises every node’s __init__, add_node(), and add_edge() paths — but doesn’t run any session logic. If everything constructs cleanly, the pod flips to ready. If any node raises, the pod stays at 503 and the reason is in the response body:

    $ curl http://localhost:8080/ready
    {"detail":{"status":"not_ready","reason":"OpenAIError: Missing API key"}}

This means that when you run pip install -r requirements.txt && python app.py locally, the app exercises the same startup path the platform uses. If /ready returns 200 locally with the same env vars, it will return 200 in the cloud.
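
The dry-run idea can be sketched in plain Python, independent of the web framework. Everything here is a hypothetical stand-in rather than the SDK's code: NoOpSession and check_ready are invented names, the setup handler is assumed to be synchronous, and the 200 body is an assumption (only the 503 body shape is taken from the curl output above).

```python
from typing import Any, Callable, Dict, List, Tuple


class NoOpSession:
    """Hypothetical stand-in session: accepts graph wiring, runs no session logic."""

    def __init__(self) -> None:
        self.nodes: List[Any] = []
        self.edges: List[Tuple[Any, Any]] = []

    def add_node(self, node: Any) -> None:
        self.nodes.append(node)

    def add_edge(self, src: Any, dst: Any) -> None:
        self.edges.append((src, dst))


def check_ready(setup_handler: Callable[[NoOpSession], None]) -> Tuple[int, Dict[str, Any]]:
    """Dry-run the setup handler and map the outcome to an HTTP status and body."""
    try:
        setup_handler(NoOpSession())
    except Exception as exc:
        reason = f"{type(exc).__name__}: {exc}"
        return 503, {"detail": {"status": "not_ready", "reason": reason}}
    # Assumption: the real 200 body may look different.
    return 200, {"status": "ready"}


class OpenAIError(Exception):
    """Stand-in for the client library's error class."""


def broken_setup(session: NoOpSession) -> None:
    # Simulates a node whose constructor validates a missing API key.
    raise OpenAIError("Missing API key")


def working_setup(session: NoOpSession) -> None:
    session.add_node("support-agent")


print(check_ready(broken_setup)[0])   # 503
print(check_ready(working_setup)[0])  # 200
```

The point of the design is that construction errors surface once, at boot, with a readable reason, instead of on the first live call.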


Layer 3 — Auto-emit on user-code exceptions

Whenever your process_event() or generate_response() raises an uncaught exception, the SDK automatically:

  1. Catches the exception inside the event-handler loop (so one bad event doesn’t kill the node).
  2. Emits an SDKAgentErrorEvent upstream with the message, full traceback, and the exception’s class name.
  3. Logs the exception locally.
  4. Continues processing the next event.
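
The four steps above can be sketched as a small event loop. This is an illustration of the pattern, not the SDK's internals: ErrorEvent stands in for SDKAgentErrorEvent, SketchNode is a hypothetical node, and the emitted list stands in for the upstream channel.

```python
import asyncio
import logging
import traceback
from dataclasses import dataclass, field
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


@dataclass
class ErrorEvent:
    """Stand-in for SDKAgentErrorEvent (fields taken from the Reference section)."""
    message: str
    severity: str = "warning"
    payload: Dict[str, Any] = field(default_factory=dict)


class SketchNode:
    """Hypothetical node; only the error-handling shape matters here."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.emitted: List[ErrorEvent] = []  # stands in for the upstream channel
        self.processed: List[Any] = []

    async def process_event(self, event: Any) -> None:
        if event == "bad":
            raise ValueError("cannot handle event")
        self.processed.append(event)

    async def send_event(self, event: ErrorEvent) -> None:
        self.emitted.append(event)

    async def run(self, events: List[Any]) -> None:
        for event in events:
            try:
                await self.process_event(event)
            except Exception as exc:  # 1. catch inside the loop
                await self.send_event(ErrorEvent(  # 2. emit upstream
                    message=str(exc),
                    severity="warning",
                    payload={
                        "node_name": self.name,
                        "error_class": type(exc).__name__,
                        "traceback": traceback.format_exc(),
                    },
                ))
                logger.exception("[%s] event handler failed", self.name)  # 3. log locally
                continue  # 4. move on to the next event


node = SketchNode("sentiment-analyzer")
asyncio.run(node.run(["a", "bad", "b"]))
print(node.processed)  # ['a', 'b']
```

Because the try/except sits inside the loop rather than around it, the failing event is skipped while the events before and after it are still processed.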

The orchestrator records the error on the call:

  • calllog.errors[] gets a new entry with {source, severity, message, payload, timestamp} — where payload carries node_name, error_class, traceback, and any custom keys you added.
  • If severity is "fatal", calllog.failureReason is populated and the call is terminated cleanly.
  • The Events tab in the dashboard renders each entry inline on the call timeline.
  • Webhooks (post-conversation) include the errors in their payload.
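
As a rough mental model of the recording step, the sketch below appends an entry with the documented keys and flags fatal errors. It is a guess at the shape, not the orchestrator's code: record_error is a hypothetical function, the call log is modelled as a plain dict, the "agent" source value is invented, and using the message as failureReason is an assumption (the platform's actual format may differ).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def record_error(
    calllog: Dict[str, Any],
    source: str,
    severity: str,
    message: str,
    payload: Optional[Dict[str, Any]] = None,
) -> bool:
    """Append an error entry; return True when the call should be terminated."""
    entry = {
        "source": source,
        "severity": severity,
        "message": message,
        "payload": dict(payload or {}),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    calllog.setdefault("errors", []).append(entry)
    if severity == "fatal":
        # Keep the first fatal reason if several arrive.
        if not calllog.get("failureReason"):
            calllog["failureReason"] = message
        return True
    return False


calllog: Dict[str, Any] = {}
record_error(calllog, "agent", "warning", "sentiment lookup failed",
             {"node_name": "sentiment-analyzer", "error_class": "OpenAIError"})
should_end = record_error(calllog, "agent", "fatal", "LLM request failed",
                          {"node_name": "support-agent", "error_class": "OpenAIError"})
print(len(calllog["errors"]), should_end)  # 2 True
```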

Severity semantics

SDKAgentErrorEvent.severity controls whether the call is failed:

| Value | Default for | Effect |
|---|---|---|
| "warning" | BackgroundCrewNode subclasses | Recorded on the call, call continues. Use for non-critical side-channels (sentiment, audit, observability) where the failure doesn’t break the conversation. |
| "fatal" | OutputCrewNode subclasses | Recorded on the call, call ends, failureReason is populated. Use when the agent itself fails: there is no point continuing if the user-facing brain isn’t responding. |

You can override the default by setting severity= on an event you emit manually (see below), or by overriding _error_severity() on your node class.
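
A class-level override might look like the sketch below. The base class here is a local stand-in so the snippet runs on its own; in real code you would subclass the SDK's BackgroundCrewNode. The signature of _error_severity() (no arguments, returns the severity string) is an assumption, and ComplianceRecorder is a hypothetical node.

```python
class BackgroundCrewNode:
    """Stand-in for the SDK base class in smallestai.atoms.crew.nodes."""

    def __init__(self, name: str) -> None:
        self.name = name

    def _error_severity(self) -> str:
        # Documented default for background nodes.
        return "warning"


class ComplianceRecorder(BackgroundCrewNode):
    """Hypothetical background node whose failure should end the call."""

    def _error_severity(self) -> str:
        # Escalate: an unrecorded call is treated as a failed call.
        return "fatal"


print(ComplianceRecorder("compliance-recorder")._error_severity())  # fatal
```

Overriding at the class level changes the severity of every auto-emitted error from that node, whereas severity= on a manual event affects only that one event.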


Layer 4 — Emitting errors explicitly

The auto-emission only fires when your code lets an exception bubble up out of process_event / generate_response. If you catch your own exception for a graceful fallback, the SDK won’t see it — you need to emit the error event manually so it’s still recorded.

This is the recommended pattern for background nodes that have a sensible fallback:

    import logging

    from smallestai.atoms.crew.events import SDKAgentErrorEvent
    from smallestai.atoms.crew.nodes import BackgroundCrewNode

    logger = logging.getLogger(__name__)


    class SentimentAnalyzer(BackgroundCrewNode):
        async def _analyze_sentiment(self, text: str):
            try:
                # _call_openai is this node's own helper (not shown)
                sentiment = await self._call_openai(text)
            except Exception as e:
                logger.error(f"[SentimentAnalyzer] {e}")
                # Fall back to neutral so the conversation continues
                self.current_sentiment = "neutral"
                # But still record the error on the call
                await self.send_event(SDKAgentErrorEvent(
                    message=str(e),
                    severity="warning",
                    payload={
                        "node_name": self.name,
                        "error_class": type(e).__name__,
                    },
                ))
                return
            self.current_sentiment = sentiment

For output nodes, the existing wrapper in OutputCrewNode._handle_llm_request already auto-emits with severity="fatal" — you only need manual emission if you’re catching errors yourself.

Customer-application errors (not exceptions)

You can also emit SDKAgentErrorEvent for business-logic errors that aren’t actually Python exceptions:

    if not customer_consent_recorded:
        await self.send_event(SDKAgentErrorEvent(
            message="Cannot process request without recorded consent",
            severity="fatal",
            payload={
                "node_name": self.name,
                "error_class": "ConsentMissingError",
                "customer_id": customer_id,  # any custom keys are welcome
                "request_id": request_id,
            },
        ))
        return

This lets you reuse the same error-record / call-failure pipeline for application-level validation failures without inventing a separate event type.


Complete example

Putting all four layers together — what happens when a customer deploys a crew that references OPENAI_API_KEY but the env var isn’t set in the pod:

  1. At deploy: The CLI prints the env-var warning. The customer acknowledges it and proceeds anyway (to test the platform).
  2. At pod startup: OpenAIClient.__init__ raises (it validates the key shape). /ready returns 503. The build is auto-marked broken; no calls are routed.
  3. If the platform somehow routes a call: The first _analyze_sentiment call raises, and an SDKAgentErrorEvent(severity="warning", payload={"node_name": "sentiment-analyzer", "error_class": "OpenAIError", "traceback": "..."}) is emitted. The call continues with current_sentiment = "neutral" because the customer’s try/except handles the fallback and emits the event itself.
  4. In the dashboard: The call’s Events tab shows the warning inline. calllog.errors[] has the structured entry. The customer fixes the env var without ever opening pod logs.
  5. If the output agent has the same failure: The event carries severity="fatal" instead of "warning". The call terminates. failureReason is set to "OpenAIError in support-agent.generate_response: Missing credentials". The call detail page surfaces it prominently.


Reference

SDKAgentErrorEvent schema

| Field | Type | Default | Notes |
|---|---|---|---|
| message | str | (required) | Human-readable description |
| severity | "warning" or "fatal" | "warning" | See Severity semantics. Drives whether the call is failed. |
| payload | Dict[str, Any] | {} | Free-form. Stored as JSON on the call record so any keys are queryable downstream. |
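
For readers who prefer code to tables, the schema can be mirrored with a small dataclass. This is only a sketch for reasoning about the fields: ErrorEventSketch is a hypothetical name, the real SDKAgentErrorEvent may be implemented differently, and the validation in __post_init__ is an addition of this example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict

_SEVERITIES = ("warning", "fatal")


@dataclass
class ErrorEventSketch:
    """Mirror of the documented fields; not the SDK class itself."""
    message: str
    severity: str = "warning"
    payload: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validation is this sketch's own addition.
        if self.severity not in _SEVERITIES:
            raise ValueError(
                f"severity must be one of {_SEVERITIES}, got {self.severity!r}"
            )


event = ErrorEventSketch(
    message="Cannot process request without recorded consent",
    severity="fatal",
    payload={"error_class": "ConsentMissingError", "customer_id": "c-123"},
)
# The payload is free-form, so it should stay JSON-serializable.
stored = json.dumps(asdict(event))
print(json.loads(stored)["payload"]["customer_id"])  # c-123
```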

Conventional payload keys (auto-populated by the SDK when it emits on your behalf):

| Key | Notes |
|---|---|
| node_name | Which CrewNode raised. Typically self.name. |
| error_class | The exception class, e.g. "OpenAIError", "ConnectionError" |
| traceback | Full traceback string from traceback.format_exc() |

When emitting manually you can include any other keys (customer_id, request_id, attempt_number, etc.) — they’re stored alongside the conventional keys and queryable end-to-end.
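
When emitting manually, a small helper can keep the conventional keys consistent with what the SDK populates on its own. The helper below is a hypothetical convenience, not part of the SDK; it builds the traceback string from the exception object so it also works outside the except block.

```python
import traceback
from typing import Any, Dict


def build_error_payload(node_name: str, exc: BaseException, **custom: Any) -> Dict[str, Any]:
    """Return the conventional keys plus any caller-supplied custom keys."""
    payload: Dict[str, Any] = {
        "node_name": node_name,
        "error_class": type(exc).__name__,
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    # Custom keys are merged last, so they win on a name clash.
    payload.update(custom)
    return payload


try:
    raise ConnectionError("upstream timed out")
except ConnectionError as exc:
    payload = build_error_payload("sentiment-analyzer", exc, attempt_number=2)

print(payload["error_class"], payload["attempt_number"])  # ConnectionError 2
```

The resulting dict can be passed as payload= when constructing an SDKAgentErrorEvent.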

Endpoints exposed by AtomsCrewApp

| Method | Path | Purpose |
|---|---|---|
| GET | / | Status snapshot: {status, ready, active_sessions} |
| GET | /health | Liveness probe: 200 if process up |
| GET | /ready | Readiness probe: 200 if startup validation passed, 503 with reason otherwise |
| WebSocket | /ws | Agent session connection (used by the platform orchestrator) |
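
For local testing, the readiness endpoint can be polled from a script before sending traffic. The sketch below uses only the standard library; wait_until_ready and http_probe are hypothetical helpers, and the body parsing assumes the 503 response has the {"detail": {"reason": ...}} shape shown in the curl example earlier.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Dict, Tuple

Probe = Callable[[], Tuple[int, Dict[str, Any]]]


def http_probe(url: str) -> Probe:
    """Build a probe that GETs `url` and returns (status, parsed JSON body)."""
    def probe() -> Tuple[int, Dict[str, Any]]:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status, json.loads(resp.read() or b"{}")
        except urllib.error.HTTPError as err:  # non-2xx response, e.g. 503
            return err.code, json.loads(err.read() or b"{}")
        except urllib.error.URLError:  # server not listening yet
            return 0, {}
    return probe


def wait_until_ready(probe: Probe, attempts: int = 10, delay: float = 1.0) -> Dict[str, Any]:
    """Poll until the probe reports 200; raise with the last known reason otherwise."""
    body: Dict[str, Any] = {}
    for attempt in range(attempts):
        status, body = probe()
        if status == 200:
            return body
        if attempt < attempts - 1:
            time.sleep(delay)
    reason = body.get("detail", {}).get("reason", "unknown")
    raise RuntimeError(f"not ready after {attempts} attempts: {reason}")
```

Against a locally running app this could be called as wait_until_ready(http_probe("http://localhost:8080/ready")), using the port from the curl example. Passing the probe in as a callable keeps the polling logic testable without a live server.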