Prompt Scoring


Prompt Scoring analyses your agent’s system prompt across 11 quality dimensions and returns an overall score (0–100), a grade, and actionable feedback for each dimension. Use it to catch weak spots before you go live.

Location: Agent editor → Prompt tab → score badge at the bottom of the editor


Seeing Your Score

When you open the Prompt tab of any single-prompt agent, the editor shows a score badge at the bottom of the prompt area. A coloured label — Strong prompt, Good prompt, Weak prompt, or Poor prompt — tells you the current grade at a glance.

Figure: Weak prompt badge in the agent editor. Click 'View issues' to open the analysis panel.

Click View issues to open the Prompt Analysis side panel.


The Analysis Panel

The panel breaks down your score into individual dimension cards. Each card shows the dimension name, its quality level, and a quoted excerpt from your prompt as evidence.

Figure: Prompt analysis panel showing the overall score, latency estimate, and per-dimension findings.

| Panel field | What it means |
| --- | --- |
| Overall grade | Numeric score (0–100) and label: Excellent / Good / Needs Work / Poor |
| First token latency | Estimated TTFT overhead added by your prompt length |
| Weak prompt / Strong prompt | Summary label derived from the overall score |
| Low / Normal / High latency | Token-density band for the current prompt |

How Scoring Works

Each time you request a score, Atoms sends your prompt through two sequential Gemini passes — a Platform Analyst pass and a Rubric Judge pass — and returns scores for 11 dimensions grouped into three priority tiers.

Tiers and dimensions

| Dimension | What is checked |
| --- | --- |
| Role & Objective | Is the agent’s purpose and scope clearly defined? |
| Personality & Voice | Are tone and style specified with concrete guidance, not just adjectives? |
| Conversation Structure | Are there explicit flow phases and exit criteria? |
| Tool Integration | Are all referenced tools declared, with failure paths covered? |
| Constraints & Safety | Are safety rules, escalation paths, and hard limits spelled out? |

Each dimension is rated Strong, Adequate, Weak, Missing, or Not Applicable.


Quality Levels and Token Bands

Grade thresholds

| Score | Grade |
| --- | --- |
| 90–100 | Excellent |
| 75–89 | Good |
| 50–74 | Needs Work |
| 0–49 | Poor |
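
If you want to reproduce the grade label in a script (for example, to gate a deployment on a minimum grade), the thresholds above can be sketched as a small shell function. The function name `grade_for` is hypothetical and not part of the Atoms API; it only mirrors the grade thresholds listed in this section.

```shell
# Map an overall score (0-100) to its grade label.
# Hypothetical helper; thresholds are copied from the grade table.
grade_for() {
  if [ "$1" -ge 90 ]; then echo "Excellent"
  elif [ "$1" -ge 75 ]; then echo "Good"
  elif [ "$1" -ge 50 ]; then echo "Needs Work"
  else echo "Poor"
  fi
}

grade_for 82   # prints "Good"
```

The checks run from the highest threshold down, so each branch only needs a lower bound.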

Token density bands

The First token latency estimate and the density band are derived from your prompt’s token count.

| Band | Token range | Latency impact |
| --- | --- | --- |
| Lean | Fewer than 4K tokens | Very low |
| Normal | 4K–9.9K tokens | Low |
| Heavy | 10K–14.9K tokens | Moderate |
| Overweight | 15K or more tokens | High |

For voice agents, aim for the Normal band. Heavy and Overweight prompts increase first-token latency, which makes responses feel slower to callers.
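
To check which band a prompt falls into before submitting it, the token ranges above can be sketched as a shell function. This is a minimal illustration: `band_for` is a hypothetical helper, it assumes "4K" means 4,000 tokens (not 4,096), and it expects a token count you have already measured with your own tokenizer.

```shell
# Classify a prompt by token count into a density band.
# Hypothetical helper. Assumption: 1K = 1,000 tokens.
band_for() {
  if [ "$1" -lt 4000 ]; then echo "Lean"
  elif [ "$1" -lt 10000 ]; then echo "Normal"
  elif [ "$1" -lt 15000 ]; then echo "Heavy"
  else echo "Overweight"
  fi
}

band_for 12000   # prints "Heavy"
```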


Scoring via API

You can trigger scoring programmatically against a published version or a draft. Each call deducts 1 credit. Re-submitting an unchanged prompt returns a 400 — retrieve the cached result via GET /agent/{id} instead.

# Score a published version
curl -X POST https://api.smallest.ai/atoms/v1/prompt-scoring/score \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"versionId": "6a1589b75e048394eb37bc47"}'

# Score a draft
curl -X POST https://api.smallest.ai/atoms/v1/prompt-scoring/score \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"draftId": "6a1589b75e048394eb37bc48"}'

The response includes overall_score, overall_grade, band, estimated_ttft_overhead_ms, and a dimensions array with one entry per scored dimension.

Prompt scoring is only supported for single-prompt agents. Conversational flow (workflow-graph) agents return a 400.
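
Because both an unchanged prompt and an unsupported agent type come back as a 400, a script that calls the endpoint may want to inspect the HTTP status rather than assume success. The following is a sketch under stated assumptions, not an official client: `score_body` and `score_prompt` are hypothetical helper names, the IDs are assumed to be plain hex strings with no characters that need JSON escaping, and the error message text is illustrative only.

```shell
# Build the JSON request body.
# $1 is "version" or "draft"; $2 is the corresponding ID.
score_body() {
  case "$1" in
    version) printf '{"versionId": "%s"}' "$2" ;;
    draft)   printf '{"draftId": "%s"}' "$2" ;;
    *)       echo "usage: score_body version|draft ID" >&2; return 1 ;;
  esac
}

# POST the body, print the response, and return non-zero on a non-2xx status.
score_prompt() {
  local body out status
  body=$(score_body "$1" "$2") || return 1
  out=$(mktemp)
  # -o writes the response body to a file; -w prints only the status code.
  status=$(curl -s -o "$out" -w '%{http_code}' -X POST \
    https://api.smallest.ai/atoms/v1/prompt-scoring/score \
    -H "Authorization: Bearer $SMALLEST_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$body")
  cat "$out"
  rm -f "$out"
  case "$status" in
    2*)  return 0 ;;
    400) echo "400: unchanged prompt or unsupported agent type" >&2; return 1 ;;
    *)   echo "unexpected HTTP status: $status" >&2; return 1 ;;
  esac
}

# Usage (makes a real request and deducts a credit):
#   score_prompt draft 6a1589b75e048394eb37bc48
```

Keeping the body builder separate from the network call makes it possible to check the payload without spending a credit.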
