Qualifications
Author and run persona-driven test chat scenarios for a workflow, with optional automated grading.
Qualifications
Qualifications let you build a per-workflow test suite: a set of persona-driven chat scenarios that exercise the workflow end to end, plus optional grading criteria that score how each run went.
Use qualifications to iterate on a workflow with confidence — before changing the agent prompt, the routing rules, or a tool definition, you can replay your scenario suite, see which runs still pass, and inspect any regressions in the conversation transcript.
There is no member enrollment, submission, or admin approval. Qualifications are an authoring and testing surface, not a credentialing system.
Key Concepts
Qualification
A qualification is a thin container that holds the scenarios for one workflow. Each workflow has at most one qualification, and the qualification is auto-created the first time you open the workflow's Evaluate view — you don't create them manually.
Persona
A persona is a reusable test identity scoped to your workspace. Personas are not real patients or members — they are templates for the simulated person on the other side of the conversation.
A persona has:
- Name, description, optional age
- Channel capabilities — whether the persona has a phone and/or email
- Member role — which workspace role the persona maps to when instantiated
- Identity — the prompt fed to the user-simulator LLM that "plays" this persona during automated runs. Write it in second person ("You are…") with personality traits, goals, and constraints
- Default data records — pre-populated data the test member starts with, keyed by data-type slug
Personas live under the Personas tab of the Evaluate view. The same persona can be used across many scenarios and many workflows.
Scenario
A scenario is a specific test under a qualification. It composes a persona with the rest of what the test needs:
- Name, optional description, slug (unique per qualification)
- Persona — the reusable identity
- Goal — what the persona is trying to achieve in this scenario
- Channel — the channel the conversation runs on (phone, web, etc.)
- Data records — scenario-specific seed data that overrides the persona's defaults per data-type slug
- Grading criteria — zero or more pass/fail rules (see below)
Scenarios are ordered within a qualification and can be archived when no longer needed.
Grading Criteria
Each scenario can have any number of grading criteria. Two types:
- Expression criteria — pass/fail using a CEL (Common Expression Language) expression evaluated automatically against the run's transcript and assignment data. Synchronous and free.
- Rubric criteria — LLM-evaluated using a Jinja2 prompt template you write. The grader returns a 0.0–1.0 score, a pass/fail, and brief reasoning. Asynchronous and uses model spend.
Each criterion has a weight (used for the run's aggregate score) and a passing threshold (used by rubric criteria to decide pass/fail from the LLM score).
Test Chat Run
A test chat run is one execution of an automated test. Every run is anchored to a workflow; scenario linkage is optional:
- Scenario-bound runs are launched from a scenario and can be evaluated against its criteria.
- Ad-hoc runs are launched directly from the Test tab with just a persona — no scenario, no grading, just a quick chat to poke at the workflow.
Each run records the chat, the assignment, start/complete/fail timestamps, and (after evaluation) an aggregate score and pass/fail.
A run is driven in one of two modes:
- Auto — an LLM user-simulator plays the persona end to end.
- Human — you (the author) type as the persona in the Test tab. Useful when you want to drive the conversation by hand.
The Evaluate View
Open a workflow and switch to the Evaluate view. The view has four tabs:
- Test — start a new run, scenario-bound or ad-hoc, and watch the conversation live. Use this for quick iteration while authoring.
- Personas — workspace-scoped persona CRUD. Create a persona once, reuse it across every workflow.
- Scenarios — the scenarios for this workflow's qualification. Author goals, seed data, and grading criteria here.
- Runs — history of test chat runs, grouped by scenario. Each row shows the score, pass/fail, and an Evaluate button to (re)run grading on a completed run.
Setting Up a Test Suite
The typical authoring loop:
1. Create a Persona
Go to Personas in the Evaluate view, click Create, and fill in:
- A name and description
- Optional age
- Whether the persona has phone/email
- Which member role they map to
- The identity prompt (second person — "You are…")
- Optional default data records keyed by data-type slug
Personas are workspace-scoped, so this only happens once per persona.
2. Create a Scenario
In the Scenarios tab, click Create Scenario and:
- Pick the persona you just made (or any existing one)
- Write the goal — what is the persona trying to do in this scenario?
- Pick the channel
- Optionally seed data records that override the persona's defaults
- Add grading criteria (see next step)
3. Add Grading Criteria
For each criterion:
- Pick Expression or Rubric
- Give it a name and optional description
- For expression: write a CEL expression (see CEL Reference)
- For rubric: write a Jinja2 prompt template (see Rubric Reference)
- Set the weight (for aggregate scoring) and, for rubric criteria, the passing threshold
You can add criteria later — a scenario with no criteria runs fine, you just won't get a score.
4. Run It
From the Test tab, pick the scenario and start a run. Choose:
- Auto driver — let the LLM user-simulator play the persona end to end.
- Human driver — drive the conversation yourself by typing as the persona.
Watch the chat unfold. When the run completes, you can fire grading from the Runs tab.
Evaluation Pipeline
When you trigger evaluation on a completed run:
- Expression criteria run first. They are deterministic and free.
- If any expression criterion fails, rubric criteria are skipped. Rubric calls cost LLM money — there's no point asking the grader for nuance when a hard rule already failed.
- Rubric criteria run otherwise. Each one is rendered against the same context as expressions, sent to the grader, and the grader returns a score, pass/fail, and reasoning.
- Aggregate score and pass/fail are written back to the run.
overall_scoreis the weighted mean of all scored criteria;passedis true only when every criterion passed.
Each criterion produces one CriterionResult row with its score, pass/fail, and (for rubric criteria) the grader's reasoning.
Only scenario-bound runs are evaluatable. Ad-hoc runs have no criteria and can't be graded.
Writing Expression Criteria (CEL)
Expression criteria are CEL expressions that evaluate to true (pass) or false (fail). They run against a single context object built from the completed run.
Available Context Variables
| Variable | Type | Description |
|---|---|---|
assignment | object | {status, outcome, is_test} — the assignment that the run produced |
chat | object | {channel, message_count, duration_seconds} — chat-level summary |
assignment_tasks | object | {visited: [str], visited_count: int} — which tasks (steps) were visited |
messages | list | Flattened transcript — see below |
scenario | object | {slug, name, goal, channel} |
testChatRun | object | {driverMode: "auto"|"human", driverMemberId: int|null} |
The messages list
The most powerful variable for tool-call and ordering checks. Each entry has:
| Field | Type | Description |
|---|---|---|
index | int | Position in the conversation (0-based) |
role | string | "user", "assistant", "operator", or "tool_result" |
content | string | Text content |
tool_calls | list | Tool calls made in this message; each has {name} |
tool_name | string | For tool_result entries, the name of the tool that produced this result |
All fields are always present (empty string or empty list when not applicable), so you can access any field without null checks.
Example Expressions
Outcome and shape checks
cel
cel
cel
cel
Tool call existence
The agent must have called forward_call at some point:
cel
Ordering — tool A before tool B
Identity must be verified before medical info is shared:
cel
Negative assertions
The agent must never escalate:
cel
The agent must never use a forbidden word (case-insensitive):
cel
Tool result inspection
A specific tool returned a confirmed result:
cel
Counting
No more than 3 search calls per conversation:
cel
Combined real-world example
The agent must look up the appointment, then reschedule it, confirm the change to the caller, and never escalate:
cel
CEL Quick Reference
| Function | Description |
|---|---|
size(list) / size(string) | Length of a list or string |
string.contains("substr") | True if the string contains the substring |
string.matches("regex") | True if the string matches the regex; (?i) for case-insensitive |
list.exists(x, condition) | True if any element satisfies the condition |
list.all(x, condition) | True if every element satisfies the condition |
list.filter(x, condition) | New list of elements that satisfy the condition |
"value" in list | True if the value is contained in the list |
Writing Rubric Criteria (Jinja2)
Rubric criteria use a Jinja2 template that is rendered with the same context as expressions, then sent to the LLM grader. The grader returns a score (0.0–1.0), a passed boolean, and short reasoning.
A rubric criterion passes when the grader's score is greater than or equal to the passing threshold you configure on the criterion (default 1.0).
Use rubric criteria for judgments that are too nuanced for an expression — empathy, tone, clinical clarity, adherence to a multi-step protocol.
Available Template Variables
The rubric prompt is rendered with the same context as expressions, plus a pretty transcript for easy iteration in templates:
| Variable | Description |
|---|---|
transcript | List of message dicts — {role, content, tool_calls?, tool_name?} |
assignment.status | The assignment's final status |
assignment.outcome | The assignment's outcome (e.g. "success", "failure") |
chat | {channel, message_count, duration_seconds} |
scenario | {slug, name, goal, channel} |
testChatRun | {driverMode, driverMemberId} |
The chat transcript is also automatically prepended to the prompt sent to the grader, so you don't have to embed it manually unless you want a custom format.
Example Rubric Prompts
Empathy and clarity
jinja2
Protocol adherence
jinja2
Custom transcript formatting
If you need to render the transcript yourself (e.g. to filter out tool calls):
jinja2
Permissions
Qualifications, scenarios, personas, and test chat runs all use the standard members scopes:
| Action | Required Scope |
|---|---|
| View qualifications, scenarios, personas, runs | members:read |
| Create/edit qualifications, scenarios, personas; start runs; trigger evaluation | members:write |
Tips
- Start with one expression criterion.
assignment.outcome == "success"catches most regressions. Add more as you find scenarios that pass for the wrong reasons. - Use
messages.exists(...)for tool-call checks. It gives you ordering context and per-call argument inspection that aggregate variables can't. - Use
(?i)inmatches()for case-insensitive content checks. - Test your CEL expressions against a real run before relying on them. The Runs tab shows whether each criterion passed or failed.
- Reach for rubric criteria sparingly. They cost LLM spend on every evaluation, and the grader is helpful but not infallible. Expressions are free and deterministic — prefer them when the judgment is mechanical.
- Keep personas reusable. A persona shouldn't bake in a specific goal — that belongs on the scenario. The same "67-year-old caller refilling a prescription" persona can power happy-path, suspicious-caller, and angry-caller scenarios.
Test Data Safety
Test chat runs never touch real patient data:
- Test members are always created fresh from the persona, marked as test-only, and isolated from production data.
- The transcript builder reads from test assignments only.
- Personas are templates, not real members.
Related
- Live Operator Mode — qualifications no longer gate routing; "who can answer" is configured on the workflow itself.
- CEL Expressions — write expression criteria for automated evaluation.
- Workflows — the workflows you build qualifications for.
Related Resources
Members
Add, invite, and manage the humans in your workspace — team and customers alike.
Importing Members
Import members in bulk via CSV upload or the API — format your file, handle duplicates, and troubleshoot errors.
Anonymous & Test Members
Manage unknown callers and sandbox accounts — review the anonymous queue, resolve to real members, expire stale entries, and create test members.
All Guides
Browse all available guides