Qualifications

Qualifications let you build a per-workflow test suite: a set of persona-driven chat scenarios that exercise the workflow end to end, plus optional grading criteria that score how each run went.

Use qualifications to iterate on a workflow with confidence — before changing the agent prompt, the routing rules, or a tool definition, you can replay your scenario suite, see which runs still pass, and inspect any regressions in the conversation transcript.

There is no member enrollment, submission, or admin approval. Qualifications are an authoring and testing surface, not a credentialing system.

Key Concepts

Qualification

A qualification is a thin container that holds the scenarios for one workflow. Each workflow has at most one qualification, and the qualification is auto-created the first time you open the workflow's Evaluate view — you don't create them manually.

Persona

A persona is a reusable test identity scoped to your workspace. Personas are not real patients or members — they are templates for the simulated person on the other side of the conversation.

A persona has:

Name, description, optional age
Channel capabilities — whether the persona has a phone and/or email
Member role — which workspace role the persona maps to when instantiated
Identity — the prompt fed to the user-simulator LLM that "plays" this persona during automated runs. Write it in second person ("You are…") with personality traits, goals, and constraints
Default data records — pre-populated data the test member starts with, keyed by data-type slug

Personas live under the Personas tab of the Evaluate view. The same persona can be used across many scenarios and many workflows.

Scenario

A scenario is a specific test under a qualification. It composes a persona with the rest of what the test needs:

Name, optional description, slug (unique per qualification)
Persona — the reusable identity
Goal — what the persona is trying to achieve in this scenario
Channel — the channel the conversation runs on (phone, web, etc.)
Data records — scenario-specific seed data that overrides the persona's defaults per data-type slug
Grading criteria — zero or more pass/fail rules (see below)

Scenarios are ordered within a qualification and can be archived when no longer needed.

Grading Criteria

Each scenario can have any number of grading criteria. Two types:

Expression criteria — pass/fail using a CEL (Common Expression Language) expression evaluated automatically against the run's transcript and assignment data. Synchronous and free.
Rubric criteria — LLM-evaluated using a Jinja2 prompt template you write. The grader returns a 0.0–1.0 score, a pass/fail, and brief reasoning. Asynchronous and uses model spend.

Each criterion has a weight (used for the run's aggregate score) and a passing threshold (used by rubric criteria to decide pass/fail from the LLM score).

Test Chat Run

A test chat run is one execution of an automated test. Every run is anchored to a workflow; scenario linkage is optional:

Scenario-bound runs are launched from a scenario and can be evaluated against its criteria.
Ad-hoc runs are launched directly from the Test tab with just a persona — no scenario, no grading, just a quick chat to poke at the workflow.

Each run records the chat, the assignment, start/complete/fail timestamps, and (after evaluation) an aggregate score and pass/fail.

A run is driven in one of two modes:

Auto — an LLM user-simulator plays the persona end to end.
Human — you (the author) type as the persona in the Test tab. Useful when you want to drive the conversation by hand.

The Evaluate View

Open a workflow and switch to the Evaluate view. The view has four tabs:

Test — start a new run, scenario-bound or ad-hoc, and watch the conversation live. Use this for quick iteration while authoring.
Personas — workspace-scoped persona CRUD. Create a persona once, reuse it across every workflow.
Scenarios — the scenarios for this workflow's qualification. Author goals, seed data, and grading criteria here.
Runs — history of test chat runs, grouped by scenario. Each row shows the score, pass/fail, and an Evaluate button to (re)run grading on a completed run.

Setting Up a Test Suite

The typical authoring loop:

1. Create a Persona

Go to Personas in the Evaluate view, click Create, and fill in:

A name and description
Optional age
Whether the persona has phone/email
Which member role they map to
The identity prompt (second person — "You are…")
Optional default data records keyed by data-type slug

Personas are workspace-scoped, so this only happens once per persona.

2. Create a Scenario

In the Scenarios tab, click Create Scenario and:

Pick the persona you just made (or any existing one)
Write the goal — what is the persona trying to do in this scenario?
Pick the channel
Optionally seed data records that override the persona's defaults
Add grading criteria (see next step)

3. Add Grading Criteria

For each criterion:

Pick Expression or Rubric
Give it a name and optional description
For expression: write a CEL expression (see CEL Reference)
For rubric: write a Jinja2 prompt template (see Rubric Reference)
Set the weight (for aggregate scoring) and, for rubric criteria, the passing threshold

You can add criteria later — a scenario with no criteria runs fine, you just won't get a score.

4. Run It

From the Test tab, pick the scenario and start a run. Choose:

Auto driver — let the LLM user-simulator play the persona end to end.
Human driver — drive the conversation yourself by typing as the persona.

Watch the chat unfold. When the run completes, you can fire grading from the Runs tab.

Evaluation Pipeline

When you trigger evaluation on a completed run:

Expression criteria run first. They are deterministic and free.
If any expression criterion fails, rubric criteria are skipped. Rubric calls cost LLM money — there's no point asking the grader for nuance when a hard rule already failed.
Rubric criteria run otherwise. Each one is rendered against the same context as expressions, sent to the grader, and the grader returns a score, pass/fail, and reasoning.
Aggregate score and pass/fail are written back to the run. overall_score is the weighted mean of all scored criteria; passed is true only when every criterion passed.

Each criterion produces one CriterionResult row with its score, pass/fail, and (for rubric criteria) the grader's reasoning.

Only scenario-bound runs are evaluatable. Ad-hoc runs have no criteria and can't be graded.

Writing Expression Criteria (CEL)

Expression criteria are CEL expressions that evaluate to true (pass) or false (fail). They run against a single context object built from the completed run.

Available Context Variables

Variable	Type	Description
`assignment`	object	`{status, outcome, is_test}` — the assignment that the run produced
`chat`	object	`{channel, message_count, duration_seconds}` — chat-level summary
`assignment_tasks`	object	`{visited: [str], visited_count: int}` — which tasks (steps) were visited
`messages`	list	Flattened transcript — see below
`scenario`	object	`{slug, name, goal, channel}`
`testChatRun`	object	`{driverMode: "auto"\|"human", driverMemberId: int\|null}`

The `messages` list

The most powerful variable for tool-call and ordering checks. Each entry has:

Field	Type	Description
`index`	int	Position in the conversation (0-based)
`role`	string	`"user"`, `"assistant"`, `"operator"`, or `"tool_result"`
`content`	string	Text content
`tool_calls`	list	Tool calls made in this message; each has `{name}`
`tool_name`	string	For `tool_result` entries, the name of the tool that produced this result

All fields are always present (empty string or empty list when not applicable), so you can access any field without null checks.

Example Expressions

Outcome and shape checks

cel

cel

cel

cel

Tool call existence

The agent must have called forward_call at some point:

cel

Ordering — tool A before tool B

Identity must be verified before medical info is shared:

cel

Negative assertions

The agent must never escalate:

cel

The agent must never use a forbidden word (case-insensitive):

cel

Tool result inspection

A specific tool returned a confirmed result:

cel

Counting

No more than 3 search calls per conversation:

cel

Combined real-world example

The agent must look up the appointment, then reschedule it, confirm the change to the caller, and never escalate:

cel

CEL Quick Reference

Function	Description
`size(list)` / `size(string)`	Length of a list or string
`string.contains("substr")`	True if the string contains the substring
`string.matches("regex")`	True if the string matches the regex; `(?i)` for case-insensitive
`list.exists(x, condition)`	True if any element satisfies the condition
`list.all(x, condition)`	True if every element satisfies the condition
`list.filter(x, condition)`	New list of elements that satisfy the condition
`"value" in list`	True if the value is contained in the list

Writing Rubric Criteria (Jinja2)

Rubric criteria use a Jinja2 template that is rendered with the same context as expressions, then sent to the LLM grader. The grader returns a score (0.0–1.0), a passed boolean, and short reasoning.

A rubric criterion passes when the grader's score is greater than or equal to the passing threshold you configure on the criterion (default 1.0).

Use rubric criteria for judgments that are too nuanced for an expression — empathy, tone, clinical clarity, adherence to a multi-step protocol.

Available Template Variables

The rubric prompt is rendered with the same context as expressions, plus a pretty transcript for easy iteration in templates:

Variable	Description
`transcript`	List of message dicts — `{role, content, tool_calls?, tool_name?}`
`assignment.status`	The assignment's final status
`assignment.outcome`	The assignment's outcome (e.g. `"success"`, `"failure"`)
`chat`	`{channel, message_count, duration_seconds}`
`scenario`	`{slug, name, goal, channel}`
`testChatRun`	`{driverMode, driverMemberId}`

The chat transcript is also automatically prepended to the prompt sent to the grader, so you don't have to embed it manually unless you want a custom format.

Example Rubric Prompts

Empathy and clarity

jinja2

Protocol adherence

jinja2

Custom transcript formatting

If you need to render the transcript yourself (e.g. to filter out tool calls):

jinja2

Permissions

Qualifications, scenarios, personas, and test chat runs all use the standard members scopes:

Action	Required Scope
View qualifications, scenarios, personas, runs	`members:read`
Create/edit qualifications, scenarios, personas; start runs; trigger evaluation	`members:write`

Tips

Start with one expression criterion. assignment.outcome == "success" catches most regressions. Add more as you find scenarios that pass for the wrong reasons.
Use messages.exists(...) for tool-call checks. It gives you ordering context and per-call argument inspection that aggregate variables can't.
Use (?i) in matches() for case-insensitive content checks.
Test your CEL expressions against a real run before relying on them. The Runs tab shows whether each criterion passed or failed.
Reach for rubric criteria sparingly. They cost LLM spend on every evaluation, and the grader is helpful but not infallible. Expressions are free and deterministic — prefer them when the judgment is mechanical.
Keep personas reusable. A persona shouldn't bake in a specific goal — that belongs on the scenario. The same "67-year-old caller refilling a prescription" persona can power happy-path, suspicious-caller, and angry-caller scenarios.

Test Data Safety

Test chat runs never touch real patient data:

Test members are always created fresh from the persona, marked as test-only, and isolated from production data.
The transcript builder reads from test assignments only.
Personas are templates, not real members.

Live Operator Mode — qualifications no longer gate routing; "who can answer" is configured on the workflow itself.
CEL Expressions — write expression criteria for automated evaluation.
Workflows — the workflows you build qualifications for.

Related Resources

Members

Add, invite, and manage the humans in your workspace — team and customers alike.

Importing Members

Import members in bulk via CSV upload or the API — format your file, handle duplicates, and troubleshoot errors.

Anonymous & Test Members

Manage unknown callers and sandbox accounts — review the anonymous queue, resolve to real members, expire stale entries, and create test members.

All Guides

Browse all available guides

Qualifications

Qualifications

Key Concepts

Qualification

Persona

Scenario

Grading Criteria

Test Chat Run

The Evaluate View

Setting Up a Test Suite

1. Create a Persona

2. Create a Scenario

3. Add Grading Criteria

4. Run It

Evaluation Pipeline

Writing Expression Criteria (CEL)

Available Context Variables

The messages list

Example Expressions

Outcome and shape checks

Tool call existence

Ordering — tool A before tool B

Negative assertions

Tool result inspection

Counting

Combined real-world example

CEL Quick Reference

Writing Rubric Criteria (Jinja2)

Available Template Variables

Example Rubric Prompts

Empathy and clarity

Protocol adherence

Custom transcript formatting

Permissions

Tips

Test Data Safety

Related

Related Resources

Members

Importing Members

Anonymous & Test Members

All Guides

The `messages` list