Qualifications

Author and run persona-driven test chat scenarios for a workflow, with optional automated grading.

Intermediate
12 min read

Qualifications

Qualifications let you build a per-workflow test suite: a set of persona-driven chat scenarios that exercise the workflow end to end, plus optional grading criteria that score how each run went.

Use qualifications to iterate on a workflow with confidence — before changing the agent prompt, the routing rules, or a tool definition, you can replay your scenario suite, see which runs still pass, and inspect any regressions in the conversation transcript.

There is no member enrollment, submission, or admin approval. Qualifications are an authoring and testing surface, not a credentialing system.

Key Concepts

Qualification

A qualification is a thin container that holds the scenarios for one workflow. Each workflow has at most one qualification, and the qualification is auto-created the first time you open the workflow's Evaluate view — you don't create them manually.

Persona

A persona is a reusable test identity scoped to your workspace. Personas are not real patients or members — they are templates for the simulated person on the other side of the conversation.

A persona has:

  • Name, description, optional age
  • Channel capabilities — whether the persona has a phone and/or email
  • Member role — which workspace role the persona maps to when instantiated
  • Identity — the prompt fed to the user-simulator LLM that "plays" this persona during automated runs. Write it in second person ("You are…") with personality traits, goals, and constraints
  • Default data records — pre-populated data the test member starts with, keyed by data-type slug

Personas live under the Personas tab of the Evaluate view. The same persona can be used across many scenarios and many workflows.

Scenario

A scenario is a specific test under a qualification. It composes a persona with the rest of what the test needs:

  • Name, optional description, slug (unique per qualification)
  • Persona — the reusable identity
  • Goal — what the persona is trying to achieve in this scenario
  • Channel — the channel the conversation runs on (phone, web, etc.)
  • Data records — scenario-specific seed data that overrides the persona's defaults per data-type slug
  • Grading criteria — zero or more pass/fail rules (see below)

Scenarios are ordered within a qualification and can be archived when no longer needed.

Grading Criteria

Each scenario can have any number of grading criteria. Two types:

  • Expression criteria — pass/fail using a CEL (Common Expression Language) expression evaluated automatically against the run's transcript and assignment data. Synchronous and free.
  • Rubric criteria — LLM-evaluated using a Jinja2 prompt template you write. The grader returns a 0.0–1.0 score, a pass/fail, and brief reasoning. Asynchronous and uses model spend.

Each criterion has a weight (used for the run's aggregate score) and a passing threshold (used by rubric criteria to decide pass/fail from the LLM score).

Test Chat Run

A test chat run is one execution of an automated test. Every run is anchored to a workflow; scenario linkage is optional:

  • Scenario-bound runs are launched from a scenario and can be evaluated against its criteria.
  • Ad-hoc runs are launched directly from the Test tab with just a persona — no scenario, no grading, just a quick chat to poke at the workflow.

Each run records the chat, the assignment, start/complete/fail timestamps, and (after evaluation) an aggregate score and pass/fail.

A run is driven in one of two modes:

  • Auto — an LLM user-simulator plays the persona end to end.
  • Human — you (the author) type as the persona in the Test tab. Useful when you want to drive the conversation by hand.

The Evaluate View

Open a workflow and switch to the Evaluate view. The view has four tabs:

  • Test — start a new run, scenario-bound or ad-hoc, and watch the conversation live. Use this for quick iteration while authoring.
  • Personas — workspace-scoped persona CRUD. Create a persona once, reuse it across every workflow.
  • Scenarios — the scenarios for this workflow's qualification. Author goals, seed data, and grading criteria here.
  • Runs — history of test chat runs, grouped by scenario. Each row shows the score, pass/fail, and an Evaluate button to (re)run grading on a completed run.

Setting Up a Test Suite

The typical authoring loop:

1. Create a Persona

Go to Personas in the Evaluate view, click Create, and fill in:

  • A name and description
  • Optional age
  • Whether the persona has phone/email
  • Which member role they map to
  • The identity prompt (second person — "You are…")
  • Optional default data records keyed by data-type slug

Personas are workspace-scoped, so this only happens once per persona.

2. Create a Scenario

In the Scenarios tab, click Create Scenario and:

  1. Pick the persona you just made (or any existing one)
  2. Write the goal — what is the persona trying to do in this scenario?
  3. Pick the channel
  4. Optionally seed data records that override the persona's defaults
  5. Add grading criteria (see next step)

3. Add Grading Criteria

For each criterion:

  1. Pick Expression or Rubric
  2. Give it a name and optional description
  3. For expression: write a CEL expression (see CEL Reference)
  4. For rubric: write a Jinja2 prompt template (see Rubric Reference)
  5. Set the weight (for aggregate scoring) and, for rubric criteria, the passing threshold

You can add criteria later — a scenario with no criteria runs fine, you just won't get a score.

4. Run It

From the Test tab, pick the scenario and start a run. Choose:

  • Auto driver — let the LLM user-simulator play the persona end to end.
  • Human driver — drive the conversation yourself by typing as the persona.

Watch the chat unfold. When the run completes, you can fire grading from the Runs tab.

Evaluation Pipeline

When you trigger evaluation on a completed run:

  1. Expression criteria run first. They are deterministic and free.
  2. If any expression criterion fails, rubric criteria are skipped. Rubric calls cost LLM money — there's no point asking the grader for nuance when a hard rule already failed.
  3. Rubric criteria run otherwise. Each one is rendered against the same context as expressions, sent to the grader, and the grader returns a score, pass/fail, and reasoning.
  4. Aggregate score and pass/fail are written back to the run. overall_score is the weighted mean of all scored criteria; passed is true only when every criterion passed.

Each criterion produces one CriterionResult row with its score, pass/fail, and (for rubric criteria) the grader's reasoning.

Only scenario-bound runs are evaluatable. Ad-hoc runs have no criteria and can't be graded.

Writing Expression Criteria (CEL)

Expression criteria are CEL expressions that evaluate to true (pass) or false (fail). They run against a single context object built from the completed run.

Available Context Variables

VariableTypeDescription
assignmentobject{status, outcome, is_test} — the assignment that the run produced
chatobject{channel, message_count, duration_seconds} — chat-level summary
assignment_tasksobject{visited: [str], visited_count: int} — which tasks (steps) were visited
messageslistFlattened transcript — see below
scenarioobject{slug, name, goal, channel}
testChatRunobject{driverMode: "auto"|"human", driverMemberId: int|null}

The messages list

The most powerful variable for tool-call and ordering checks. Each entry has:

FieldTypeDescription
indexintPosition in the conversation (0-based)
rolestring"user", "assistant", "operator", or "tool_result"
contentstringText content
tool_callslistTool calls made in this message; each has {name}
tool_namestringFor tool_result entries, the name of the tool that produced this result

All fields are always present (empty string or empty list when not applicable), so you can access any field without null checks.

Example Expressions

Outcome and shape checks

cel
cel
cel
cel

Tool call existence

The agent must have called forward_call at some point:

cel

Ordering — tool A before tool B

Identity must be verified before medical info is shared:

cel

Negative assertions

The agent must never escalate:

cel

The agent must never use a forbidden word (case-insensitive):

cel

Tool result inspection

A specific tool returned a confirmed result:

cel

Counting

No more than 3 search calls per conversation:

cel

Combined real-world example

The agent must look up the appointment, then reschedule it, confirm the change to the caller, and never escalate:

cel

CEL Quick Reference

FunctionDescription
size(list) / size(string)Length of a list or string
string.contains("substr")True if the string contains the substring
string.matches("regex")True if the string matches the regex; (?i) for case-insensitive
list.exists(x, condition)True if any element satisfies the condition
list.all(x, condition)True if every element satisfies the condition
list.filter(x, condition)New list of elements that satisfy the condition
"value" in listTrue if the value is contained in the list

Writing Rubric Criteria (Jinja2)

Rubric criteria use a Jinja2 template that is rendered with the same context as expressions, then sent to the LLM grader. The grader returns a score (0.0–1.0), a passed boolean, and short reasoning.

A rubric criterion passes when the grader's score is greater than or equal to the passing threshold you configure on the criterion (default 1.0).

Use rubric criteria for judgments that are too nuanced for an expression — empathy, tone, clinical clarity, adherence to a multi-step protocol.

Available Template Variables

The rubric prompt is rendered with the same context as expressions, plus a pretty transcript for easy iteration in templates:

VariableDescription
transcriptList of message dicts — {role, content, tool_calls?, tool_name?}
assignment.statusThe assignment's final status
assignment.outcomeThe assignment's outcome (e.g. "success", "failure")
chat{channel, message_count, duration_seconds}
scenario{slug, name, goal, channel}
testChatRun{driverMode, driverMemberId}

The chat transcript is also automatically prepended to the prompt sent to the grader, so you don't have to embed it manually unless you want a custom format.

Example Rubric Prompts

Empathy and clarity

jinja2

Protocol adherence

jinja2

Custom transcript formatting

If you need to render the transcript yourself (e.g. to filter out tool calls):

jinja2

Permissions

Qualifications, scenarios, personas, and test chat runs all use the standard members scopes:

ActionRequired Scope
View qualifications, scenarios, personas, runsmembers:read
Create/edit qualifications, scenarios, personas; start runs; trigger evaluationmembers:write

Tips

  • Start with one expression criterion. assignment.outcome == "success" catches most regressions. Add more as you find scenarios that pass for the wrong reasons.
  • Use messages.exists(...) for tool-call checks. It gives you ordering context and per-call argument inspection that aggregate variables can't.
  • Use (?i) in matches() for case-insensitive content checks.
  • Test your CEL expressions against a real run before relying on them. The Runs tab shows whether each criterion passed or failed.
  • Reach for rubric criteria sparingly. They cost LLM spend on every evaluation, and the grader is helpful but not infallible. Expressions are free and deterministic — prefer them when the judgment is mechanical.
  • Keep personas reusable. A persona shouldn't bake in a specific goal — that belongs on the scenario. The same "67-year-old caller refilling a prescription" persona can power happy-path, suspicious-caller, and angry-caller scenarios.

Test Data Safety

Test chat runs never touch real patient data:

  • Test members are always created fresh from the persona, marked as test-only, and isolated from production data.
  • The transcript builder reads from test assignments only.
  • Personas are templates, not real members.
  • Live Operator Mode — qualifications no longer gate routing; "who can answer" is configured on the workflow itself.
  • CEL Expressions — write expression criteria for automated evaluation.
  • Workflows — the workflows you build qualifications for.