Evaluate
The evaluation step assesses the risk of an agent's output and decides whether to auto-approve, escalate for human review, or reject.
How Risk Scoring Works
When you call evaluate(), the Coalex platform:
- Looks up the agent's health score — based on historical evaluation outcomes
- Computes a risk score (0.0 – 1.0) — higher scores indicate higher risk
- Applies policy thresholds — compares the risk score against the agent's configured thresholds
- Returns a decision: `auto_approved`, `escalated`, or `rejected`
graph LR
    A[evaluate call] --> B[Health Score Lookup]
    B --> C[Risk Score Computation]
    C --> D{Policy Thresholds}
    D -->|Low risk| E[auto_approved]
    D -->|Medium risk| F[escalated]
    D -->|High risk| G[rejected]
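The threshold step in the flow above can be sketched as a small pure function. This is an illustration under stated assumptions, not Coalex's implementation: the `Thresholds` class, the `decide` function, the cutoff values 0.3 and 0.7, and the choice to treat a score exactly on a boundary as the higher-risk outcome are all hypothetical. The real cutoffs come from the agent's configured policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical policy cutoffs; real values come from the agent's configuration."""
    escalate_at: float = 0.3  # assumed: scores at or above this need human review
    reject_at: float = 0.7    # assumed: scores at or above this are rejected


def decide(risk_score: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a risk score in [0.0, 1.0] to one of the three decision statuses."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError(f"risk_score must be in [0.0, 1.0], got {risk_score}")
    if risk_score >= thresholds.reject_at:
        return "rejected"
    if risk_score >= thresholds.escalate_at:
        return "escalated"
    return "auto_approved"


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, decide(score))
```

Checking the highest-risk threshold first keeps the function correct even if a policy sets the two cutoffs close together.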
Declaring Metrics
Metrics are declared at evaluate-time but computed at resolve-time. This means you specify which metrics you want, and Coalex computes them when a human reviewer provides corrections.
decision = coalex.evaluate(
    request_id="req-123",
    input={"question": "What is the diagnosis?"},
    output={
        "diagnosis": "Acute bronchitis",
        "icd_code": "J20.9",
    },
    # Metrics are only declared here; they are computed at resolve-time.
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
    },
)
Each key in metrics must match a key in output. The values are lists of metric function names.
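Because a mismatched key is easy to introduce, it can help to check the two dictionaries before calling `evaluate()`. The helper below is a hypothetical client-side sketch, not part of the Coalex SDK: the name `validate_metrics` is an assumption, and `KNOWN_METRICS` simply copies the metric names listed on this page.

```python
# Metric names taken from the "Available Metrics" table on this page.
KNOWN_METRICS = {
    "semantic_similarity", "f1", "rouge_l", "bleu",
    "exact_match", "contains", "word_overlap", "levenshtein",
}


def validate_metrics(output: dict, metrics: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the declaration looks consistent."""
    problems = []
    for field, names in metrics.items():
        # Every metrics key must correspond to an output field.
        if field not in output:
            problems.append(f"metrics key {field!r} has no matching key in output")
        # Every value must be a list of known metric names.
        for name in names:
            if name not in KNOWN_METRICS:
                problems.append(f"unknown metric {name!r} for field {field!r}")
    return problems


if __name__ == "__main__":
    output = {"diagnosis": "Acute bronchitis", "icd_code": "J20.9"}
    metrics = {"diagnosis": ["semantic_similarity", "f1"], "icd_code": ["exact_match"]}
    print(validate_metrics(output, metrics))
```

Returning a list of problems instead of raising on the first one lets a caller report every mistake in a single pass.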
Available Metrics
| Metric | Best for |
|---|---|
| `semantic_similarity` | Free-form text where meaning matters |
| `f1` | Extractive QA with token overlap |
| `rouge_l` | Summary-style text |
| `bleu` | Translation-style outputs |
| `exact_match` | Codes, IDs, categories |
| `contains` | Checking for required phrases |
| `word_overlap` | Keyword presence |
| `levenshtein` | Edit distance comparison |
See the Metrics Catalog for full details.
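To give a feel for what some of these names conventionally measure, here are textbook versions of three of them. These are illustrative definitions only: the function names are hypothetical, and Coalex's own implementations may differ in normalization, tokenization, and scaling.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(exact_match("J20.9", "J20.9"))
    print(token_f1("acute bronchitis", "chronic bronchitis"))
    print(levenshtein("J20.9", "J20.8"))
```

The comparison shows why codes suit `exact_match` while free text does not: a one-character difference in an ICD code is a different code, whereas a partially overlapping diagnosis can still earn partial credit under `f1`.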
Decision Statuses
| Status | Risk Level | Action |
|---|---|---|
| `auto_approved` | Low | Safe to serve to the user |
| `escalated` | Medium | Requires human review before serving |
| `rejected` | High | Do not serve; use a fallback |
All evaluations (including auto-approved ones) are stored in the escalations table to provide a full audit trail.
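In application code, the three statuses typically map to three branches. The sketch below assumes the status is available as a plain string; the shape of the object returned by `evaluate()` is not shown on this page, so the `route` function and its parameters are hypothetical.

```python
from typing import Optional


def route(status: str, output: dict, fallback: dict) -> Optional[dict]:
    """Pick what to serve for a given decision status.

    Returns the payload to serve, or None when nothing may be served yet.
    """
    if status == "auto_approved":
        return output      # low risk: safe to serve as-is
    if status == "escalated":
        return None        # medium risk: hold until a human has reviewed it
    if status == "rejected":
        return fallback    # high risk: never serve the agent's output
    raise ValueError(f"unexpected decision status: {status!r}")


if __name__ == "__main__":
    agent_output = {"diagnosis": "Acute bronchitis"}
    safe_default = {"diagnosis": None, "note": "Please consult a clinician."}
    print(route("rejected", agent_output, safe_default))
```

Raising on an unknown status, instead of defaulting to serving the output, means an unrecognized value fails closed.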
Multi-Field Evaluation
Evaluate multiple output fields independently:
decision = coalex.evaluate(
    request_id="req-456",
    input={"claim_text": "Patient reports knee pain after fall."},
    output={
        "diagnosis": "Acute knee injury, likely meniscus tear",
        "icd_code": "S83.2",
        "recommendation": "MRI recommended within 48 hours",
    },
    # One list of metrics per output field; each key matches a key in output.
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
        "recommendation": ["semantic_similarity", "contains"],
    },
)
Metadata
Attach arbitrary metadata for filtering and analysis:
decision = coalex.evaluate(
    request_id="req-789",
    input={...},
    output={...},
    metrics={...},
    metadata={
        "model": "gpt-4o",
        "prompt_version": "v2.1",
        "user_segment": "enterprise",
    },
)
SDK Reference
- `evaluate()`: Full API reference
- `resolve()`: Resolve escalated evaluations