evaluate()¶
Submit agent output for risk-based evaluation. The Coalex platform computes quality metrics, assigns a risk score, and returns an automated decision: auto-approve, escalate for human review, or reject.
Signature¶
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `request_id` | `str` | required | Links this evaluation to an observed trace. Use the same request ID passed to `coalex_context()`. |
| `input` | `dict` | required | The agent's input data (e.g., user question, context documents). |
| `output` | `dict` | required | The agent's actual output (e.g., generated answer). |
| `metrics` | `dict[str, list[str]]` | required | Map of output field names to lists of metric function names. Metrics are declared at evaluate-time and computed at resolve-time from human corrections. |
| `metadata` | `dict \| None` | `None` | Arbitrary metadata to attach to the evaluation (e.g., model name, prompt version, user segment). |
All parameters are keyword-only (enforced by *).
Renamed from trace_id
The trace_id parameter was renamed to request_id in v1.0.0 to better reflect its purpose. The old trace_id parameter is still accepted for backward compatibility but is deprecated.
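As an illustration of the rename, the sketch below shows one common way a keyword-only function can keep accepting a deprecated alias. The helper name `resolve_request_id`, the precedence rule, and the warning text are assumptions made for this example; the SDK's actual handling is not shown on this page.

```python
from __future__ import annotations

import warnings


def resolve_request_id(
    *, request_id: str | None = None, trace_id: str | None = None
) -> str:
    """Hypothetical helper: pick the effective ID from the new and old keywords."""
    if trace_id is not None:
        # Signal the deprecation to callers still passing the old keyword.
        warnings.warn(
            "trace_id is deprecated; use request_id instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # Assumption: the new name wins when both keywords are supplied.
    effective = request_id if request_id is not None else trace_id
    if effective is None:
        raise ValueError("request_id is required")
    return effective
```

New code should pass `request_id` only; the alias path exists so that older call sites keep working while they are migrated.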
Requires register() first
evaluate() uses the endpoint and API key configured by register(). Calling it before register() raises RuntimeError.
Returns¶
EvaluationDecision¶
| Field | Type | Description |
|---|---|---|
| `status` | `str` | The evaluation decision. One of `"auto_approved"`, `"escalated"`, or `"rejected"`. |
| `risk_score` | `float` | Risk score between 0.0 (no risk) and 1.0 (maximum risk). |
| `escalation_id` | `str \| None` | Unique escalation identifier. Present only when `status == "escalated"`. Use this with `resolve()` to submit human review. |
Known Metrics¶
The metrics parameter accepts the following metric function names. Unknown metric names raise ValueError at call time.
| Metric | Description |
|---|---|
| `f1` | Token-level F1 score. Measures precision and recall of overlapping tokens between output and expected. |
| `word_overlap` | Fraction of words in the expected output that appear in the actual output. |
| `bleu` | BLEU score. Standard machine translation metric measuring n-gram overlap. |
| `rouge_l` | ROUGE-L score. Longest common subsequence-based metric. |
| `semantic_similarity` | Cosine similarity between embedding vectors of the output and expected text. |
| `exact_match` | Binary metric: 1.0 if output exactly matches expected, 0.0 otherwise. |
| `levenshtein` | Normalized Levenshtein distance (0.0 = identical, 1.0 = completely different). |
| `contains` | Binary metric: 1.0 if the expected text is contained within the output, 0.0 otherwise. |
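To make the simpler definitions in the table concrete, here is a minimal local sketch of five of them. These are illustrative reimplementations, not the platform's code: whitespace tokenization, case sensitivity, and normalizing Levenshtein distance by the longer string's length are all assumptions.

```python
from collections import Counter


def f1(output: str, expected: str) -> float:
    """Token-level F1 over whitespace tokens (tokenization is an assumption)."""
    out_tokens, exp_tokens = output.split(), expected.split()
    if not out_tokens or not exp_tokens:
        return float(out_tokens == exp_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def word_overlap(output: str, expected: str) -> float:
    """Fraction of distinct expected words that also appear in the output."""
    expected_words = set(expected.split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(output.split())) / len(expected_words)


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def contains(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0


def levenshtein(output: str, expected: str) -> float:
    """Edit distance divided by the longer length (normalization is an assumption)."""
    if not output and not expected:
        return 0.0
    previous = list(range(len(expected) + 1))
    for i, a in enumerate(output, start=1):
        current = [i]
        for j, b in enumerate(expected, start=1):
            cost = 0 if a == b else 1
            # Minimum of deletion, insertion, and substitution.
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1] / max(len(output), len(expected))
```

Note that `levenshtein` is a distance, so lower is better, while the other four are similarity scores where higher is better.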
Choosing metrics
- Use `semantic_similarity` for free-form text where meaning matters more than exact wording.
- Use `f1` or `rouge_l` for extractive QA where token overlap is meaningful.
- Use `exact_match` for structured fields like codes, IDs, or categories.
- Use `contains` to check if specific phrases or keywords appear in the output.
- Combine multiple metrics per field for more robust evaluation.
Examples¶
Basic evaluation¶
```python
import coalex

decision = coalex.evaluate(
    request_id="abc-123-def",
    input={"question": "What is the maximum coverage?"},
    output={"answer": "The maximum coverage is $500,000."},
    metrics={"answer": ["f1", "semantic_similarity"]},
)
print(decision.status)      # "auto_approved"
print(decision.risk_score)  # 0.12
```

```typescript
import { evaluate } from "@coalex-ai/sdk";

const decision = await evaluate({
  requestId: "abc-123-def",
  input: { question: "What is the maximum coverage?" },
  output: { answer: "The maximum coverage is $500,000." },
  metrics: { answer: ["f1", "semantic_similarity"] },
});
console.log(decision.status); // "auto_approved"
console.log(decision.riskScore); // 0.12
```
Multi-field evaluation¶
```python
decision = coalex.evaluate(
    request_id="trace-789",
    input={
        "claim_text": "Patient reports knee pain after fall.",
    },
    output={
        "diagnosis": "Acute knee injury, likely meniscus tear",
        "icd_code": "S83.2",
        "recommendation": "MRI recommended within 48 hours",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
        "recommendation": ["semantic_similarity", "contains"],
    },
)
```

```typescript
const decision = await evaluate({
  requestId: "trace-789",
  input: {
    claim_text: "Patient reports knee pain after fall.",
  },
  output: {
    diagnosis: "Acute knee injury, likely meniscus tear",
    icd_code: "S83.2",
    recommendation: "MRI recommended within 48 hours",
  },
  metrics: {
    diagnosis: ["semantic_similarity", "f1"],
    icd_code: ["exact_match"],
    recommendation: ["semantic_similarity", "contains"],
  },
});
```
Handling escalations¶
```python
decision = coalex.evaluate(
    request_id="trace-456",
    input={"question": "Is this drug safe during pregnancy?"},
    output={"answer": "Yes, it is generally safe."},
    metrics={"answer": ["semantic_similarity", "f1"]},
)
if decision.status == "escalated":
    print(f"Escalated! ID: {decision.escalation_id}")
    print(f"Risk score: {decision.risk_score}")
    # Human review needed -- use coalex.resolve()
elif decision.status == "auto_approved":
    print("Output approved automatically.")
elif decision.status == "rejected":
    print("Output rejected -- do not serve to user.")
```

```typescript
const decision = await evaluate({
  requestId: "trace-456",
  input: { question: "Is this drug safe during pregnancy?" },
  output: { answer: "Yes, it is generally safe." },
  metrics: { answer: ["semantic_similarity", "f1"] },
});
if (decision.status === "escalated") {
  console.log(`Escalated! ID: ${decision.escalationId}`);
  console.log(`Risk score: ${decision.riskScore}`);
  // Human review needed -- use resolve()
} else if (decision.status === "auto_approved") {
  console.log("Output approved automatically.");
} else if (decision.status === "rejected") {
  console.log("Output rejected -- do not serve to user.");
}
```
Validation¶
- Unknown metrics: If any metric name in `metrics` is not in the known metrics list, `evaluate()` raises `ValueError` immediately without making a network call.
- Missing `register()`: Raises `RuntimeError` if `register()` has not been called.
- Network errors: HTTP errors from the Coalex API raise `httpx.HTTPStatusError`.
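The unknown-metric check can be mirrored locally, for instance to validate configuration before any SDK call is made. The sketch below is an assumption about how such a check might look: `KNOWN_METRICS` is copied from the Known Metrics table, and `validate_metrics` is a hypothetical helper, not part of the SDK.

```python
from __future__ import annotations

# Names copied from the Known Metrics table on this page.
KNOWN_METRICS = frozenset({
    "f1", "word_overlap", "bleu", "rouge_l",
    "semantic_similarity", "exact_match", "levenshtein", "contains",
})


def validate_metrics(metrics: dict[str, list[str]]) -> None:
    """Hypothetical pre-flight check: raise ValueError on any unknown metric name."""
    requested = {name for names in metrics.values() for name in names}
    unknown = sorted(requested - KNOWN_METRICS)
    if unknown:
        raise ValueError(f"Unknown metrics: {', '.join(unknown)}")
```

Because the check needs no network access, running it early (for example at application start-up) surfaces typos in metric names before any agent output is submitted.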
Notes¶
- The evaluation is performed server-side by the Coalex platform. The SDK sends the input, output, and metrics to the API and receives the decision.
- Metrics are declared at evaluate-time but computed at resolve-time from human corrections. When a human reviewer provides corrections via `resolve()`, metrics compare the original output against the corrections.
- All evaluations (including `auto_approved`) are stored in the escalations table for a full audit trail.
- The risk score is based on the agent's health score, not on metric comparisons at evaluate-time.
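The declare-now, compute-later flow in the notes above can be pictured with a small local sketch. Everything here is illustrative: `REGISTRY`, `score_against_corrections`, and the rule of skipping fields the reviewer did not correct are assumptions, and the real computation happens server-side.

```python
from __future__ import annotations

from typing import Callable


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def contains(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0


# Hypothetical registry mapping declared metric names to local functions.
REGISTRY: dict[str, Callable[[str, str], float]] = {
    "exact_match": exact_match,
    "contains": contains,
}


def score_against_corrections(
    output: dict, corrections: dict, metrics: dict[str, list[str]]
) -> dict[str, dict[str, float]]:
    """Apply each field's declared metrics to (original output, human correction)."""
    scores: dict[str, dict[str, float]] = {}
    for field, names in metrics.items():
        if field not in corrections:
            continue  # Assumption: uncorrected fields are not scored.
        scores[field] = {
            name: REGISTRY[name](str(output[field]), str(corrections[field]))
            for name in names
        }
    return scores
```

In this picture, the `metrics` map passed to `evaluate()` is only a declaration; scores exist once a reviewer's corrections are available to serve as the expected values.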
API Reference¶
coalex.evaluate ¶
SDK evaluate() -- submit agent output for risk assessment.
Classes¶
EvaluationDecision `dataclass` ¶
Result of an evaluate() call.
Source code in coalex/evaluate.py
Functions¶
evaluate ¶
```python
evaluate(
    *,
    request_id: str | None = None,
    trace_id: str | None = None,
    input: dict,
    output: dict,
    metrics: dict[str, list[str]] | None = None,
    metadata: dict | None = None,
    _config: CoalexConfig | None = None,
    _api_key: str | None = None,
) -> EvaluationDecision
```
Submit agent output for risk-based evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `request_id` | `str \| None` | Links to an observed request. | `None` |
| `trace_id` | `str \| None` | Deprecated; use `request_id` instead. | `None` |
| `input` | `dict` | Agent input data. | required |
| `output` | `dict` | Agent output data. | required |
| `metrics` | `dict[str, list[str]] \| None` | Map of output field names to metric function names (optional). | `None` |
| `metadata` | `dict \| None` | Metadata for field-specific quality focus (optional). | `None` |
| `_config` | `CoalexConfig \| None` | Override config (testing only). | `None` |
| `_api_key` | `str \| None` | Override API key (testing only). | `None` |