evaluate()

Submit agent output for risk-based evaluation. The Coalex platform computes quality metrics, assigns a risk score, and returns an automated decision: auto-approve, escalate for human review, or reject.


Signature

def evaluate(
    *,
    request_id: str,
    input: dict,
    output: dict,
    metrics: dict[str, list[str]],
    metadata: dict | None = None,
) -> EvaluationDecision

interface EvaluateOptions {
    requestId: string;
    input: Record<string, unknown>;
    output: Record<string, unknown>;
    metrics: Record<string, string[]>;
    metadata?: Record<string, unknown>;
}

async function evaluate(options: EvaluateOptions): Promise<EvaluationDecision>

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| request_id | str | required | Links this evaluation to an observed trace. Use the same request ID passed to coalex_context(). |
| input | dict | required | The agent's input data (e.g., user question, context documents). |
| output | dict | required | The agent's actual output (e.g., generated answer). |
| metrics | dict[str, list[str]] | required | Map of output field names to lists of metric function names. Metrics are declared at evaluate-time and computed at resolve-time from human corrections. |
| metadata | dict \| None | None | Arbitrary metadata to attach to the evaluation (e.g., model name, prompt version, user segment). |

All parameters are keyword-only (enforced by *).

Renamed from trace_id

The trace_id parameter was renamed to request_id in v1.0.0 to better reflect its purpose. The old trace_id parameter is still accepted for backward compatibility but is deprecated: passing it emits a DeprecationWarning, and it is ignored when request_id is also supplied.
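
The fallback can be sketched in isolation. The helper below, resolve_request_id, is a hypothetical stand-in (it is not part of the SDK) that mirrors the check shown in the source listing at the end of this page: trace_id is used only when request_id is absent, and omitting both raises TypeError.

```python
import warnings


def resolve_request_id(request_id=None, trace_id=None):
    """Hypothetical helper mirroring the trace_id fallback in evaluate()."""
    if trace_id is not None and request_id is None:
        # The old keyword still works, but callers are told to migrate.
        warnings.warn(
            "trace_id is deprecated, use request_id",
            DeprecationWarning,
            stacklevel=2,
        )
        request_id = trace_id
    if request_id is None:
        raise TypeError("evaluate() requires 'request_id' (or deprecated 'trace_id')")
    return request_id


# Record warnings explicitly; DeprecationWarning is often filtered out by default.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_id = resolve_request_id(trace_id="abc-123-def")

print(legacy_id)                    # abc-123-def
print(caught[0].category.__name__)  # DeprecationWarning
```

When both keywords are passed, request_id wins and no warning is emitted, so migrating call sites one at a time is safe.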

Requires register() first

evaluate() uses the endpoint and API key configured by register(). Calling it before register() raises RuntimeError.


Returns

EvaluationDecision

@dataclass(frozen=True)
class EvaluationDecision:
    status: str            # "auto_approved" | "escalated" | "rejected"
    risk_score: float      # 0.0 - 1.0
    escalation_id: str | None  # present when status == "escalated"

interface EvaluationDecision {
    readonly status: "auto_approved" | "escalated" | "rejected";
    readonly riskScore: number;        // 0.0 - 1.0
    readonly escalationId?: string;    // present when status === "escalated"
}

| Field | Type | Description |
| --- | --- | --- |
| status | str | The evaluation decision. One of "auto_approved", "escalated", or "rejected". |
| risk_score | float | Risk score between 0.0 (no risk) and 1.0 (maximum risk). |
| escalation_id | str \| None | Unique escalation identifier. Present only when status == "escalated". Use this with resolve() to submit human review. |

Known Metrics

The metrics parameter accepts the following metric function names. Unknown metric names raise ValueError at call time.

| Metric | Description |
| --- | --- |
| f1 | Token-level F1 score. Measures precision and recall of overlapping tokens between output and expected. |
| word_overlap | Fraction of words in the expected output that appear in the actual output. |
| bleu | BLEU score. Standard machine translation metric measuring n-gram overlap. |
| rouge_l | ROUGE-L score. Longest common subsequence-based metric. |
| semantic_similarity | Cosine similarity between embedding vectors of the output and expected text. |
| exact_match | Binary metric: 1.0 if output exactly matches expected, 0.0 otherwise. |
| levenshtein | Normalized Levenshtein distance (0.0 = identical, 1.0 = completely different). |
| contains | Binary metric: 1.0 if the expected text is contained within the output, 0.0 otherwise. |
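
To make the simpler definitions above concrete, here is a minimal local sketch of five of them. These are illustrative re-implementations and not the platform's code: the lowercase whitespace tokenizer, the set-based word overlap, and normalizing the edit distance by the longer string are all assumptions, so server-side scores may differ.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the real tokenizer may differ.
    return text.lower().split()


def token_f1(output: str, expected: str) -> float:
    out, exp = _tokens(output), _tokens(expected)
    if not out or not exp:
        return float(out == exp)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(out) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def word_overlap(output: str, expected: str) -> float:
    exp = set(_tokens(expected))
    if not exp:
        return 0.0
    return len(exp & set(_tokens(output))) / len(exp)


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def contains(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0


def normalized_levenshtein(output: str, expected: str) -> float:
    if output == expected:
        return 0.0
    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(output, start=1):
        curr = [i]
        for j, b in enumerate(expected, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(output), len(expected))


output = "The maximum coverage is $500,000."
expected = "Maximum coverage is $500,000."
print(round(token_f1(output, expected), 3))  # 0.889
print(word_overlap(output, expected))        # 1.0
print(contains(output, expected))            # 0.0 (this sketch is case-sensitive)
```

Note how the same pair scores very differently across metrics, which is why the choice of metric per field matters.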

Choosing metrics

  • Use semantic_similarity for free-form text where meaning matters more than exact wording.
  • Use f1 or rouge_l for extractive QA where token overlap is meaningful.
  • Use exact_match for structured fields like codes, IDs, or categories.
  • Use contains to check if specific phrases or keywords appear in the output.
  • Combine multiple metrics per field for more robust evaluation.
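
One way to apply these guidelines consistently is to tag each output field with the kind of content it holds and derive the metrics map from the tags. The sketch below is only an illustration: METRICS_BY_KIND, its kind names, and build_metrics are hypothetical and not part of the SDK.

```python
# Hypothetical mapping from a field's kind to the metrics suggested above.
METRICS_BY_KIND = {
    "free_text": ["semantic_similarity"],
    "extractive": ["f1", "rouge_l"],
    "structured": ["exact_match"],
    "keyword": ["contains"],
}


def build_metrics(field_kinds: dict[str, list[str]]) -> dict[str, list[str]]:
    """Combine the metric lists for every kind assigned to a field."""
    metrics: dict[str, list[str]] = {}
    for field, kinds in field_kinds.items():
        combined: list[str] = []
        for kind in kinds:
            for name in METRICS_BY_KIND[kind]:
                if name not in combined:  # keep order, drop duplicates
                    combined.append(name)
        metrics[field] = combined
    return metrics


metrics = build_metrics({
    "diagnosis": ["free_text", "extractive"],
    "icd_code": ["structured"],
    "recommendation": ["free_text", "keyword"],
})
print(metrics["diagnosis"])  # ['semantic_similarity', 'f1', 'rouge_l']
```

The resulting dictionary has the shape expected by the metrics parameter of evaluate().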

Examples

Basic evaluation

import coalex

decision = coalex.evaluate(
    request_id="abc-123-def",
    input={"question": "What is the maximum coverage?"},
    output={"answer": "The maximum coverage is $500,000."},
    metrics={"answer": ["f1", "semantic_similarity"]},
)

print(decision.status)      # "auto_approved"
print(decision.risk_score)  # 0.12

import { evaluate } from "@coalex-ai/sdk";

const decision = await evaluate({
    requestId: "abc-123-def",
    input: { question: "What is the maximum coverage?" },
    output: { answer: "The maximum coverage is $500,000." },
    metrics: { answer: ["f1", "semantic_similarity"] },
});

console.log(decision.status);    // "auto_approved"
console.log(decision.riskScore); // 0.12

Multi-field evaluation

decision = coalex.evaluate(
    request_id="trace-789",
    input={
        "claim_text": "Patient reports knee pain after fall.",
    },
    output={
        "diagnosis": "Acute knee injury, likely meniscus tear",
        "icd_code": "S83.2",
        "recommendation": "MRI recommended within 48 hours",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
        "recommendation": ["semantic_similarity", "contains"],
    },
)

const decision = await evaluate({
    requestId: "trace-789",
    input: {
        claim_text: "Patient reports knee pain after fall.",
    },
    output: {
        diagnosis: "Acute knee injury, likely meniscus tear",
        icd_code: "S83.2",
        recommendation: "MRI recommended within 48 hours",
    },
    metrics: {
        diagnosis: ["semantic_similarity", "f1"],
        icd_code: ["exact_match"],
        recommendation: ["semantic_similarity", "contains"],
    },
});

Handling escalations

decision = coalex.evaluate(
    request_id="trace-456",
    input={"question": "Is this drug safe during pregnancy?"},
    output={"answer": "Yes, it is generally safe."},
    metrics={"answer": ["semantic_similarity", "f1"]},
)

if decision.status == "escalated":
    print(f"Escalated! ID: {decision.escalation_id}")
    print(f"Risk score: {decision.risk_score}")
    # Human review needed -- use coalex.resolve()
elif decision.status == "auto_approved":
    print("Output approved automatically.")
elif decision.status == "rejected":
    print("Output rejected -- do not serve to user.")

const decision = await evaluate({
    requestId: "trace-456",
    input: { question: "Is this drug safe during pregnancy?" },
    output: { answer: "Yes, it is generally safe." },
    metrics: { answer: ["semantic_similarity", "f1"] },
});

if (decision.status === "escalated") {
    console.log(`Escalated! ID: ${decision.escalationId}`);
    console.log(`Risk score: ${decision.riskScore}`);
    // Human review needed -- use resolve()
} else if (decision.status === "auto_approved") {
    console.log("Output approved automatically.");
} else if (decision.status === "rejected") {
    console.log("Output rejected -- do not serve to user.");
}

Validation

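  • Missing request ID: evaluate() raises TypeError if neither request_id nor the deprecated trace_id is provided.
  • Missing register(): Raises RuntimeError if register() has not been called.
  • Unknown metrics: If any metric name in metrics is not in the known metrics list, evaluate() raises ValueError before any network call is made.
  • HTTP errors: Error responses from the Coalex API (for example 4xx or 5xx) raise httpx.HTTPStatusError. Connection failures and timeouts (the request timeout is 30 seconds) surface as httpx.RequestError subclasses.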
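
The metric check can also be run ahead of time, for example in a unit test or at startup. The sketch below assumes the catalog is exactly the set listed under Known Metrics on this page; KNOWN_METRICS is redefined locally for illustration, and find_unknown_metrics is a hypothetical helper rather than an SDK function.

```python
# Assumption: the catalog matches the "Known Metrics" table on this page.
KNOWN_METRICS = frozenset({
    "f1", "word_overlap", "bleu", "rouge_l",
    "semantic_similarity", "exact_match", "levenshtein", "contains",
})


def find_unknown_metrics(metrics: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return {field: unknown metric names}; empty when everything is known."""
    problems: dict[str, set[str]] = {}
    for field, names in metrics.items():
        unknown = set(names) - KNOWN_METRICS
        if unknown:
            problems[field] = unknown
    return problems


requested = {
    "answer": ["f1", "semantic_similarity"],
    "icd_code": ["exact_match", "fuzzy_match"],  # "fuzzy_match" is not a known metric
}
print(find_unknown_metrics(requested))  # {'icd_code': {'fuzzy_match'}}
```

Unlike evaluate(), which raises on the first offending field, this variant collects every problem so they can be reported together.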

Notes

  • The evaluation is performed server-side by the Coalex platform. The SDK sends the input, output, and metrics to the API and receives the decision.
  • Metrics are declared at evaluate-time but computed at resolve-time from human corrections. When a human reviewer provides corrections via resolve(), metrics compare the original output against the corrections.
  • All evaluations (including auto_approved) are stored in the escalations table for a full audit trail.
  • The risk score is based on the agent's health score, not on metric comparisons at evaluate-time.
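
The split between declaring metrics and computing them can be illustrated with a small local model. This is a sketch under stated assumptions, not the platform's implementation: score_against_corrections and METRIC_FUNCS are hypothetical names, only two metrics are implemented, and skipping fields the reviewer did not correct is a guess at reasonable behavior.

```python
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def contains(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0


# Only two metrics are implemented here to keep the sketch short.
METRIC_FUNCS = {"exact_match": exact_match, "contains": contains}


def score_against_corrections(output, corrections, metrics):
    """Score each declared field against the reviewer's corrected value."""
    scores = {}
    for field, names in metrics.items():
        if field not in corrections:
            continue  # assumption: uncorrected fields are not scored
        scores[field] = {
            name: METRIC_FUNCS[name](output[field], corrections[field])
            for name in names
        }
    return scores


# Declared when evaluate() is called...
declared = {"category": ["exact_match"], "summary": ["contains"]}
agent_output = {
    "category": "billing",
    "summary": "Customer was charged twice for the May invoice.",
}

# ...and scored only once a reviewer supplies corrections.
corrections = {"category": "refund", "summary": "charged twice"}
print(score_against_corrections(agent_output, corrections, declared))
# {'category': {'exact_match': 0.0}, 'summary': {'contains': 1.0}}
```

Nothing is scored until corrections exist, which is consistent with the risk score at evaluate-time not depending on metric comparisons.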

API Reference

coalex.evaluate

SDK evaluate() -- submit agent output for risk assessment.

Classes

EvaluationDecision dataclass

Result of an evaluate() call.

Source code in coalex/evaluate.py
@dataclasses.dataclass(frozen=True)
class EvaluationDecision:
    """Result of an evaluate() call."""

    status: str  # "auto_approved" | "escalated" | "rejected"
    risk_score: float  # 0.0-1.0
    escalation_id: str | None  # present when status == "escalated"

Functions

evaluate

evaluate(
    *,
    request_id: str | None = None,
    trace_id: str | None = None,
    input: dict,
    output: dict,
    metrics: dict[str, list[str]] | None = None,
    metadata: dict | None = None,
    _config: CoalexConfig | None = None,
    _api_key: str | None = None,
) -> EvaluationDecision

Submit agent output for risk-based evaluation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| request_id | str \| None | Links to an observed request. | None |
| trace_id | str \| None | Deprecated: use request_id instead. | None |
| input | dict | Agent input data. | required |
| output | dict | Agent output data. | required |
| metrics | dict[str, list[str]] \| None | Map of output field names to metric function names (optional). | None |
| metadata | dict \| None | Metadata for field-specific quality focus (optional). | None |
| _config | CoalexConfig \| None | Override config (testing only). | None |
| _api_key | str \| None | Override API key (testing only). | None |

Source code in coalex/evaluate.py
def evaluate(
    *,
    request_id: str | None = None,
    trace_id: str | None = None,
    input: dict,
    output: dict,
    metrics: dict[str, list[str]] | None = None,
    metadata: dict | None = None,
    _config: CoalexConfig | None = None,
    _api_key: str | None = None,
) -> EvaluationDecision:
    """Submit agent output for risk-based evaluation.

    Args:
        request_id: Links to an observed request.
        trace_id: Deprecated — use request_id instead.
        input: Agent input data.
        output: Agent output data.
        metrics: Map of output field names to metric function names (optional).
        metadata: Metadata for field-specific quality focus (optional).
        _config: Override config (testing only).
        _api_key: Override API key (testing only).
    """
    if trace_id is not None and request_id is None:
        import warnings

        warnings.warn("trace_id is deprecated, use request_id", DeprecationWarning, stacklevel=2)
        request_id = trace_id
    if request_id is None:
        raise TypeError("evaluate() requires 'request_id' (or deprecated 'trace_id')")

    import coalex as _sdk

    cfg = _config or _sdk._config
    api_key = _api_key or _sdk._api_key
    if cfg is None:
        raise RuntimeError("coalex.register() must be called before evaluate()")

    # Validate metrics against known catalog (only when explicitly provided)
    if metrics:
        for field, metric_list in metrics.items():
            unknown = set(metric_list) - KNOWN_METRICS
            if unknown:
                raise ValueError(f"Unknown metrics for field '{field}': {unknown}")

    payload: dict = {
        "request_id": request_id,
        "input": input,
        "output": output,
    }
    if metrics:
        payload["metrics"] = metrics
    if metadata:
        payload["metadata"] = metadata

    with httpx.Client(timeout=30.0) as client:
        resp = client.post(
            f"{cfg.endpoint}/api/v1/evaluate",
            json=payload,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        resp.raise_for_status()

    data = resp.json()
    return EvaluationDecision(
        status=data["status"],
        risk_score=data["risk_score"],
        escalation_id=data.get("escalation_id"),
    )