Evaluate

The evaluation step assesses the risk of an agent's output and decides whether to auto-approve, escalate for human review, or reject.

How Risk Scoring Works

When you call evaluate(), the Coalex platform:

1. Looks up the agent's health score, which is based on historical evaluation outcomes.
2. Computes a risk score (0.0 to 1.0); higher scores indicate higher risk.
3. Applies policy thresholds by comparing the risk score against the agent's configured thresholds.
4. Returns a decision: auto_approved, escalated, or rejected.

graph LR
    A[evaluate call] --> B[Health Score Lookup]
    B --> C[Risk Score Computation]
    C --> D{Policy Thresholds}
    D -->|Low risk| E[auto_approved]
    D -->|Medium risk| F[escalated]
    D -->|High risk| G[rejected]
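
The threshold step in the flow above can be sketched as a pair of comparisons. This is only an illustration: the function name `decide`, the parameters `approve_below` and `reject_at`, and their default values are assumptions made for the example, not Coalex's actual configuration keys or scoring logic.

```python
def decide(risk_score, approve_below=0.3, reject_at=0.7):
    """Map a risk score in [0.0, 1.0] to a decision status.

    The threshold names and defaults are hypothetical placeholders.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0.0 and 1.0")
    if risk_score < approve_below:
        return "auto_approved"  # low risk
    if risk_score >= reject_at:
        return "rejected"       # high risk
    return "escalated"          # medium risk: needs human review
```

With these assumed thresholds, a score of 0.1 maps to auto_approved, 0.5 to escalated, and 0.9 to rejected.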

Declaring Metrics

Metrics are declared at evaluate time but computed at resolve time: you specify which metrics you want when you call evaluate(), and Coalex computes them once a human reviewer provides corrections.

decision = coalex.evaluate(
    request_id="req-123",
    input={"question": "What is the diagnosis?"},
    output={
        "diagnosis": "Acute bronchitis",
        "icd_code": "J20.9",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
    },
)

Each key in metrics must match a key in output. The values are lists of metric function names.
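
Because of the key-matching rule, it can help to check the two dictionaries before calling evaluate(). The helper below is a hypothetical client-side check, not part of the Coalex SDK; how the platform itself reports a mismatch is not covered here.

```python
def find_metric_problems(output, metrics):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical pre-flight check, not an SDK function.
    """
    problems = []
    for field, names in metrics.items():
        if field not in output:
            # Every metrics key must correspond to an output key.
            problems.append(f"{field}: no matching key in output")
        elif not isinstance(names, list) or not all(isinstance(n, str) for n in names):
            # Values must be lists of metric function names.
            problems.append(f"{field}: expected a list of metric names")
    return problems
```

An empty list means the metrics declaration is consistent with the output; anything else can be logged or raised before the request is sent.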

Available Metrics

| Metric | Best for |
| --- | --- |
| `semantic_similarity` | Free-form text where meaning matters |
| `f1` | Extractive QA with token overlap |
| `rouge_l` | Summary-style text |
| `bleu` | Translation-style outputs |
| `exact_match` | Codes, IDs, categories |
| `contains` | Checking for required phrases |
| `word_overlap` | Keyword presence |
| `levenshtein` | Edit distance comparison |

See the Metrics Catalog for full details.
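
To give a feel for what two of these metrics measure, here is a minimal sketch of an exact-match score and a token-overlap F1 score, comparing an agent's output with a reviewer's correction. The definitions follow the common textbook forms; the lowercasing and whitespace tokenization are assumptions, and Coalex's own implementations may normalize text differently.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings are identical after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing "acute bronchitis" with "chronic bronchitis" shares one of two tokens on each side, so precision and recall are both 0.5 and the F1 score is 0.5, while exact match on the same pair is 0.0.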

Decision Statuses

| Status | Risk Level | Action |
| --- | --- | --- |
| `auto_approved` | Low | Safe to serve to the user |
| `escalated` | Medium | Requires human review before serving |
| `rejected` | High | Do not serve; use a fallback |

All evaluations (including auto-approved ones) are stored in the escalations table to provide a full audit trail.
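
A caller might branch on the three statuses roughly as follows. This sketch takes the status as a plain string, since how it is read from the returned decision object is not shown on this page; `select_response` and its `fallback` parameter are hypothetical names.

```python
def select_response(status, output, fallback):
    """Choose what to serve for a given decision status.

    Returns None when the output must wait for human review.
    """
    if status == "auto_approved":
        return output    # safe to serve to the user
    if status == "escalated":
        return None      # hold until a reviewer resolves it
    if status == "rejected":
        return fallback  # do not serve the agent's output
    raise ValueError(f"unknown decision status: {status!r}")
```

Raising on an unrecognized status keeps an unexpected value from being treated as an approval by accident.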

Multi-Field Evaluation

Evaluate multiple output fields independently:

decision = coalex.evaluate(
    request_id="req-456",
    input={"claim_text": "Patient reports knee pain after fall."},
    output={
        "diagnosis": "Acute knee injury, likely meniscus tear",
        "icd_code": "S83.2",
        "recommendation": "MRI recommended within 48 hours",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
        "recommendation": ["semantic_similarity", "contains"],
    },
)

Metadata

Attach arbitrary metadata for filtering and analysis:

decision = coalex.evaluate(
    request_id="req-789",
    input={...},
    output={...},
    metrics={...},
    metadata={
        "model": "gpt-4o",
        "prompt_version": "v2.1",
        "user_segment": "enterprise",
    },
)

SDK Reference

evaluate() — Full API reference
resolve() — Resolve escalated evaluations