
Evaluate

The evaluation step assesses the risk of an agent's output and decides whether to auto-approve, escalate for human review, or reject.


How Risk Scoring Works

When you call evaluate(), the Coalex platform:

  1. Looks up the agent's health score, which is based on historical evaluation outcomes
  2. Computes a risk score (0.0 – 1.0), where higher scores indicate higher risk
  3. Applies policy thresholds by comparing the risk score against the agent's configured thresholds
  4. Returns a decision: auto_approved, escalated, or rejected

graph LR
    A[evaluate call] --> B[Health Score Lookup]
    B --> C[Risk Score Computation]
    C --> D{Policy Thresholds}
    D -->|Low risk| E[auto_approved]
    D -->|Medium risk| F[escalated]
    D -->|High risk| G[rejected]
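
The threshold step can be pictured with a minimal sketch. This is not Coalex's implementation: the `Thresholds` class, the `decide` function, and the cut-off values 0.3 and 0.7 are hypothetical, chosen only to show how a risk score in the 0.0 to 1.0 range could map onto the three decision statuses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; real values come from the agent's configured policy.
    escalate_at: float = 0.3
    reject_at: float = 0.7


def decide(risk_score: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a risk score in [0.0, 1.0] to a decision status."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0.0 and 1.0")
    if risk_score >= thresholds.reject_at:
        return "rejected"        # high risk
    if risk_score >= thresholds.escalate_at:
        return "escalated"       # medium risk
    return "auto_approved"       # low risk
```

Under these assumed thresholds, a score of 0.1 is auto-approved, 0.5 is escalated, and 0.9 is rejected.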

Declaring Metrics

Metrics are declared at evaluate-time but computed at resolve-time. You specify which metrics you want when you call evaluate(), and Coalex computes them later, once a human reviewer provides corrections to compare against.

decision = coalex.evaluate(
    request_id="req-123",
    input={"question": "What is the diagnosis?"},
    output={
        "diagnosis": "Acute bronchitis",
        "icd_code": "J20.9",
    },
    # Metric names per output field; declared here, computed at resolve-time.
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
    },
)

Each key in metrics must match a key in output. The values are lists of metric function names.
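
Since every metrics key has to match an output key, it can help to check the two dictionaries before calling evaluate(). The helper below is a hypothetical client-side sketch, not part of the Coalex SDK. It assumes only what is stated above: keys must match, and values are lists of metric names.

```python
def validate_metrics(output: dict, metrics: dict) -> list[str]:
    """Return a list of problems found; an empty list means the declaration looks valid."""
    problems = []
    for field, names in metrics.items():
        # Every metrics key must correspond to a field in the output.
        if field not in output:
            problems.append(f"metrics key {field!r} has no matching output key")
        # Values must be lists of metric names (strings).
        if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
            problems.append(f"metrics[{field!r}] must be a list of metric names")
    return problems


problems = validate_metrics(
    output={"diagnosis": "Acute bronchitis", "icd_code": "J20.9"},
    metrics={"diagnosis": ["semantic_similarity", "f1"], "icd_code": ["exact_match"]},
)
```

Here `problems` is empty because both metrics keys appear in the output and both values are lists of strings.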

Available Metrics

Metric               Best for
semantic_similarity  Free-form text where meaning matters
f1                   Extractive QA with token overlap
rouge_l              Summary-style text
bleu                 Translation-style outputs
exact_match          Codes, IDs, categories
contains             Checking for required phrases
word_overlap         Keyword presence
levenshtein          Edit distance comparison

See the Metrics Catalog for full details.
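
To give a feel for what a few of these metrics measure, here are textbook-style sketches of exact match, token-level F1, and Levenshtein distance. They are illustrative only. Coalex's own implementations may normalize text, tokenize, or scale scores differently, and the function names and signatures below are assumptions.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True when the two strings are identical after trimming whitespace."""
    return prediction.strip() == reference.strip()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (lowercased, whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

For example, comparing the ICD codes "J20.9" and "J20.8" gives an exact match of False and an edit distance of 1, which shows why exact_match suits codes while edit distance is more forgiving.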


Decision Statuses

Status         Risk Level  Action
auto_approved  Low         Safe to serve to the user
escalated      Medium      Requires human review before serving
rejected       High        Do not serve; use a fallback

All evaluations, including auto-approved ones, are stored in the escalations table to provide a full audit trail.
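
In application code, the three statuses typically translate into a small routing function. The sketch below takes the status as a plain string because this page does not show the shape of the object that evaluate() returns. The `route` function and its return format are hypothetical.

```python
from typing import Optional


def route(status: str, output: dict, fallback: dict) -> dict:
    """Decide what to serve for a given decision status."""
    served: Optional[dict]
    if status == "auto_approved":
        served, needs_review = output, False      # safe to serve as-is
    elif status == "escalated":
        served, needs_review = None, True         # hold until a human reviews it
    elif status == "rejected":
        served, needs_review = fallback, False    # never serve the agent output
    else:
        raise ValueError(f"unknown decision status: {status!r}")
    return {"serve": served, "needs_review": needs_review}
```

Raising on an unrecognized status keeps an unexpected value from being silently treated as approved.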


Multi-Field Evaluation

Evaluate multiple output fields independently:

decision = coalex.evaluate(
    request_id="req-456",
    input={"claim_text": "Patient reports knee pain after fall."},
    output={
        "diagnosis": "Acute knee injury, likely meniscus tear",
        "icd_code": "S83.2",
        "recommendation": "MRI recommended within 48 hours",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
        "recommendation": ["semantic_similarity", "contains"],
    },
)

Metadata

Attach arbitrary metadata for filtering and analysis:

decision = coalex.evaluate(
    request_id="req-789",
    input={...},
    output={...},
    metrics={...},
    metadata={
        "model": "gpt-4o",
        "prompt_version": "v2.1",
        "user_segment": "enterprise",
    },
)

SDK Reference