
Key Concepts

This page covers the core concepts you need to understand when integrating Coalex into your AI agent.


Agents

An agent is any AI-powered system that produces outputs for end users. In Coalex, agents are first-class entities with a unique agent_id, display name, and lifecycle status.

You register agents with declare_agent() so the dashboard recognizes them before traces arrive:

coalex.declare_agent(agent_id="claims-bot", display_name="Claims Bot")

Each agent has its own health score, escalation history, and policy configuration.


Traces & Spans

A trace represents a single end-to-end request through your agent. It contains one or more spans — individual operations like LLM calls, retrieval queries, or tool invocations.

Coalex uses OpenTelemetry and OpenInference to capture spans automatically when you call auto_instrument().

Trace: claims-bot / req-456
  ├── coalex_context (ROOT)
  │   ├── retrieve_documents (RETRIEVER)
  │   ├── ChatOpenAI (LLM)
  │   │     model: gpt-4o
  │   │     tokens_in: 214, tokens_out: 143
  │   └── parse_response (CHAIN)
  └── evaluate (internal)

Use coalex_context() to create a root span that tags all child spans with the agent ID and request ID.
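
To make the parent/child relationship concrete, here is a minimal sketch that models the coalex_context subtree of the example trace as a tree of spans in plain Python. It does not use Coalex or the OpenTelemetry SDK; the Span class and the walk helper are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative stand-in for a span; not a Coalex or OpenTelemetry type."""
    name: str
    kind: str
    children: list = field(default_factory=list)


def walk(span, depth=0):
    """Yield (depth, span) pairs depth-first, parents before children."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Mirror of the coalex_context subtree: one root span, three child operations.
root = Span("coalex_context", "ROOT", [
    Span("retrieve_documents", "RETRIEVER"),
    Span("ChatOpenAI", "LLM"),
    Span("parse_response", "CHAIN"),
])

for depth, span in walk(root):
    print("  " * depth + f"{span.name} ({span.kind})")
```

Every child inherits its position from the root, which is why tagging the root span with the agent ID and request ID is enough to identify the whole request.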


Evaluations

An evaluation is a risk assessment of your agent's output. Call evaluate() with the agent's input, its output, and the metrics you want computed if a reviewer later corrects that output:

decision = coalex.evaluate(
    request_id="req-456",
    input={"question": "What is my deductible?"},
    output={"answer": "Your deductible is $500."},
    metrics={"answer": ["semantic_similarity", "f1"]},
)

The Coalex platform assigns a risk score (0.0 – 1.0) based on the agent's health score and returns one of three statuses:

Status          Meaning
auto_approved   Low risk: safe to serve to the user
escalated       Medium/high risk: requires human review
rejected        High risk: do not serve to the user
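
One way to act on the three statuses is a small dispatch function. The sketch below is self-contained and does not call Coalex: the Decision class is a hypothetical stand-in for whatever evaluate() returns, its risk_score attribute name is an assumption, and the fallback text is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """Hypothetical stand-in for the object returned by evaluate()."""
    status: str
    risk_score: float  # attribute name is an assumption


FALLBACK = "We could not answer that automatically. A specialist will follow up."


def handle(decision: Decision, answer: str) -> Optional[str]:
    """Return text to serve now, or None when the output must wait for review."""
    if decision.status == "auto_approved":
        return answer
    if decision.status == "escalated":
        return None  # hold the output until a human reviewer resolves it
    if decision.status == "rejected":
        return FALLBACK
    raise ValueError(f"unknown status: {decision.status!r}")


print(handle(Decision("auto_approved", 0.1), "Your deductible is $500."))
```

Raising on an unknown status, rather than serving the answer by default, keeps an unexpected value from silently bypassing review.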

Escalations

When an evaluation returns status == "escalated", an escalation is created. Escalations represent agent outputs that need human review before being served to users.

Each escalation has:

  • A unique escalation_id
  • The original input and output
  • A risk score
  • A status (pending / approved / rejected / corrected)

Route escalations to human reviewers through your own UI, Slack, email, or any notification system.
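
If you build your own review tooling, you need some ordering for pending escalations. The following sketch queues them so the riskiest item is reviewed first. Ordering by risk score is an assumption rather than documented Coalex behavior, and the Escalation and ReviewQueue classes are hypothetical local types, not SDK objects.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical local record holding a subset of the fields listed above."""
    escalation_id: str
    risk_score: float
    status: str = "pending"


class ReviewQueue:
    """Pops the highest-risk escalation first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal risk scores

    def push(self, esc: Escalation) -> None:
        # heapq is a min-heap, so negate the risk to pop the largest first.
        heapq.heappush(self._heap, (-esc.risk_score, next(self._counter), esc))

    def pop(self) -> Escalation:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = ReviewQueue()
queue.push(Escalation("esc-001", 0.4))
queue.push(Escalation("esc-002", 0.9))
queue.push(Escalation("esc-003", 0.4))
print(queue.pop().escalation_id)
```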


Resolutions

A resolution is the human reviewer's decision on an escalation. Call resolve() to submit the decision:

Decision    Meaning
approved    The output is correct as-is
rejected    The output is incorrect: do not serve
corrected   The reviewer provides a corrected version

When a reviewer submits corrections, Coalex computes quality metrics (F1, semantic similarity, etc.) by comparing the original output against the corrections. These metrics feed back into the agent's health score.

result = coalex.resolve(
    escalation_id="esc-001",
    decision="corrected",
    corrections={"answer": "Your deductible is $1,000 for in-network providers."},
    reviewer={"name": "Dr. Smith", "email": "dr.smith@hospital.org"},
    reason="Incorrect deductible amount.",
)

Metrics

Metrics are quality scores computed when human reviewers provide corrections. They measure how close the agent's original output was to the corrected version.

Metric                Description
f1                    Token-level precision and recall
semantic_similarity   Cosine similarity between embeddings
exact_match           Binary: did the output match exactly?
rouge_l               Longest common subsequence overlap
bleu                  N-gram overlap (machine translation standard)
word_overlap          Fraction of expected words present
contains              Binary: is the expected text contained in the output?
levenshtein           Normalized edit distance

Metrics are declared at evaluate-time and computed at resolve-time. See the Metrics Catalog for the full schema.
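
For intuition about what these scores measure, here is a rough plain-Python sketch of four of the simpler metrics, applied to the deductible example from the resolve() call. It is not Coalex's implementation: the lowercase whitespace tokenizer and the lack of punctuation normalization are assumptions, so the platform's values may differ.

```python
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization; an assumption, not Coalex's tokenizer."""
    return text.lower().split()


def exact_match(output, expected):
    return 1.0 if output == expected else 0.0


def contains(output, expected):
    return 1.0 if expected in output else 0.0


def word_overlap(output, expected):
    """Fraction of distinct expected words that appear in the output."""
    want = set(tokens(expected))
    if not want:
        return 0.0
    return len(want & set(tokens(output))) / len(want)


def f1(output, expected):
    """Harmonic mean of token-level precision and recall (multiset overlap)."""
    out, exp = tokens(output), tokens(expected)
    common = sum((Counter(out) & Counter(exp)).values())
    if common == 0:
        return 0.0
    precision = common / len(out)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)


original = "Your deductible is $500."
corrected = "Your deductible is $1,000 for in-network providers."
print(round(f1(original, corrected), 3), round(word_overlap(original, corrected), 3))
```

Here the original shares three tokens with the correction, so precision is 3/4 and recall is 3/7. The score stays well above zero even though the one fact that mattered was wrong, which is a reason to declare more than one metric per field.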


Policies

Policies define the rules that govern how evaluations are handled:

  • Risk thresholds — what risk score triggers escalation vs. auto-approval
  • Escalation routing — which reviewers receive which types of escalations
  • Metric requirements — which metrics must be computed for each agent

Policies are configured per-agent in the dashboard. See Policies.
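
As a mental model for risk thresholds, the sketch below maps a risk score to a status using two cut-offs. The RiskThresholds class, its field names, and the values 0.3 and 0.8 are all hypothetical; real thresholds are set in the dashboard, and the platform may apply them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskThresholds:
    """Hypothetical per-agent thresholds; the values are made up for illustration."""
    escalate_at: float = 0.3
    reject_at: float = 0.8


def status_for(risk_score: float, thresholds: RiskThresholds) -> str:
    """Map a risk score in [0.0, 1.0] to one of the three evaluation statuses."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk score must be in [0.0, 1.0]")
    if risk_score >= thresholds.reject_at:
        return "rejected"
    if risk_score >= thresholds.escalate_at:
        return "escalated"
    return "auto_approved"


policy = RiskThresholds()
for score in (0.1, 0.5, 0.9):
    print(score, status_for(score, policy))
```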


Health Score

The health score is a rolling measure of agent reliability (0.0 – 1.0). It is computed from:

  • Historical evaluation outcomes (approval rate, rejection rate)
  • Quality metric trends (improving or degrading)
  • Escalation resolution patterns

A high health score means the agent is consistently producing correct outputs. A declining health score triggers more escalations and may change auto-approval thresholds.
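
This page does not give the health score formula. As an illustration of what a rolling measure can look like, the sketch below uses an exponential moving average, which is a swapped-in technique and not Coalex's method. The update_health function, the outcome encoding (1.0 for a good outcome, 0.0 for a bad one), and the alpha default are assumptions.

```python
def update_health(health: float, outcome: float, alpha: float = 0.1) -> float:
    """Blend the latest outcome (0.0 bad .. 1.0 good) into a rolling score.

    Exponential moving average: recent outcomes weigh more than old ones.
    This is an illustrative stand-in, not the platform's formula.
    """
    if not 0.0 <= outcome <= 1.0:
        raise ValueError("outcome must be in [0.0, 1.0]")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0.0, 1.0]")
    return (1 - alpha) * health + alpha * outcome


score = 0.5
for outcome in (1.0, 1.0, 0.0, 1.0):  # approved, approved, rejected, approved
    score = update_health(score, outcome)
print(round(score, 4))
```

A larger alpha makes the score react faster to a run of rejections, at the cost of more noise from single outcomes.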


The Evaluate-Resolve Loop

The core workflow in Coalex is the evaluate-resolve loop:

graph TD
    A[Agent produces output] --> B[evaluate]
    B -->|auto_approved| C[Serve to user]
    B -->|escalated| D[Human reviews]
    B -->|rejected| E[Fallback response]
    D -->|approved| C
    D -->|rejected| E
    D -->|corrected| F[Serve corrected output]
    F --> G[Metrics computed]
    G --> H[Health score updated]
    H --> B

Over time, as the health score improves, fewer outputs are escalated — your agent graduates from "pilot" to "production" with a full audit trail.
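
The decision edges of the diagram can also be read as a small state machine. The sketch below encodes them as a transition table and rejects any event that is not valid at the current step. The node names (serve, human_review, fallback, serve_corrected) are labels chosen here, not Coalex identifiers.

```python
# Decision edges of the evaluate-resolve diagram, keyed by (node, event).
TRANSITIONS = {
    ("evaluate", "auto_approved"): "serve",
    ("evaluate", "escalated"): "human_review",
    ("evaluate", "rejected"): "fallback",
    ("human_review", "approved"): "serve",
    ("human_review", "rejected"): "fallback",
    ("human_review", "corrected"): "serve_corrected",
}


def run(events):
    """Follow a sequence of events from evaluate and return the final node."""
    node = "evaluate"
    for event in events:
        try:
            node = TRANSITIONS[(node, event)]
        except KeyError:
            raise ValueError(f"{event!r} is not valid at {node!r}") from None
    return node


print(run(["escalated", "corrected"]))
```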

