Metrics Catalog

Coalex uses a unified metrics table to store all types of metrics from your AI agents. This document covers both observability metrics (automatically captured) and quality metrics (computed from human corrections).


Quality Metrics

Quality metrics are declared at evaluate() time and computed at resolve() time when a human reviewer provides corrections. They measure how close the agent's original output was to the corrected version.

| Metric | Description | Range | Best For |
| --- | --- | --- | --- |
| f1 | Token-level F1 score (harmonic mean of precision and recall) | 0.0 – 1.0 | Extractive QA |
| semantic_similarity | Cosine similarity between embedding vectors | 0.0 – 1.0 | Free-form text |
| exact_match | Binary: output matches expected exactly | 0.0 or 1.0 | Codes, IDs, categories |
| rouge_l | Longest common subsequence overlap | 0.0 – 1.0 | Summaries |
| bleu | N-gram overlap (machine translation standard) | 0.0 – 1.0 | Translation-style outputs |
| word_overlap | Fraction of expected words in the output | 0.0 – 1.0 | Keyword presence |
| contains | Binary: expected text contained in output | 0.0 or 1.0 | Required phrases |
| levenshtein | Normalized edit distance | 0.0 – 1.0 | Character-level comparison |
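
To make the token-based definitions concrete, here is a minimal Python sketch of f1, exact_match, contains, and word_overlap. It is not Coalex's implementation: the function names, the lowercase whitespace tokenizer, and the whitespace stripping in exact_match are all assumptions chosen for illustration, and the real normalization may differ.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the real tokenizer may differ.
    return text.lower().split()


def exact_match(output: str, expected: str) -> float:
    # Binary score: 1.0 only when the two strings are identical (ignoring outer whitespace).
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output: str, expected: str) -> float:
    # Binary score: 1.0 when the expected text appears verbatim inside the output.
    return 1.0 if expected in output else 0.0


def word_overlap(output: str, expected: str) -> float:
    # Fraction of distinct expected words that also occur in the output.
    expected_words = set(tokenize(expected))
    if not expected_words:
        return 0.0
    return len(expected_words & set(tokenize(output))) / len(expected_words)


def token_f1(output: str, expected: str) -> float:
    # Harmonic mean of token precision and recall, counting repeated tokens.
    out_tokens, exp_tokens = tokenize(output), tokenize(expected)
    common = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(out_tokens)
    recall = common / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the output "acute bronchitis" with the correction "acute viral bronchitis" gives a precision of 1.0 and a recall of 2/3 under this tokenizer, so the F1 score is 0.8 while exact_match is 0.0.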

Choosing Metrics

  • Use semantic_similarity for free-form text where meaning matters more than exact wording.
  • Use f1 for extractive QA and rouge_l for summaries, where token overlap is meaningful.
  • Use exact_match for structured fields like codes, IDs, or categories.
  • Use contains to check if specific phrases or keywords appear in the output.
  • Combine multiple metrics per field for more robust evaluation.
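
One reason to combine metrics is that character-level and sequence-level scores can disagree on the same pair of strings. The sketch below illustrates this with a normalized Levenshtein score and an LCS-based ROUGE-L score. Both are assumptions for illustration rather than Coalex's code: the Levenshtein score is normalized as a similarity (1.0 means identical), and ROUGE-L is computed as a plain F1 over whitespace tokens.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Dynamic-programming edit distance, keeping only one previous row.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    # Assumption: normalize by the longer string so 1.0 means identical.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def lcs_length(xs: list[str], ys: list[str]) -> int:
    # Length of the longest common subsequence of two token lists.
    previous = [0] * (len(ys) + 1)
    for x in xs:
        current = [0]
        for j, y in enumerate(ys, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(output: str, expected: str) -> float:
    # Assumption: ROUGE-L reported as the F1 of LCS-based precision and recall.
    out_tokens, exp_tokens = output.lower().split(), expected.lower().split()
    lcs = lcs_length(out_tokens, exp_tokens)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(out_tokens), lcs / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, the codes "J20.9" and "J18.9" differ by two substitutions and score 0.6 on character-level similarity, even though an exact_match check would score 0.0. For code-like fields that near-miss is usually still a wrong answer, which is why exact_match is the suggested metric there.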

Declaring Metrics

decision = coalex.evaluate(
    request_id="req-123",
    input={"question": "What is the diagnosis?"},
    output={
        "diagnosis": "Acute bronchitis",
        "icd_code": "J20.9",
    },
    metrics={
        "diagnosis": ["semantic_similarity", "f1"],
        "icd_code": ["exact_match"],
    },
)

Metric Computation

Metrics are computed when resolve(decision="corrected") is called:

result = coalex.resolve(
    escalation_id="esc-001",
    decision="corrected",
    corrections={"diagnosis": "Community-acquired pneumonia", "icd_code": "J18.9"},
    reviewer={"name": "Dr. Jones", "email": "dr.jones@hospital.org"},
)

for m in result.metrics:
    print(f"{m.field}: {m.metrics}")

# Example output (one line per field):
# diagnosis: {"semantic_similarity": 0.45, "f1": 0.32}
# icd_code: {"exact_match": 0.0}
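
Per-escalation scores are often more useful when averaged over many reviews. The sketch below shows one way to do that. FieldMetrics is a hypothetical stand-in that only mirrors the field and metrics attributes used in the loop above, average_scores is an illustrative helper rather than part of the SDK, and the second set of scores is made-up sample data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class FieldMetrics:
    # Hypothetical stand-in for the items in `result.metrics`.
    field: str
    metrics: dict[str, float]


def average_scores(results: list[list[FieldMetrics]]) -> dict[tuple[str, str], float]:
    """Average each (field, metric) pair across several resolved escalations."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for field_metrics in results:
        for m in field_metrics:
            for name, score in m.metrics.items():
                buckets[(m.field, name)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Two resolved escalations; the second one is invented sample data.
resolved = [
    [FieldMetrics("diagnosis", {"semantic_similarity": 0.45, "f1": 0.32}),
     FieldMetrics("icd_code", {"exact_match": 0.0})],
    [FieldMetrics("diagnosis", {"semantic_similarity": 0.85, "f1": 0.68}),
     FieldMetrics("icd_code", {"exact_match": 1.0})],
]
averages = average_scores(resolved)
```

Here averages maps keys such as ("icd_code", "exact_match") to the mean score for that field and metric.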

Observability Metrics

These metrics are captured automatically by auto-instrumentation and enriched by the Transformer.

Token Metrics

| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| input | tokens | count | Input/prompt tokens |
| output | tokens | count | Output/completion tokens |
| total | tokens | count | Total tokens (input + output) |

Performance Metrics

| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| latency | performance | ms | Request latency in milliseconds |
| throughput | performance | req/s | Requests per second |

Cost Metrics

| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| total | cost | USD | Total cost for the request |
| per_token | cost | USD | Cost per token |

Sustainability Metrics

Powered by ecologits.

| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| energy | sustainability | kWh | Energy consumption |
| gwp | sustainability | kgCO2eq | Global Warming Potential (carbon footprint) |
| adpe | sustainability | kgSbeq | Abiotic Depletion Potential for Elements |
| pe | sustainability | MJ | Primary Energy consumption |

Retrieval Metrics

| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| document_count | retrieval | count | Number of documents retrieved |
| avg_score | retrieval | score (0-1) | Average relevance score |

Unified Metrics Schema

All metrics are stored in a single table:

| Column | Type | Description |
| --- | --- | --- |
| id | integer | Auto-generated primary key |
| agent_id | text | Agent identifier |
| account_id | text | Account identifier |
| request_id | text | Session/request identifier |
| metric_id | text | Metric name (e.g., energy, input, latency) |
| metric_type | text | Metric category (e.g., sustainability, tokens, performance) |
| value | float | The numeric metric value |
| metadata | jsonb | Additional context (model_name, provider, etc.) |
| created_at | timestamp | When the metric was recorded |
| prompt_version | text | Prompt version for A/B testing |
| reviewing_agent_id | text | For human review workflows |
| task_id | text | Associated task identifier |

Metadata Structure

{
    "span_id": "b27edd86b453ff09",
    "model_name": "gpt-4o",
    "provider": "openai",
    "span_name": "ChatOpenAI",
    "unit": "kWh",
    "output_tokens": 143,
    "latency_ms": 2913.25
}

Querying Metrics

By Category

SELECT * FROM metrics
WHERE agent_id = 'claims-bot'
  AND metric_type = 'sustainability'
ORDER BY created_at DESC;

By Specific Metric

SELECT created_at, value, metadata->>'model_name' as model
FROM metrics
WHERE agent_id = 'claims-bot'
  AND metric_type = 'sustainability'
  AND metric_id = 'energy'
ORDER BY created_at DESC;

Aggregate by Request

SELECT metric_id, SUM(value) as total
FROM metrics
WHERE request_id = 'req-123'
  AND metric_type = 'tokens'
GROUP BY metric_id;

Compare Prompt Versions

SELECT
    prompt_version,
    metric_id,
    AVG(value) as avg_value,
    COUNT(*) as samples
FROM metrics
WHERE agent_id = 'claims-bot'
  AND prompt_version IS NOT NULL
GROUP BY prompt_version, metric_id;
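
To experiment with these queries without a Coalex deployment, the following sketch recreates the schema in an in-memory SQLite database and runs the aggregate-by-request and prompt-version queries from Python. It is only an approximation: SQLite has no jsonb type, so metadata is stored as JSON text, the column types are simplified, and all inserted rows are invented sample data.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        agent_id TEXT,
        account_id TEXT,
        request_id TEXT,
        metric_id TEXT,
        metric_type TEXT,
        value REAL,
        metadata TEXT,  -- JSON as text here; the documented column type is jsonb
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        prompt_version TEXT,
        reviewing_agent_id TEXT,
        task_id TEXT
    )
""")

# Invented sample rows: (agent_id, request_id, metric_id, metric_type, value, prompt_version)
rows = [
    ("claims-bot", "req-123", "input", "tokens", 120.0, "v1"),
    ("claims-bot", "req-123", "output", "tokens", 143.0, "v1"),
    ("claims-bot", "req-123", "output", "tokens", 57.0, "v1"),
    ("claims-bot", "req-124", "latency", "performance", 2000.0, "v1"),
    ("claims-bot", "req-125", "latency", "performance", 1000.0, "v2"),
]
metadata = json.dumps({"model_name": "gpt-4o", "provider": "openai"})
conn.executemany(
    "INSERT INTO metrics (agent_id, request_id, metric_id, metric_type, value, "
    "prompt_version, metadata) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [row + (metadata,) for row in rows],
)

# Aggregate by request: total tokens per metric_id for one request.
token_totals = dict(conn.execute(
    "SELECT metric_id, SUM(value) FROM metrics "
    "WHERE request_id = ? AND metric_type = 'tokens' GROUP BY metric_id",
    ("req-123",),
).fetchall())

# Compare prompt versions: average value and sample count per version and metric.
by_version = conn.execute(
    "SELECT prompt_version, metric_id, AVG(value), COUNT(*) FROM metrics "
    "WHERE agent_id = ? AND prompt_version IS NOT NULL "
    "GROUP BY prompt_version, metric_id ORDER BY prompt_version, metric_id",
    ("claims-bot",),
).fetchall()
```

Passing values such as request_id as bound parameters, rather than formatting them into the SQL string, keeps the queries safe when the identifiers come from user input.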

Best Practices

  1. Use request_id for session tracking — Group all metrics from a single request with coalex_context().
  2. Use prompt_version for A/B testing — Compare metrics across prompt variants.
  3. Monitor cost and sustainability together — Track both financial and environmental impact.
  4. Combine quality metrics — Use multiple metrics per field for robust evaluation.