Metrics Catalog
Coalex uses a unified metrics table to store all types of metrics from your AI agents. This document covers both observability metrics (automatically captured) and quality metrics (computed from human corrections).
Quality Metrics
Quality metrics are declared at evaluate() time and computed at resolve() time when a human reviewer provides corrections. They measure how close the agent's original output was to the corrected version.
| Metric | Description | Range | Best For |
| --- | --- | --- | --- |
| f1 | Token-level F1 score (harmonic mean of precision and recall) | 0.0 – 1.0 | Extractive QA |
| semantic_similarity | Cosine similarity between embedding vectors | 0.0 – 1.0 | Free-form text |
| exact_match | Binary: output matches expected exactly | 0.0 or 1.0 | Codes, IDs, categories |
| rouge_l | Longest common subsequence overlap | 0.0 – 1.0 | Summaries |
| bleu | N-gram overlap (machine translation standard) | 0.0 – 1.0 | Translation-style outputs |
| word_overlap | Fraction of expected words in the output | 0.0 – 1.0 | Keyword presence |
| contains | Binary: expected text contained in output | 0.0 or 1.0 | Required phrases |
| levenshtein | Normalized edit distance | 0.0 – 1.0 | Character-level comparison |
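To make the token-based and binary metrics concrete, here is a minimal sketch of how f1, exact_match, contains, and word_overlap could be computed. It is not Coalex's implementation: the whitespace tokenizer, the case handling, and the empty-input behavior are all assumptions chosen for illustration.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def f1(output: str, expected: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    out_tokens, exp_tokens = tokenize(output), tokenize(expected)
    if not out_tokens or not exp_tokens:
        # Assumption: two empty strings match perfectly, one empty string does not.
        return float(out_tokens == exp_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    shared = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(out_tokens)
    recall = shared / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings are identical after trimming whitespace, else 0.0."""
    return float(output.strip() == expected.strip())


def contains(output: str, expected: str) -> float:
    """1.0 if the expected text appears in the output (case-insensitive)."""
    return float(expected.lower() in output.lower())


def word_overlap(output: str, expected: str) -> float:
    """Fraction of distinct expected words that also appear in the output."""
    exp_words = set(tokenize(expected))
    if not exp_words:
        return 0.0
    return len(exp_words & set(tokenize(output))) / len(exp_words)
```

For example, under these assumptions f1("acute viral bronchitis", "acute bronchitis") has precision 2/3 and recall 1, which gives 0.8, while exact_match("J20.9", "J18.9") is 0.0.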
Choosing Metrics
- Use semantic_similarity for free-form text where meaning matters more than exact wording.
- Use f1 or rouge_l for extractive QA where token overlap is meaningful.
- Use exact_match for structured fields like codes, IDs, or categories.
- Use contains to check if specific phrases or keywords appear in the output.
- Combine multiple metrics per field for more robust evaluation.
Declaring Metrics
    decision = coalex.evaluate(
        request_id="req-123",
        input={"question": "What is the diagnosis?"},
        output={
            "diagnosis": "Acute bronchitis",
            "icd_code": "J20.9",
        },
        # Map each output field to the quality metrics to compute for it.
        metrics={
            "diagnosis": ["semantic_similarity", "f1"],
            "icd_code": ["exact_match"],
        },
    )
Metric Computation
Metrics are computed when resolve(decision="corrected") is called:
    result = coalex.resolve(
        escalation_id="esc-001",
        decision="corrected",
        corrections={"diagnosis": "Community-acquired pneumonia", "icd_code": "J18.9"},
        reviewer={"name": "Dr. Jones", "email": "dr.jones@hospital.org"},
    )

    for m in result.metrics:
        print(f"{m.field}: {m.metrics}")
    # diagnosis: {"semantic_similarity": 0.45, "f1": 0.32}
    # icd_code: {"exact_match": 0.0}
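The per-field computation described above can be pictured roughly as follows. This is a sketch only: METRIC_FUNCS and compute_field_metrics are hypothetical names, just two simple metrics are registered, and the actual SDK may tokenize, normalize, or structure its results differently.

```python
from typing import Callable

# Hypothetical registry of metric functions; each takes (output, expected).
METRIC_FUNCS: dict[str, Callable[[str, str], float]] = {
    "exact_match": lambda out, exp: float(out.strip() == exp.strip()),
    "contains": lambda out, exp: float(exp.lower() in out.lower()),
}


def compute_field_metrics(
    output: dict[str, str],
    corrections: dict[str, str],
    declared: dict[str, list[str]],
) -> dict[str, dict[str, float]]:
    """Score each corrected field with the metrics declared for that field."""
    results: dict[str, dict[str, float]] = {}
    for field_name, corrected in corrections.items():
        # Fields without declared metrics get an empty result.
        names = declared.get(field_name, [])
        results[field_name] = {
            name: METRIC_FUNCS[name](output[field_name], corrected)
            for name in names
        }
    return results


scores = compute_field_metrics(
    output={"diagnosis": "Acute bronchitis", "icd_code": "J20.9"},
    corrections={"icd_code": "J18.9"},
    declared={"icd_code": ["exact_match", "contains"]},
)
print(scores)
```

Only fields that the reviewer corrected are scored here, which is one plausible reading of the behavior; whether uncorrected fields are also scored is not stated in this document.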
Observability Metrics
These metrics are captured automatically by auto-instrumentation and enriched by the Transformer.
Token Metrics
| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| input | tokens | count | Input/prompt tokens |
| output | tokens | count | Output/completion tokens |
| total | tokens | count | Total tokens (input + output) |
Performance Metrics
| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| latency | performance | ms | Request latency in milliseconds |
| throughput | performance | req/s | Requests per second |
Cost Metrics
| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| total | cost | USD | Total cost for the request |
| per_token | cost | USD | Cost per token |
Sustainability Metrics
Powered by ecologits.
| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| energy | sustainability | kWh | Energy consumption |
| gwp | sustainability | kgCO2eq | Global Warming Potential (carbon footprint) |
| adpe | sustainability | kgSbeq | Abiotic Depletion Potential for Elements |
| pe | sustainability | MJ | Primary Energy consumption |
Retrieval Metrics
| Metric ID | Category | Unit | Description |
| --- | --- | --- | --- |
| document_count | retrieval | count | Number of documents retrieved |
| avg_score | retrieval | score (0-1) | Average relevance score |
Unified Metrics Schema
All metrics are stored in a single table:
| Column | Type | Description |
| --- | --- | --- |
| id | integer | Auto-generated primary key |
| agent_id | text | Agent identifier |
| account_id | text | Account identifier |
| request_id | text | Session/request identifier |
| metric_id | text | Metric name (e.g., energy, input, latency) |
| metric_type | text | Metric category (e.g., sustainability, tokens, performance) |
| value | float | The numeric metric value |
| metadata | jsonb | Additional context (model_name, provider, etc.) |
| created_at | timestamp | When the metric was recorded |
| prompt_version | text | Prompt version for A/B testing |
| reviewing_agent_id | text | For human review workflows |
| task_id | text | Associated task identifier |
Example metadata value:

    {
      "span_id": "b27edd86b453ff09",
      "model_name": "gpt-4o",
      "provider": "openai",
      "span_name": "ChatOpenAI",
      "unit": "kWh",
      "output_tokens": 143,
      "latency_ms": 2913.25
    }
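As a rough illustration of the schema, one metric row could be modeled in Python like this. MetricRow is a hypothetical helper, not part of the Coalex SDK, and the account ID and energy value below are made up for the example. The id column is omitted because the table generates it.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class MetricRow:
    """Hypothetical in-memory mirror of one row in the unified metrics table."""

    agent_id: str
    account_id: str
    request_id: str
    metric_id: str
    metric_type: str
    value: float
    metadata: dict[str, Any] = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    prompt_version: Optional[str] = None
    reviewing_agent_id: Optional[str] = None
    task_id: Optional[str] = None


row = MetricRow(
    agent_id="claims-bot",
    account_id="acct-1",  # illustrative account ID
    request_id="req-123",
    metric_id="energy",
    metric_type="sustainability",
    value=0.00042,  # illustrative value in kWh, not a measured result
    metadata={"model_name": "gpt-4o", "provider": "openai", "unit": "kWh"},
)

# Serialize for storage: metadata becomes a JSON string, the timestamp ISO 8601.
record = asdict(row)
record["metadata"] = json.dumps(record["metadata"])
record["created_at"] = record["created_at"].isoformat()
```

Keeping the unit and model details in metadata rather than in dedicated columns is what lets a single table hold token, cost, sustainability, and quality metrics side by side.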
Querying Metrics
By Category
    SELECT * FROM metrics
    WHERE agent_id = 'claims-bot'
      AND metric_type = 'sustainability'
    ORDER BY created_at DESC;
By Specific Metric
    SELECT created_at, value, metadata->>'model_name' AS model
    FROM metrics
    WHERE agent_id = 'claims-bot'
      AND metric_type = 'sustainability'
      AND metric_id = 'energy'
    ORDER BY created_at DESC;
Aggregate by Request
    SELECT metric_id, SUM(value) AS total
    FROM metrics
    WHERE request_id = 'req-123'
      AND metric_type = 'tokens'
    GROUP BY metric_id;
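The per-request aggregation can be tried locally with a small sketch. It uses an in-memory SQLite database as a stand-in for the real metrics store, keeps only the columns the query needs, and inserts made-up rows; how you connect to your actual Coalex database is not covered here.

```python
import sqlite3

# In-memory stand-in for the metrics table (reduced to the columns queried).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics "
    "(request_id TEXT, metric_id TEXT, metric_type TEXT, value REAL)"
)

# Illustrative rows: two LLM calls in req-123, plus unrelated data.
rows = [
    ("req-123", "input", "tokens", 120.0),
    ("req-123", "output", "tokens", 143.0),
    ("req-123", "input", "tokens", 80.0),
    ("req-123", "latency", "performance", 2913.25),
    ("req-999", "input", "tokens", 50.0),
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)

# Same aggregation as the query above, with the request ID parameterized.
totals = dict(
    conn.execute(
        """
        SELECT metric_id, SUM(value) AS total
        FROM metrics
        WHERE request_id = ? AND metric_type = 'tokens'
        GROUP BY metric_id
        """,
        ("req-123",),
    ).fetchall()
)
print(totals)
```

With these sample rows, the two input rows for req-123 sum to 200.0 and the output total is 143.0; the latency row and the req-999 row are filtered out.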
Compare Prompt Versions
    SELECT
      prompt_version,
      metric_id,
      AVG(value) AS avg_value,
      COUNT(*) AS samples
    FROM metrics
    WHERE agent_id = 'claims-bot'
      AND prompt_version IS NOT NULL
    GROUP BY prompt_version, metric_id;
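A similar sketch shows what the prompt-version comparison returns. Again, the in-memory SQLite table and the latency values are assumptions for illustration, and an ORDER BY is added so the output order is deterministic.

```python
import sqlite3

# In-memory stand-in for the metrics table (reduced to the columns queried).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics "
    "(agent_id TEXT, prompt_version TEXT, metric_id TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("claims-bot", "v1", "latency", 3000.0),
        ("claims-bot", "v1", "latency", 2000.0),
        ("claims-bot", "v2", "latency", 1500.0),
        ("claims-bot", None, "latency", 9999.0),  # no prompt_version: excluded
    ],
)

comparison = conn.execute(
    """
    SELECT prompt_version, metric_id, AVG(value) AS avg_value, COUNT(*) AS samples
    FROM metrics
    WHERE agent_id = ? AND prompt_version IS NOT NULL
    GROUP BY prompt_version, metric_id
    ORDER BY prompt_version
    """,
    ("claims-bot",),
).fetchall()

for version, metric, avg_value, samples in comparison:
    print(f"{version} {metric}: avg={avg_value} (n={samples})")
```

The samples column matters when reading the result: an average over one row, as for v2 here, is weak evidence compared with an average over many requests.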
Best Practices
- Use request_id for session tracking: Group all metrics from a single request with coalex_context().
- Use prompt_version for A/B testing: Compare metrics across prompt variants.
- Monitor cost and sustainability together: Track both financial and environmental impact.
- Combine quality metrics: Use multiple metrics per field for robust evaluation.