Scoring & Metrics¶
This guide covers how the evaluation harness scores model outputs against gold-standard references, including the scorer pipeline, metrics, reductions, and aggregations.
Overview¶
After the model produces responses and any filters are applied, the scoring pipeline turns those responses into final metric values:
flowchart LR
A["Filtered Responses"] --> B["Score"]
B --> C["Reduce"]
C --> D["Aggregate"]
B -. "per-sample<br>metric values" .-> C
C -. "per-doc<br>reduced values" .-> D
D -. "single<br>score" .-> E["Results"]
- Score: Compare each response to the gold reference (e.g., does the output match the correct answer?)
- Reduce: If `repeats > 1`, collapse multiple scores per document into one (e.g., take the first, majority vote, mean)
- Aggregate: Combine per-document scores into a single number (e.g., mean accuracy across all documents)
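The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's implementation: the function names (`score`, `reduce_take_first`, `reduce_mean`, `run_pipeline`) and the document layout (`gold`, `responses`) are assumptions made for the example.

```python
from statistics import mean


def score(response: str, gold: str) -> float:
    """Score step: 1.0 if the response equals the gold answer, else 0.0."""
    return float(response.strip() == gold.strip())


def reduce_take_first(scores: list[float]) -> float:
    """Reduce step: keep only the first repeat's score."""
    return scores[0]


def reduce_mean(scores: list[float]) -> float:
    """Reduce step: average the scores of all repeats."""
    return mean(scores)


def run_pipeline(docs: list[dict], reduce_fn=reduce_take_first) -> float:
    """Score every response, reduce per document, then aggregate with a mean."""
    per_doc = []
    for doc in docs:
        per_sample = [score(r, doc["gold"]) for r in doc["responses"]]
        per_doc.append(reduce_fn(per_sample))
    return mean(per_doc)  # Aggregate step


# Hypothetical data: two documents, two repeats each.
docs = [
    {"gold": "4", "responses": ["4", "5"]},
    {"gold": "Paris", "responses": ["Paris", "Paris"]},
]

first_only = run_pipeline(docs, reduce_take_first)  # 1.0
averaged = run_pipeline(docs, reduce_mean)          # 0.75
```

The two calls differ only in the reduction: taking the first repeat hides the wrong second answer on the first document, while averaging exposes it.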
Default behavior¶
For most tasks, you don't need to configure scoring explicitly:
- Multiple-choice tasks (`output_type: multiple_choice`): Automatically scored with `acc` (accuracy) and `acc_norm` (length-normalized accuracy)
- Generation tasks (`output_type: generate_until`): No default metrics; you must specify `metric_list`
- Loglikelihood tasks (`output_type: loglikelihood_rolling`): Automatically scored with `perplexity`, `word_perplexity`, `byte_perplexity`, and `bits_per_byte`
Configuring metrics in YAML¶
Basic metric_list¶
List the metrics you want to compute under `metric_list`. Each entry is a `MetricConfig` with these fields:
| Field | Type | Description |
|---|---|---|
| `metric` | `str` (required) | Name of a registered metric (e.g., `acc`, `exact_match`, `bleu`) |
| `aggregation` | `str` | How to combine per-doc scores into one number (default: metric's registered default) |
| `higher_is_better` | `bool` | Whether higher values indicate better performance (default: metric's registered default) |
| `reduction` | `str` | How to reduce repeated runs per document (default: `take_first`) |
| `kwargs` | `dict` | Extra arguments forwarded to the metric function |
Passing arguments to metrics¶
Extra fields in a metric config entry are treated as kwargs:
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true  # Passed as kwarg to exact_match()
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
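To make the kwarg forwarding concrete, here is a hedged Python sketch of how an entry like the one above could be split into known fields and metric kwargs, and what those kwargs might do to an exact-match comparison. The helper `split_metric_entry`, the `RESERVED` set, and this `exact_match` body are illustrative assumptions, not the harness's actual code.

```python
import re
import string

# Field names taken from the MetricConfig table; everything else is a kwarg.
RESERVED = {"metric", "aggregation", "higher_is_better", "reduction", "kwargs"}


def split_metric_entry(entry: dict) -> tuple[dict, dict]:
    """Separate MetricConfig fields from extra fields treated as kwargs."""
    known = {k: v for k, v in entry.items() if k in RESERVED and k != "kwargs"}
    kwargs = dict(entry.get("kwargs", {}))
    kwargs.update({k: v for k, v in entry.items() if k not in RESERVED})
    return known, kwargs


def exact_match(prediction, reference, ignore_case=False,
                ignore_punctuation=False, regexes_to_ignore=None):
    """Illustrative exact match: normalize both strings, then compare."""
    for pattern in regexes_to_ignore or []:
        prediction = re.sub(pattern, "", prediction)
        reference = re.sub(pattern, "", reference)
    if ignore_case:
        prediction, reference = prediction.lower(), reference.lower()
    if ignore_punctuation:
        table = str.maketrans("", "", string.punctuation)
        prediction, reference = prediction.translate(table), reference.translate(table)
    return float(prediction == reference)


# The YAML entry above, as it would look once parsed into a dict.
entry = {
    "metric": "exact_match",
    "aggregation": "mean",
    "higher_is_better": True,
    "ignore_case": True,
    "ignore_punctuation": False,
    "regexes_to_ignore": [",", r"\$"],
}

known, kwargs = split_metric_entry(entry)
result = exact_match("$1,000 Dollars", "1000 dollars", **kwargs)  # 1.0
```

Note that the YAML string `"\\$"` parses to the regex `\$`, which matches a literal dollar sign.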
Built-in metrics¶
Metrics¶
| Name | Description | Output types |
|---|---|---|
| `acc` | Accuracy (correct / total) | `multiple_choice` |
| `acc_norm` | Length-normalized accuracy | `multiple_choice` |
| `acc_mutual_info` | Baseline loglikelihood-normalized accuracy | `multiple_choice` |
| `exact_match` | Exact string match | `generate_until` |
| `perplexity` | Perplexity | `loglikelihood_rolling` |
| `word_perplexity` | Per-word perplexity | `loglikelihood_rolling` |
| `byte_perplexity` | Per-byte perplexity | `loglikelihood_rolling` |
| `bits_per_byte` | Bits per byte | `loglikelihood_rolling` |
| `matthews_corrcoef` | Matthews correlation coefficient | `multiple_choice` |
| `f1` | F1 score | `multiple_choice` |
| `bleu` | BLEU score (corpus-level) | `generate_until` |
| `chrf` | chrF score (corpus-level) | `generate_until` |
| `ter` | Translation error rate (corpus-level) | `generate_until` |
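The difference between `acc` and `acc_norm` is easiest to see on a single multiple-choice document. The sketch below assumes that length normalization divides each choice's log-likelihood by the character length of the choice text; the function signatures are illustrative and may not match the harness.

```python
def acc(loglikelihoods: list[float], gold_index: int) -> float:
    """Pick the choice with the highest raw log-likelihood."""
    predicted = max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)
    return float(predicted == gold_index)


def acc_norm(loglikelihoods: list[float], choices: list[str], gold_index: int) -> float:
    """Pick the choice with the highest log-likelihood per character (assumed normalization)."""
    normalized = [ll / len(choice) for ll, choice in zip(loglikelihoods, choices)]
    predicted = max(range(len(normalized)), key=normalized.__getitem__)
    return float(predicted == gold_index)


# Hypothetical document: the long gold answer has a lower total log-likelihood,
# but a higher log-likelihood per character.
choices = ["yes", "it depends on the context"]
loglikelihoods = [-4.0, -10.0]
gold_index = 1

raw = acc(loglikelihoods, gold_index)                   # 0.0
normed = acc_norm(loglikelihoods, choices, gold_index)  # 1.0
```

Raw accuracy penalizes long answers because every extra token lowers the total log-likelihood; normalization removes that bias.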
All metrics supported by HuggingFace Evaluate can also be used — if a metric name isn't recognized as a built-in, the harness will attempt to load it from HF Evaluate.
Aggregation functions¶
| Name | Description |
|---|---|
| `mean` | Arithmetic mean |
| `median` | Median |
| `perplexity` | Exponential of the negative mean log-likelihood |
| `weighted_perplexity` | Token-weighted perplexity |
| `bits_per_byte` | Bits per byte aggregation |
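The perplexity-style aggregations follow the usual textbook definitions, sketched below. The input shapes, a list of log-likelihoods or a list of `(loglikelihood, weight)` pairs, are assumptions for this example; the harness's own item format may differ.

```python
import math


def perplexity(loglikelihoods: list[float]) -> float:
    """exp of the negative mean log-likelihood."""
    return math.exp(-sum(loglikelihoods) / len(loglikelihoods))


def weighted_perplexity(items: list[tuple[float, int]]) -> float:
    """exp of the negative total log-likelihood divided by the total weight (e.g. token count)."""
    total_ll = sum(ll for ll, _ in items)
    total_weight = sum(weight for _, weight in items)
    return math.exp(-total_ll / total_weight)


def bits_per_byte(items: list[tuple[float, int]]) -> float:
    """Negative log-likelihood per byte, converted from nats to bits."""
    total_ll = sum(ll for ll, _ in items)
    total_bytes = sum(n_bytes for _, n_bytes in items)
    return -total_ll / total_bytes / math.log(2)


# Two samples, each assigned probability 0.5: perplexity is 2.
ppl = perplexity([math.log(0.5), math.log(0.5)])

# Three units in total with an average probability of 0.5 per unit.
wppl = weighted_perplexity([(math.log(0.25), 2), (math.log(0.5), 1)])

# Two bytes with total probability 0.25 costs one bit per byte.
bpb = bits_per_byte([(math.log(0.25), 2)])
```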
Reduction functions (for repeats > 1)¶
| Name | Description |
|---|---|
| `take_first` | Use only the first repeat's score (default) |
| `mean` | Average across all repeats |
| `pass_at_k` | Probabilistic estimate of passing at least once in k attempts (standard Codex formula) |
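For reference, the standard Codex estimator is `1 - C(n - c, k) / C(n, k)`, where `n` is the number of repeats, `c` the number that passed, and `k` the attempt budget. The sketch below implements that formula directly; the function signature is an assumption and may not match the harness's `pass_at_k` reduction.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n repeats is among the c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 5 repeats, 2 of which passed.
p1 = pass_at_k(n=5, c=2, k=1)  # 0.4
p2 = pass_at_k(n=5, c=2, k=2)  # 0.7
```

With `k = 1` the estimate reduces to the plain pass rate `c / n`.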
Scorers¶
Scorers encapsulate the full scoring pipeline — which filters to apply, which metrics to compute, and how to aggregate. For most tasks, metric_list and filter_list are sufficient and scorers are configured implicitly. Use explicit scorer config when you need custom scoring logic like LLM-as-judge, code execution, or reusable scoring patterns across tasks. See Custom Scorers for implementation details.
Built-in scorers¶
| Scorer | For | Description |
|---|---|---|
| `GenScorer` | `generate_until` tasks | Configurable generation scoring |
| `LLScorer` | `loglikelihood` / `multiple_choice` tasks | Log-likelihood based scoring |
| `ChoiceMatchScorer` | `generate_until` tasks | Extract and match a letter answer (A/B/C/D) |
| `FirstTokenScorer` | `generate_until` tasks | Score based on the first generated token |
| `RegexExtractionScorer` | `generate_until` tasks | Extract an answer using a regex pattern |
Configuring scorers in YAML¶
Scorers are configured via the scorer field or are auto-configured by formats:
# Explicit scorer configuration
scorer:
  type: regex_extraction
  kwargs:
    regex_pattern: "The answer is (\\w+)"
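As a rough picture of what a regex-extraction scorer does with that pattern, the sketch below extracts the first capture group and compares it to the gold answer. The helper names and the `"[invalid]"` fallback value are assumptions for illustration, not the behavior of `RegexExtractionScorer` itself.

```python
import re


def regex_extract(response: str, regex_pattern: str, fallback: str = "[invalid]") -> str:
    """Return the first capture group of the first match, or a fallback marker."""
    match = re.search(regex_pattern, response)
    return match.group(1) if match else fallback


def score_with_regex(response: str, gold: str, regex_pattern: str) -> float:
    """Extract an answer from the response, then exact-match it against gold."""
    return float(regex_extract(response, regex_pattern) == gold)


# The YAML string "The answer is (\\w+)" parses to this regex.
pattern = r"The answer is (\w+)"

extracted = regex_extract("Let me think. The answer is 42.", pattern)     # "42"
hit = score_with_regex("Let me think. The answer is 42.", "42", pattern)  # 1.0
miss = score_with_regex("I am not sure.", "42", pattern)                  # 0.0
```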
In practice, most tasks use the implicit scorer that is set up by the filter_list and metric_list fields, or by the formats field if using prompt formats.
Precedence rules¶
When the harness resolves scoring configuration, the precedence is, from highest to lowest:
- Explicit config: `filter_list` / `metric_list` in the YAML or passed as overrides
- Scorer class defaults: `default_filter_cfg` / `default_metric_cfg` ClassVars on the scorer
- Fallback: `noop` filter / output-type-appropriate default metrics
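The lookup order can be sketched as a simple chain of fallbacks. Only the attribute name `default_metric_cfg` comes from the text above; the scorer classes, the `FALLBACK_METRICS` table, and `resolve_metric_list` are hypothetical, and filter resolution would presumably follow the same shape.

```python
from typing import Optional


class ExampleScorer:
    """Hypothetical scorer that declares class-level defaults."""
    default_filter_cfg = [{"function": "take_first"}]
    default_metric_cfg = [{"metric": "exact_match"}]


class BareScorer:
    """Hypothetical scorer with no class-level defaults."""


# Assumed fallback table keyed by output type.
FALLBACK_METRICS = {
    "multiple_choice": [{"metric": "acc"}, {"metric": "acc_norm"}],
    "generate_until": [],
}


def resolve_metric_list(explicit: Optional[list], scorer_cls: type, output_type: str) -> list:
    """Explicit config wins, then scorer class defaults, then the fallback."""
    if explicit is not None:
        return explicit
    defaults = getattr(scorer_cls, "default_metric_cfg", None)
    if defaults is not None:
        return defaults
    return FALLBACK_METRICS.get(output_type, [])


from_yaml = resolve_metric_list([{"metric": "bleu"}], ExampleScorer, "generate_until")
from_class = resolve_metric_list(None, ExampleScorer, "generate_until")
from_fallback = resolve_metric_list(None, BareScorer, "multiple_choice")
```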
Repeats and reduction¶
When `repeats > 1`, the model runs on each document multiple times (useful for self-consistency or sampling-based evaluation). Each repeat produces a separate response, and the reduction function collapses them into one score per document.
The default reduction is take_first — only the first repeat is scored. For self-consistency patterns, combine repeats with filter pipelines that do majority voting (see Filters).
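To illustrate the self-consistency pattern, the sketch below takes the most common answer across repeats before scoring. It is plain Python with hypothetical data, not the harness's majority-vote filter.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical extracted answers from five repeats on one document.
repeats = ["15", "12", "12", "9", "12"]
gold = "12"

voted = majority_vote(repeats)                     # "12"
score_take_first = float(repeats[0] == gold)       # 0.0
score_majority = float(voted == gold)              # 1.0
```

Here `take_first` scores the document as wrong because the first sample happened to be off, while majority voting recovers the answer most samples agree on.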
Per-pipeline metrics¶
When using multiple filter pipelines, each pipeline can have its own metrics:
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\d+)"
      - function: "take_first"
    metric_list:
      - metric: exact_match
  - name: "flexible-match"
    filter:
      - function: "take_first"
    metric_list:
      - metric: exact_match
        ignore_case: true
If a pipeline doesn't specify metric_list, it inherits the task-level metric_list.
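The inheritance rule amounts to a per-pipeline lookup with the task-level list as the default. A minimal sketch, using parsed-YAML dicts with hypothetical contents:

```python
# Task-level metric_list, used by any pipeline that does not define its own.
task_metric_list = [{"metric": "exact_match"}]

# Parsed filter_list: the second pipeline has no metric_list of its own.
filter_list = [
    {"name": "strict-match",
     "metric_list": [{"metric": "exact_match", "ignore_case": True}]},
    {"name": "flexible-match"},
]

# Resolve the metrics each pipeline will be scored with.
resolved = {
    pipeline["name"]: pipeline.get("metric_list", task_metric_list)
    for pipeline in filter_list
}
```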
Custom metrics¶
For metrics not built into the harness, you can use !function to reference a Python function:
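As an illustration of the Python side, the sketch below defines a token-level F1 function that could live in a hypothetical `utils.py` next to the task YAML and be referenced with something like `!function utils.token_f1`. The module name, function name, and the `(prediction, reference)` signature are all assumptions; consult Custom Metrics & Filters for the signature the harness actually expects.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (whitespace tokens, case-insensitive)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("the cat sat", "the cat")  # precision 2/3, recall 1 -> 0.8
```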
For more details on implementing custom metrics, including the Metric[_T, _K] generic class and registration, see Custom Metrics & Filters.