lm_eval.scorers¶
Scoring pipeline components.
Attributes¶
ReducedDoc
module-attribute
¶
Per-document reduced result: {metric_name: scalar_value}.
The doc_id is the key in the containing dict[int, ReducedDoc].
Created by Scorer.reduce from a ScoredDoc, or directly
by Scorer.import_reduced after a distributed gather.
__all__
module-attribute
¶
__all__ = ['ChoiceMatchScorer', 'FirstTokenScorer', 'GenScorer', 'LLScorer', 'MetricKey', 'ReducedDoc', 'RegexExtractionScorer', 'ScoredDoc', 'Scorer', 'build_scorer']
Classes¶
GenScorer
dataclass
¶
GenScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Bases: Scorer
Scorer for generate_until tasks.
Extensibility hooks (from simplest to most control)
- Tier 1, config-only: set the default_filter_cfg and/or default_metric_cfg class variables. No scoring code needed.
- Tier 2, per-doc scoring: override score to define custom scoring as (reference, predictions) → {metric: [scores]}.
- Full control: override score_instances for batch scoring (e.g. batched LLM judge calls, code sandbox pools). Use _extract_inputs to pull (reference, predictions, metric_kwargs) from each document's instances.
The default call chain is:
score_instances() → score_doc() → score()
Example
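```python
from dataclasses import dataclass

from lm_eval.scorers import GenScorer, ScoredDoc

# register_scorer and batch_judge (a user-supplied batched judge call)
# are assumed to be in scope.

@register_scorer("ai_judge")
@dataclass
class AIJudgeScorer(GenScorer):
    judge_model: str = "claude-sonnet-4-6"

    def score_instances(self, instances):
        # Pull (reference, predictions, metric_kwargs) for every document.
        inputs = {
            did: self._extract_inputs(insts)
            for did, insts in instances.items()
        }
        # Judge the first prediction of each document in a single batch.
        ratings = batch_judge(
            self.judge_model,
            {did: (ref, preds[0]) for did, (ref, preds, _) in inputs.items()},
        )
        return {
            did: ScoredDoc(
                doc_id=did, reference=ref, scores={"judge": [ratings[did]]}
            )
            for did, (ref, preds, _) in inputs.items()
        }
```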
Functions¶
score_doc
¶
score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc
Extract inputs from a document's instances and delegate to score.
Source code in lm_eval/scorers/_base.py
score
¶
score(reference: str | list[str], predictions: list[str], metric_kwargs: dict[str, Any] | None = None) -> dict[str, list[float]]
Per-document scoring. Override for custom generation scoring.
This is the simplest hook. Receives clean inputs and returns
metric scores — no need to work with Instance or ScoredDoc.
| PARAMETER | DESCRIPTION |
|---|---|
| reference | The gold answer(s). TYPE: str \| list[str] |
| predictions | Model predictions (one per repeat). TYPE: list[str] |
| metric_kwargs | Optional per-instance metric overrides. TYPE: dict[str, Any] \| None |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, list[float]] | Metric scores as {metric: [scores]}. |
Example
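As a rough illustration of this hook, the sketch below writes the scoring logic as a standalone function with the same parameters as score, so it runs without lm_eval installed. In a real scorer the body would live in the score method of a GenScorer subclass. The name exact_match_score and the ignore_case key in metric_kwargs are hypothetical and chosen only for this example.

```python
from typing import Any, Optional, Union


def exact_match_score(
    reference: Union[str, list[str]],
    predictions: list[str],
    metric_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, list[float]]:
    """Stand-in for a GenScorer.score override: one score per repeat."""
    golds = [reference] if isinstance(reference, str) else list(reference)
    # "ignore_case" is a hypothetical per-instance override.
    ignore_case = bool((metric_kwargs or {}).get("ignore_case", False))

    def norm(text: str) -> str:
        text = text.strip()
        return text.lower() if ignore_case else text

    gold_set = {norm(g) for g in golds}
    # The list has one entry per prediction, matching {metric: [scores]}.
    return {"exact_match": [float(norm(p) in gold_set) for p in predictions]}


scores = exact_match_score("Paris", [" paris ", "Rome"], {"ignore_case": True})
```

Returning one score per prediction keeps the per-repeat values intact, so the later reduce step can decide how to collapse them.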
Source code in lm_eval/scorers/_base.py
LLScorer
dataclass
¶
LLScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Bases: Scorer
Scorer for loglikelihood / loglikelihood_rolling / multiple_choice tasks.
Repeats are always 1. The scalar metric result is wrapped in a
single-element list so that the downstream reduce step works
uniformly with GenScorer.
Functions¶
score_doc
¶
score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc
Source code in lm_eval/scorers/_base.py
Scorer
dataclass
¶
Scorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Base scorer defining the filter → score → reduce → aggregate pipeline.
For generation tasks, subclass GenScorer, which offers two tiers of extensibility (from simplest to most control):
- Config: set the default_filter_cfg / default_metric_cfg class variables. No scoring code needed.
- Per-doc: override GenScorer.score(reference, predictions) to return {metric: [scores]}. No Instance knowledge needed.
For full control (e.g. batch scoring), override score_instances.
Filter / metric precedence (highest → lowest):
- Explicit cfg["filter"] / cfg["metric_list"] passed to from_dict
- cls.default_filter_cfg / cls.default_metric_cfg
- Hardcoded fallback (noop / global_metrics)
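The precedence order above can be pictured as a first-match lookup. The following is a minimal sketch, not the library's code: resolve_cfg is a hypothetical helper, None is assumed to mean "not set", and the filter dicts are illustrative values only.

```python
from typing import Any, Optional


def resolve_cfg(
    explicit: Optional[list[Any]],
    class_default: Optional[list[Any]],
    fallback: list[Any],
) -> list[Any]:
    """Return the first configured value, highest precedence first."""
    for candidate in (explicit, class_default):
        # Assumption: None marks an unset level; an empty list still counts.
        if candidate is not None:
            return candidate
    return fallback


# An explicit cfg["filter"] wins over the class default and the fallback.
filters = resolve_cfg(
    explicit=[{"function": "take_first"}],
    class_default=[{"function": "regex", "regex_pattern": "..."}],
    fallback=[{"function": "noop"}],
)
```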
Attributes¶
default_filter_cfg
class-attribute
¶
default_filter_cfg: list[dict[str, Any] | type[Filter]] | None = None
default_metric_cfg
class-attribute
¶
default_metric_cfg: list[dict[str, Any] | Metric] | None = None
raw_docs
property
¶
raw_docs: Mapping[int, ScoredDoc]
Per-document raw scoring results (pre-reduction).
Empty after import_reduced, because raw scores only exist on the rank that performed scoring.
reduced_docs
property
¶
reduced_docs: Mapping[int, ReducedDoc]
Per-document reduced results (post-reduction), ready for aggregation.
higher_is_better
property
¶
Return {metric_name: bool} for all metrics in this scorer.
Functions¶
from_dict
classmethod
¶
Build a Scorer from a normalised pipeline config.
cfg is a _ScorerCfg produced by
TaskConfig._normalize_scoring_config():
```python
{
"name": "strict-match",
"filter": [
{"function": "regex", "regex_pattern": "..."},
{"function": "take_first"},
],
"metric_list": [
{"metric": "exact_match", "aggregation": "mean", ...},
],
}
```
output_type is used as a last-resort fallback when neither the
config nor the class provides metrics (see
DEFAULT_METRIC_REGISTRY).
Any extra kwargs are forwarded to the constructor (e.g., custom dataclass fields on scorer subclasses).
Source code in lm_eval/scorers/_base.py
default_scorer
classmethod
¶
Build the default scorer (no explicit config).
Filter defaults to cls.default_filter_cfg if set, otherwise
noop.
Source code in lm_eval/scorers/_base.py
apply_filter
¶
score_instances
¶
score_instances(instances: Mapping[int, list[Instance]]) -> dict[int, ScoredDoc]
Score all documents' instances, returning a ScoredDoc per document.
Delegates per-document scoring to score_doc, which subclasses must implement.
Source code in lm_eval/scorers/_base.py
score_doc
¶
score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc
Score a single document's instances. Subclasses must implement this.
Override this method to define custom scoring logic. Use _dispatch_metrics as a helper to run configured metrics, or compute scores directly and return a ScoredDoc.
Source code in lm_eval/scorers/_base.py
reduce
¶
reduce(scored_docs: dict[int, ScoredDoc]) -> dict[int, ReducedDoc]
Reduce per-doc list[T] → T for each document.
Pure function: takes ScoredDoc objects (immutable raw scores)
and returns {doc_id: {metric: scalar}} dicts ready for aggregation.
For each metric in each document:
- Single value: passed through as-is (no reduction needed).
- Multiple values + reduction fn: calls Metric.reduction(reference, values). If the reduction returns a dict, composite keys like "pass@1(metric)" are created.
- No reduction fn: warns and takes the first value.
Source code in lm_eval/scorers/_base.py
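The three reduction cases described for reduce can be sketched for a single document as follows. This is an approximation under stated assumptions rather than the library's implementation: reduce_doc and pass_at_1 are hypothetical names, and reduction functions are passed in as a plain dict instead of being read from Metric objects.

```python
import warnings
from typing import Any, Callable, Optional

Reduction = Callable[[Any, list[float]], Any]


def reduce_doc(
    reference: Any,
    scores: dict[str, list[float]],
    reductions: dict[str, Optional[Reduction]],
) -> dict[str, float]:
    """Collapse per-repeat score lists into one scalar per metric key."""
    reduced: dict[str, float] = {}
    for metric, values in scores.items():
        if len(values) == 1:
            # Single value: passed through as-is.
            reduced[metric] = values[0]
            continue
        fn = reductions.get(metric)
        if fn is None:
            # No reduction function: warn and keep the first value.
            warnings.warn(f"no reduction for {metric!r}; using first value")
            reduced[metric] = values[0]
            continue
        result = fn(reference, values)
        if isinstance(result, dict):
            # Dict result: one composite key per entry, e.g. "pass@1(metric)".
            for sub_name, value in result.items():
                reduced[f"{sub_name}({metric})"] = value
        else:
            reduced[metric] = result
    return reduced


def pass_at_1(reference: Any, values: list[float]) -> dict[str, float]:
    # Hypothetical reduction: mean over repeats, reported under "pass@1".
    return {"pass@1": sum(values) / len(values)}


out = reduce_doc(
    "42",
    {"exact_match": [1.0, 0.0, 1.0, 1.0], "judge": [0.5]},
    {"exact_match": pass_at_1},
)
```

Here out holds one composite key for exact_match and the single judge value unchanged.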
aggregate
¶
aggregate(reduced_docs: Mapping[int, ReducedDoc], bootstrap_iters: int | None = 100000, aggregation_overrides: dict[str, Any] | None = None) -> tuple[dict[str, Any], int]
Aggregate reduced docs and compute stderr.
Pure function: takes {doc_id: {metric: value}} and produces
aggregated "metric,scorer" keyed results. When
aggregation_overrides is supplied (legacy Python tasks that override
Task.aggregation()), those functions take precedence over the
mean fallback for metrics not covered by a Metric object.
Returns (agg_metrics, sample_len) where keys are in
"metric,{self.name}" / "metric_stderr,{self.name}" format.
Source code in lm_eval/scorers/_base.py
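To make the "metric,{self.name}" key format concrete, here is a small sketch of a mean aggregation over reduced docs. It is only an approximation: aggregate_mean is a hypothetical function, the standard error is the analytic standard error of the mean (the real method takes bootstrap_iters and may resample instead), and aggregation_overrides are not modelled.

```python
import math
from statistics import mean, stdev
from typing import Any


def aggregate_mean(
    reduced_docs: dict[int, dict[str, float]], scorer_name: str
) -> tuple[dict[str, Any], int]:
    """Mean-aggregate each metric and key the result as "metric,scorer"."""
    by_metric: dict[str, list[float]] = {}
    for doc in reduced_docs.values():
        for metric, value in doc.items():
            by_metric.setdefault(metric, []).append(value)

    agg: dict[str, Any] = {}
    for metric, values in by_metric.items():
        agg[f"{metric},{scorer_name}"] = mean(values)
        # Assumption: analytic stderr of the mean; None with fewer than 2 docs.
        stderr = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else None
        agg[f"{metric}_stderr,{scorer_name}"] = stderr
    return agg, len(reduced_docs)


agg, sample_len = aggregate_mean(
    {0: {"exact_match": 1.0}, 1: {"exact_match": 0.0}}, "strict-match"
)
```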
export_reduced
¶
export_reduced() -> dict[int, ReducedDoc]
Export {doc_id: {metric: value}} for distributed gathering.
Since ReducedDoc is a plain dict[str, float], this is a
shallow copy. Merge across ranks is a simple dict.update
since doc IDs are unique per rank.
Source code in lm_eval/scorers/_base.py
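The merge step mentioned for export_reduced can be sketched as below. The gather itself is not shown; per_rank stands for whatever list of exported dicts the gather returned, and merge_ranks is a hypothetical helper name.

```python
def merge_ranks(
    per_rank: list[dict[int, dict[str, float]]],
) -> dict[int, dict[str, float]]:
    """Merge the reduced docs exported by each rank into one mapping."""
    merged: dict[int, dict[str, float]] = {}
    for rank_docs in per_rank:
        # Doc IDs are unique per rank, so update() never overwrites an entry.
        merged.update(rank_docs)
    return merged


rank0 = {0: {"exact_match": 1.0}, 2: {"exact_match": 0.0}}
rank1 = {1: {"exact_match": 1.0}}
merged = merge_ranks([rank0, rank1])
```

The merged mapping is the kind of value that would then be handed to import_reduced.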
import_reduced
¶
import_reduced(doc_data: dict[int, ReducedDoc]) -> None
Import merged results after distributed gather.
Raw scores are not available after import (they live on the source ranks).
Source code in lm_eval/scorers/_base.py
MetricKey
dataclass
¶
Structured representation of a "metric,scorer" key.
Attributes¶
parent_metric
property
¶
Extract parent from composite names: 'pass@1(exact_match)' → 'exact_match'.
Functions¶
__str__
¶
parse
classmethod
¶
parse(key: str) -> MetricKey | None
Parse a 'metric,scorer' string. Returns None if not a metric key.
Source code in lm_eval/scorers/_types.py
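The behaviour described for parse and parent_metric can be approximated with a small stand-in class. KeySketch is hypothetical and is not the library's MetricKey; in particular, the rule that a metric key has exactly one comma with two non-empty parts is an assumption, since the page does not say what makes a string "not a metric key".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeySketch:
    """Hypothetical stand-in for MetricKey, not the library class."""

    metric: str
    scorer: str

    def __str__(self) -> str:
        return f"{self.metric},{self.scorer}"

    @property
    def parent_metric(self) -> str:
        # 'pass@1(exact_match)' -> 'exact_match'; plain names are unchanged.
        match = re.fullmatch(r"[^()]+\((.+)\)", self.metric)
        return match.group(1) if match else self.metric

    @classmethod
    def parse(cls, key: str) -> Optional["KeySketch"]:
        # Assumption: exactly one comma separating two non-empty parts.
        metric, sep, scorer = key.partition(",")
        if not sep or not metric or not scorer or "," in scorer:
            return None
        return cls(metric, scorer)


key = KeySketch.parse("pass@1(exact_match),strict-match")
```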
ScoredDoc
dataclass
¶
Immutable per-document raw scoring result.
Created by score_doc() / score_instances(). Contains per-repeat
values that haven't been reduced yet. After reduction, a
ReducedDoc is produced — ScoredDoc itself is never mutated.
ChoiceMatchScorer
dataclass
¶
ChoiceMatchScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Bases: GenScorer
Inheritance: Scorer → GenScorer → ChoiceMatchScorer
Scorer for free-form generation scored by exact match against choices.
FirstTokenScorer
dataclass
¶
FirstTokenScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Bases: GenScorer
Inheritance: Scorer → GenScorer → FirstTokenScorer
Scorer that strips whitespace before matching (single-token extraction).
RegexExtractionScorer
dataclass
¶
RegexExtractionScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())
Bases: GenScorer
Inheritance: Scorer → GenScorer → RegexExtractionScorer
Scorer that applies regex extraction.
Functions¶
build_scorer
¶
build_scorer(cfg: _ScorerCfg | None = None, output_type: str | None = None, scorer_type: str | ScorerConfig | None = None) -> Scorer
Construct the appropriate scorer subclass.
cfg is a _ScorerCfg (normalised pipeline config from
TaskConfig.filter_list).
scorer_type can be:
- str: scorer name, resolved from the scorer registry (e.g. "first_token" → FirstTokenScorer).
- ScorerConfig dict: {"type": "scorer_name", ...kwargs}, where extra keys are forwarded to the scorer constructor as kwargs.
- None: fall back to GenScorer / LLScorer based on output_type.
Metrics are resolved inside Scorer.from_dict with a 3-tier
precedence: cfg > scorer class default > DEFAULT_METRIC_REGISTRY.
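The three accepted forms of scorer_type can be normalised into a name plus constructor kwargs, roughly as sketched here. split_scorer_type is a hypothetical helper written for illustration; the registry lookup and the output_type fallback performed by build_scorer are not reproduced.

```python
from typing import Any, Optional, Union


def split_scorer_type(
    scorer_type: Union[str, dict[str, Any], None],
) -> tuple[Optional[str], dict[str, Any]]:
    """Normalise the three accepted forms into (scorer_name, extra_kwargs)."""
    if scorer_type is None:
        # Caller falls back to GenScorer / LLScorer based on output_type.
        return None, {}
    if isinstance(scorer_type, str):
        # Plain registry name, no extra constructor kwargs.
        return scorer_type, {}
    # ScorerConfig dict: "type" names the scorer, the rest are kwargs.
    extra = {k: v for k, v in scorer_type.items() if k != "type"}
    return scorer_type["type"], extra


name, kwargs = split_scorer_type({"type": "ai_judge", "judge_model": "my-judge"})
```

In this example name is "ai_judge" and kwargs carries judge_model, mirroring how extra keys would reach a custom dataclass field on a scorer subclass.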