lm_eval.scorers¶

Scoring pipeline components.

scorers ¶

Attributes¶

ReducedDoc `module-attribute` ¶

ReducedDoc: TypeAlias = dict[str, float]

Per-document reduced result: {metric_name: scalar_value}.

The doc_id is the key in the containing dict[int, ReducedDoc]. Created by Scorer.reduce from a ScoredDoc, or directly by Scorer.import_reduced after a distributed gather.

__all__ `module-attribute` ¶

__all__ = ['ChoiceMatchScorer', 'FirstTokenScorer', 'GenScorer', 'LLScorer', 'MetricKey', 'ReducedDoc', 'RegexExtractionScorer', 'ScoredDoc', 'Scorer', 'build_scorer']

Classes¶

GenScorer `dataclass` ¶

GenScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Bases: Scorer



Scorer for generate_until tasks.

Extensibility hooks (from simplest to most control)

Tier 1 — Config-only: Set default_filter_cfg and/or default_metric_cfg class variables. No scoring code needed.

Tier 2 — Per-doc scoring: Override score to define custom scoring as (reference, predictions) → {metric: [scores]}.

Full control: Override score_instances for batch scoring (e.g. batched LLM judge calls, code sandbox pools). Use _extract_inputs to pull (reference, predictions, metric_kwargs) from each document's instances.

The default call chain is: score_instances() → score_doc() → score()

Example

@register_scorer("ai_judge")
@dataclass
class AIJudgeScorer(GenScorer):
    judge_model: str = "claude-sonnet-4-6"

    def score_instances(self, instances):
        inputs = {did: self._extract_inputs(insts)
                  for did, insts in instances.items()}
        ratings = batch_judge(self.judge_model,
            {did: (ref, preds[0])
             for did, (ref, preds, _) in inputs.items()})
        return {
            did: ScoredDoc(
                doc_id=did, reference=ref,
                scores={"judge": [ratings[did]]})
            for did, (ref, preds, _) in inputs.items()
        }
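
The default call chain can be sketched with stand-in classes. This is a minimal illustration, not the library's implementation: `ScoredDocSketch` and `GenScorerSketch` are hypothetical names, and each "instance" is simplified to a `(reference, prediction)` pair instead of a real `Instance`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ScoredDocSketch:
    """Hypothetical stand-in for ScoredDoc."""
    doc_id: int
    reference: Any
    scores: dict


class GenScorerSketch:
    """Hypothetical stand-in that mirrors the documented call chain."""

    def score(self, reference, predictions, metric_kwargs=None):
        # Per-doc hook: clean inputs in, {metric: [score_per_repeat]} out.
        return {"exact_match": [float(p == reference) for p in predictions]}

    def score_doc(self, doc_id, doc_instances):
        # Assumption: each instance is a (reference, prediction) pair here.
        reference = doc_instances[0][0]
        predictions = [prediction for _, prediction in doc_instances]
        return ScoredDocSketch(doc_id, reference, self.score(reference, predictions))

    def score_instances(self, instances):
        # Batch hook: by default it just loops over documents.
        return {
            doc_id: self.score_doc(doc_id, doc_instances)
            for doc_id, doc_instances in instances.items()
        }


scored = GenScorerSketch().score_instances(
    {0: [("4", "4"), ("4", "5")], 1: [("yes", "yes")]}
)
```

Overriding only `score` keeps the other two methods untouched, which is why it is the lightest code-level hook; overriding `score_instances` replaces the loop itself.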

Functions¶

score_doc ¶

score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc

Extract inputs from a document's instances and delegate to score.

Source code in lm_eval/scorers/_base.py

def score_doc(self, doc_id: int, doc_instances: list[Instance]) -> ScoredDoc:
    """Extract inputs from a document's instances and delegate to [score][.score]."""
    ref, preds, mkw = self._extract_inputs(doc_instances)
    return ScoredDoc(
        doc_id=doc_id,
        reference=ref,
        scores=self.score(ref, preds, metric_kwargs=mkw),
    )

score ¶

score(reference: str | list[str], predictions: list[str], metric_kwargs: dict[str, Any] | None = None) -> dict[str, list[float]]

Per-document scoring. Override for custom generation scoring.

This is the simplest hook. Receives clean inputs and returns metric scores — no need to work with Instance or ScoredDoc.

PARAMETER	DESCRIPTION
`reference`	The gold answer(s). TYPE: `str \| list[str]`
`predictions`	Model predictions (one per repeat). TYPE: `list[str]`
`metric_kwargs`	Optional per-instance metric overrides. TYPE: `dict[str, Any] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict[str, list[float]]`	`{metric_name: [score_per_repeat]}`.

Example

@register_scorer("code_exec")
@dataclass
class CodeExecScorer(GenScorer):
    timeout: int = 10

    def score(self, reference, predictions, **kwargs):
        return {
            "pass": [
                1.0 if run(code, self.timeout) == reference else 0.0
                for code in predictions
            ]
        }

Source code in lm_eval/scorers/_base.py

def score(
    self,
    reference: str | list[str],
    predictions: list[str],
    metric_kwargs: dict[str, Any] | None = None,
) -> dict[str, list[float]]:
    """Per-document scoring.  Override for custom generation scoring.

    This is the simplest hook.  Receives clean inputs and returns
    metric scores — no need to work with ``Instance`` or ``ScoredDoc``.

    Args:
        reference: The gold answer(s).
        predictions: Model predictions (one per repeat).
        metric_kwargs: Optional per-instance metric overrides.

    Returns:
        ``{metric_name: [score_per_repeat]}``.

    Example:
        ```python
        @register_scorer("code_exec")
        @dataclass
        class CodeExecScorer(GenScorer):
            timeout: int = 10

            def score(self, reference, predictions, **kwargs):
                return {
                    "pass": [
                        1.0 if run(code, self.timeout) == reference else 0.0
                        for code in predictions
                    ]
                }
        ```
    """
    return self._dispatch_metrics(
        [reference], predictions, metric_kwargs=metric_kwargs
    )

LLScorer `dataclass` ¶

LLScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Bases: Scorer



Scorer for loglikelihood / loglikelihood_rolling / multiple_choice tasks.

Repeats are always 1. The scalar metric result is wrapped in a single-element list so that the downstream reduce step works uniformly with GenScorer.
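
A minimal sketch of that wrapping step, assuming per-document metric results arrive as a plain `{metric_name: scalar}` dict. The helper name `wrap_scalars` and the metric names are illustrative only.

```python
def wrap_scalars(per_doc):
    """Wrap each scalar in a one-element list, matching the
    {metric: [score_per_repeat]} shape that generation scoring produces."""
    return {metric_name: [value] for metric_name, value in per_doc.items()}


wrapped = wrap_scalars({"acc": 1.0, "acc_norm": 0.0})
```

Because every list has length 1, the reduce step passes each value through unchanged.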

Functions¶

score_doc ¶

score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc

Source code in lm_eval/scorers/_base.py

def score_doc(self, doc_id: int, doc_instances: list[Instance]) -> ScoredDoc:
    from lm_eval.api.metrics.results import LLResults

    metric_kwargs = doc_instances[0].metadata.get("metric_kwargs")
    results_obj = LLResults.from_instances(doc_instances, self.name)
    references = results_obj.targets
    per_doc = self._dispatch_metrics(
        references, results_obj, metric_kwargs=metric_kwargs
    )
    return ScoredDoc(
        doc_id=doc_id,
        reference=references,
        scores={mn: [v] for mn, v in per_doc.items()},
    )

Scorer `dataclass` ¶

Scorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Base scorer defining the filter → score → reduce → aggregate pipeline.

For generation tasks, subclass GenScorer which offers two tiers of extensibility (from simplest to most control):

Config — set default_filter_cfg / default_metric_cfg class variables. No scoring code needed.
Per-doc — override GenScorer.score(reference, predictions) to return {metric: [scores]}. No Instance knowledge needed.

For full control (e.g. batch scoring), override score_instances.

Filter / metric precedence (highest → lowest):

1. Explicit cfg["filter"] / cfg["metric_list"] passed to from_dict
2. cls.default_filter_cfg / cls.default_metric_cfg
3. Hardcoded fallback (noop / global_metrics)
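
The precedence above can be sketched as a small resolver. This is an illustration under assumptions: `resolve_cfg` is a hypothetical helper, and an empty list is treated as "not provided", which matches how default_scorer passes empty `filter` / `metric_list` entries and still picks up class defaults.

```python
def resolve_cfg(explicit, class_default, fallback):
    """Return the first configured source: explicit > class default > fallback."""
    if explicit:  # assumption: None and [] both mean "not provided"
        return explicit
    if class_default:
        return class_default
    return fallback


noop = [{"function": "noop"}]
class_default = [{"function": "regex"}]
explicit = [{"function": "take_first"}]

from_explicit = resolve_cfg(explicit, class_default, noop)
from_class = resolve_cfg([], class_default, noop)
from_fallback = resolve_cfg(None, None, noop)
```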

Attributes¶

default_filter_cfg `class-attribute` ¶

default_filter_cfg: list[dict[str, Any] | type[Filter]] | None = None

default_metric_cfg `class-attribute` ¶

default_metric_cfg: list[dict[str, Any] | Metric] | None = None

name `instance-attribute` ¶

name: str

filter `instance-attribute` ¶

filter: FilterEnsemble

metrics `class-attribute` `instance-attribute` ¶

metrics: list[Metric] | None = None

context `class-attribute` `instance-attribute` ¶

context: dict[str, Any] = field(default_factory=dict)

raw_docs `property` ¶

raw_docs: Mapping[int, ScoredDoc]

Per-document raw scoring results (pre-reduction).

Empty after import_reduced: raw scores only exist on the rank that performed scoring.

reduced_docs `property` ¶

reduced_docs: Mapping[int, ReducedDoc]

Per-document reduced results (post-reduction), ready for aggregation.

higher_is_better `property` ¶

higher_is_better: dict[str, bool]

Return {metric_name: bool} for all metrics in this scorer.

Functions¶

from_dict `classmethod` ¶

from_dict(cfg: _ScorerCfg, *, output_type: str | None = None, **kwargs: Any) -> Self

Build a Scorer from a normalised pipeline config.

cfg is a _ScorerCfg produced by TaskConfig._normalize_scoring_config():

```python
{
    "name": "strict-match",
    "filter": [
        {"function": "regex", "regex_pattern": "..."},
        {"function": "take_first"},
    ],
    "metric_list": [
        {"metric": "exact_match", "aggregation": "mean", ...},
    ],
}
```

output_type is used as a last-resort fallback when neither the config nor the class provides metrics (see DEFAULT_METRIC_REGISTRY).

Any extra kwargs are forwarded to the constructor (e.g., custom dataclass fields on scorer subclasses).

Source code in lm_eval/scorers/_base.py

@classmethod
def from_dict(
    cls,
    cfg: _ScorerCfg,
    *,
    output_type: str | None = None,
    **kwargs: Any,
) -> Self:
    """Build a Scorer from a normalised pipeline config.

    *cfg* is a [_ScorerCfg][_ScorerCfg] produced by
    ``TaskConfig._normalize_scoring_config()``:

        ```python
        {
            "name": "strict-match",
            "filter": [
                {"function": "regex", "regex_pattern": "..."},
                {"function": "take_first"},
            ],
            "metric_list": [
                {"metric": "exact_match", "aggregation": "mean", ...},
            ],
        }
        ```

    *output_type* is used as a last-resort fallback when neither the
    config nor the class provides metrics (see
    ``DEFAULT_METRIC_REGISTRY``).

    Any extra *kwargs* are forwarded to the constructor (e.g., custom
    dataclass fields on scorer subclasses).
    """
    name = cfg.get("name", "none")
    return cls(
        name=name,
        filter=cls._build_filter(name, cfg),
        metrics=cls._build_metrics(cfg, output_type=output_type),
        **kwargs,
    )

default_scorer `classmethod` ¶

default_scorer(name: str = 'none', **kwargs: Any) -> Self

Build the default scorer (no explicit config).

Filter defaults to cls.default_filter_cfg if set, otherwise noop.

Source code in lm_eval/scorers/_base.py

@classmethod
def default_scorer(cls, name: str = "none", **kwargs: Any) -> Self:
    """Build the default scorer (no explicit config).

    Filter defaults to ``cls.default_filter_cfg`` if set, otherwise
    ``noop``.
    """
    return cls.from_dict({"name": name, "filter": [], "metric_list": []}, **kwargs)

apply_filter ¶

apply_filter(instances: list[Instance]) -> None

Source code in lm_eval/scorers/_base.py

def apply_filter(self, instances: list[Instance]) -> None:
    self.filter.apply(instances)

score_instances ¶

score_instances(instances: Mapping[int, list[Instance]]) -> dict[int, ScoredDoc]

Score all documents' instances, returning a ScoredDoc per document.

Delegates per-document scoring to score_doc, which subclasses must implement.

Source code in lm_eval/scorers/_base.py

def score_instances(
    self, instances: Mapping[int, list[Instance]]
) -> dict[int, ScoredDoc]:
    """Score all documents' instances, returning a ``ScoredDoc`` per document.

    Delegates per-document scoring to [score_doc][.score_doc], which subclasses
    must implement.
    """
    return {
        doc_id: self.score_doc(doc_id, doc_instances)
        for doc_id, doc_instances in instances.items()
    }

score_doc ¶

score_doc(doc_id: int, doc_instances: list[Instance]) -> ScoredDoc

Score a single document's instances. Subclasses must implement this.

Override this method to define custom scoring logic. Use _dispatch_metrics as a helper to run configured metrics, or compute scores directly and return a ScoredDoc.

Source code in lm_eval/scorers/_base.py

def score_doc(self, doc_id: int, doc_instances: list[Instance]) -> ScoredDoc:
    """Score a single document's instances. Subclasses must implement this.

    Override this method to define custom scoring logic.  Use
    [_dispatch_metrics][._dispatch_metrics] as a helper to run configured metrics,
    or compute scores directly and return a [ScoredDoc][lm_eval.scorers.ScoredDoc].
    """
    raise NotImplementedError(
        f"{type(self).__name__} must implement score_doc(). "
        "Override score_doc() on your Scorer subclass, "
        "or subclass GenScorer and override score() for per-doc scoring, "
        "or override score_instances() for batch scoring."
    )

reduce ¶

reduce(scored_docs: dict[int, ScoredDoc]) -> dict[int, ReducedDoc]

Reduce per-doc list[T] → T for each document.

Pure function: takes ScoredDoc objects (immutable raw scores) and returns {doc_id: {metric: scalar}} dicts ready for aggregation.

For each metric in each document:

Single value: passed through as-is (no reduction needed).
Multiple values + reduction fn: calls Metric.reduction(reference, values). If the reduction returns a dict, composite keys like "pass@1(metric)" are created.
Multiple values, no reduction fn: raises ValueError asking for a reduction (e.g. take_first, pass@k, mean) in the metric config, or for repeats to be set to 1.
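
The per-metric rules can be sketched for a single document as follows. This is a simplified stand-in that follows the source listing on this page, where a metric with several values and no reduction raises `ValueError`; `reduce_doc` and `pass_at_1` are hypothetical names, and reductions are passed as a plain dict rather than looked up on `Metric` objects.

```python
from statistics import mean


def reduce_doc(reference, scores, reductions):
    """Collapse {metric: [per-repeat values]} into {metric: scalar}."""
    reduced = {}
    for metric, values in scores.items():
        if len(values) == 1:
            # Single value: pass through unchanged.
            reduced[metric] = values[0]
            continue
        reduction = reductions.get(metric)
        if reduction is None:
            raise ValueError(
                f"metric {metric!r} has {len(values)} values but no reduction"
            )
        result = reduction(reference, values)
        if isinstance(result, dict):
            # Dict result: emit composite keys such as "pass@1(exact_match)".
            for sub_name, sub_value in result.items():
                reduced[f"{sub_name}({metric})"] = sub_value
        else:
            reduced[metric] = result
    return reduced


def pass_at_1(reference, values):
    # Hypothetical reduction that returns a dict of sub-metrics.
    return {"pass@1": mean(values)}


reduced = reduce_doc(
    reference="4",
    scores={"exact_match": [1.0, 0.0], "length": [12.0]},
    reductions={"exact_match": pass_at_1},
)
```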

Source code in lm_eval/scorers/_base.py

def reduce(self, scored_docs: dict[int, ScoredDoc]) -> dict[int, ReducedDoc]:
    """Reduce per-doc ``list[T]`` → ``T`` for each document.

    Pure function: takes [ScoredDoc][lm_eval.scorers.ScoredDoc] objects (immutable raw scores)
    and returns ``{doc_id: {metric: scalar}}`` dicts ready for aggregation.

    For each metric in each document:

    * **Single value** — passed through as-is (no reduction needed).
    * **Multiple values + reduction fn** — calls ``Metric.reduction(reference, values)``.
      If the reduction returns a dict, composite keys like ``"pass@1(metric)"``
      are created.
    * **Multiple values, no reduction fn**: raises ``ValueError``.
    """
    metrics_by_name = self._metrics_by_name
    result: dict[int, ReducedDoc] = {}

    for sd in scored_docs.values():
        values_dict: ReducedDoc = {}
        for metric_name, score_list in sd.scores.items():
            if len(score_list) == 1:
                values_dict[metric_name] = score_list[0]
                continue
            m = metrics_by_name.get(metric_name)
            if m is not None and m.reduction is not None:
                res = m.reduction(sd.reference, score_list)
                if isinstance(res, dict):
                    for sub_metric_name, sub_value in res.items():
                        values_dict[f"{sub_metric_name}({metric_name})"] = sub_value
                else:
                    values_dict[metric_name] = res
            else:
                raise ValueError(
                    f"Metric '{metric_name}' in scorer '{self.name}' has "
                    f"{len(score_list)} values per document (repeats > 1) "
                    f"but no reduction function is configured. Set a "
                    f"reduction (e.g., 'take_first', 'pass@k', 'mean') in "
                    f"your metric config, or set repeats to 1."
                )

        result[sd.doc_id] = values_dict

    return result

aggregate ¶

aggregate(reduced_docs: Mapping[int, ReducedDoc], bootstrap_iters: int | None = 100000, aggregation_overrides: dict[str, Any] | None = None) -> tuple[dict[str, Any], int]

Aggregate reduced docs and compute stderr.

Pure function: takes {doc_id: {metric: value}} and produces aggregated "metric,scorer" keyed results. When aggregation_overrides is supplied (legacy Python tasks that override Task.aggregation()), those functions take precedence over the mean fallback for metrics not covered by a Metric object.

Returns (agg_metrics, sample_len) where keys are in "metric,{self.name}" / "metric_stderr,{self.name}" format.
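
A minimal sketch of the transpose-then-aggregate step, assuming every metric uses a mean aggregation and leaving out the stderr computation entirely. `aggregate_mean` is a hypothetical helper, not part of the library.

```python
from statistics import mean


def aggregate_mean(reduced_docs, scorer_name):
    """Transpose doc-first results to metric-first, then average each metric."""
    by_metric = {}
    for doc in reduced_docs.values():
        for metric, value in doc.items():
            by_metric.setdefault(metric, []).append(value)

    # Keys follow the "metric,scorer" format described above.
    agg = {
        f"{metric},{scorer_name}": mean(values)
        for metric, values in by_metric.items()
    }
    sample_len = max((len(values) for values in by_metric.values()), default=0)
    return agg, sample_len


agg, sample_len = aggregate_mean(
    {
        0: {"exact_match": 1.0},
        1: {"exact_match": 0.0},
        2: {"exact_match": 1.0},
        3: {"exact_match": 1.0},
    },
    scorer_name="strict-match",
)
```

A plain mean is only appropriate for per-document metrics; corpus-level metrics need their own aggregation function, which is why the source logs an error when it has to fall back to mean.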

Source code in lm_eval/scorers/_base.py

def aggregate(
    self,
    reduced_docs: Mapping[int, ReducedDoc],
    bootstrap_iters: int | None = 100000,
    aggregation_overrides: dict[str, Any] | None = None,
) -> tuple[dict[str, Any], int]:
    """Aggregate reduced docs and compute stderr.

    Pure function: takes ``{doc_id: {metric: value}}`` and produces
    aggregated ``"metric,scorer"`` keyed results.  When
    *aggregation_overrides* is supplied (legacy Python tasks that override
    ``Task.aggregation()``), those functions take precedence over the
    ``mean`` fallback for metrics not covered by a ``Metric`` object.

    Returns ``(agg_metrics, sample_len)`` where keys are in
    ``"metric,{self.name}"`` / ``"metric_stderr,{self.name}"`` format.
    """
    from lm_eval.api.metrics import mean, stderr_for_metric

    # Transpose doc-first → metric-first: {metric: [values]}
    results: dict[str, list[float]] = {}
    for rd in reduced_docs.values():
        for mn, val in rd.items():
            results.setdefault(mn, []).append(val)

    agg: dict[str, Any] = {}
    sample_len = 0
    metrics_by_name = self._metrics_by_name

    for metric_name, values in results.items():
        if not values:
            continue
        sample_len = max(sample_len, len(values))

        # Resolve metric object (check parent for composite keys like "pass@1(exact_match)")
        m = metrics_by_name.get(metric_name)
        if m is None and (
            parent := MetricKey(metric_name, self.name).parent_metric
        ):
            m = metrics_by_name.get(parent)

        # Resolve aggregation function: metric > legacy override > mean fallback
        if m is not None and m.aggregation is not None:
            agg_fn = m.aggregation
        elif aggregation_overrides and metric_name in aggregation_overrides:
            agg_fn = aggregation_overrides[metric_name]
        else:
            eval_logger.error(
                "No aggregation function for metric '%s' in scorer '%s'. "
                "Defaulting to 'mean'. WARNING: this will produce INCORRECT "
                "results for corpus-level metrics (BLEU, perplexity, F1, etc.). "
                "Set 'aggregation' explicitly in your metric config.",
                metric_name,
                self.name,
            )
            agg_fn = mean

        # Aggregate + stderr
        key = str(MetricKey(metric_name, self.name))
        agg[key] = agg_fn(values)

        stderr_key = str(MetricKey(metric_name, self.name, is_stderr=True))
        if isinstance(bootstrap_iters, int) and bootstrap_iters > 0:
            stderr_fn = stderr_for_metric(
                metric=agg_fn, bootstrap_iters=bootstrap_iters
            )
            agg[stderr_key] = (
                stderr_fn(values) if (stderr_fn and len(values) > 1) else "N/A"
            )
        else:
            agg[stderr_key] = "N/A"

    return agg, sample_len

export_reduced ¶

export_reduced() -> dict[int, ReducedDoc]

Export {doc_id: {metric: value}} for distributed gathering.

Since ReducedDoc is a plain dict[str, float], this is a shallow copy. Merge across ranks is a simple dict.update since doc IDs are unique per rank.
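
The shallow-copy and merge behaviour can be illustrated with plain dicts. This is a sketch: `merge_exports` is a hypothetical helper, and its overlap check is an added safeguard, not something the source is stated to perform.

```python
def merge_exports(per_rank_exports):
    """Merge {doc_id: ReducedDoc} dicts gathered from several ranks."""
    merged = {}
    for exported in per_rank_exports:
        overlap = merged.keys() & exported.keys()
        if overlap:
            # Assumption: doc IDs are unique per rank, so overlap is a bug.
            raise ValueError(f"doc IDs seen on more than one rank: {sorted(overlap)}")
        merged.update(exported)
    return merged


rank0_reduced = {0: {"exact_match": 1.0}, 2: {"exact_match": 0.0}}
rank1_reduced = {1: {"exact_match": 1.0}}

# dict(...) copies the outer mapping only; the inner dicts are shared.
exported = dict(rank0_reduced)
merged = merge_exports([exported, dict(rank1_reduced)])
```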

Source code in lm_eval/scorers/_base.py

def export_reduced(self) -> dict[int, ReducedDoc]:
    """Export ``{doc_id: {metric: value}}`` for distributed gathering.

    Since ``ReducedDoc`` is a plain ``dict[str, float]``, this is a
    shallow copy.  Merge across ranks is a simple ``dict.update``
    since doc IDs are unique per rank.
    """
    return dict(self._reduced_docs)

import_reduced ¶

import_reduced(doc_data: dict[int, ReducedDoc]) -> None

Import merged results after distributed gather.

Raw scores are not available after import (they live on the source ranks).

Source code in lm_eval/scorers/_base.py

def import_reduced(self, doc_data: dict[int, ReducedDoc]) -> None:
    """Import merged results after distributed gather.

    Raw scores are not available after import (they live on the
    source ranks).
    """
    self._raw_docs = {}
    self._reduced_docs = dict(doc_data)

set_results ¶

set_results(scored_docs: dict[int, ScoredDoc]) -> None

Store raw scored documents and compute reduction.

Source code in lm_eval/scorers/_base.py

def set_results(self, scored_docs: dict[int, ScoredDoc]) -> None:
    """Store raw scored documents and compute reduction."""
    self._raw_docs = scored_docs
    self._reduced_docs = self.reduce(scored_docs)

MetricKey `dataclass` ¶

MetricKey(metric: str, scorer: str, is_stderr: bool = False)

Structured representation of a "metric,scorer" key.

Attributes¶

metric `instance-attribute` ¶

metric: str

scorer `instance-attribute` ¶

scorer: str

is_stderr `class-attribute` `instance-attribute` ¶

is_stderr: bool = False

parent_metric `property` ¶

parent_metric: str | None

Extract parent from composite names: 'pass@1(exact_match)' → 'exact_match'.
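
The property's implementation is not shown on this page. One way to get the described behaviour is a regular expression, sketched here as an assumption and not as the library's actual code:

```python
import re
from typing import Optional

# Matches "<sub_metric>(<parent>)" and captures the parent name.
_COMPOSITE = re.compile(r"^[^()]+\((?P<parent>.+)\)$")


def parent_metric(name: str) -> Optional[str]:
    """Return the parent of a composite metric name, or None if not composite."""
    match = _COMPOSITE.match(name)
    return match.group("parent") if match else None


composite_parent = parent_metric("pass@1(exact_match)")
plain_parent = parent_metric("exact_match")
```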

Functions¶

__str__ ¶

__str__() -> str

Source code in lm_eval/scorers/_types.py

def __str__(self) -> str:
    name = f"{self.metric}_stderr" if self.is_stderr else self.metric
    return f"{name},{self.scorer}"

parse `classmethod` ¶

parse(key: str) -> MetricKey | None

Parse a 'metric,scorer' string. Returns None if not a metric key.

Source code in lm_eval/scorers/_types.py

@classmethod
def parse(cls, key: str) -> MetricKey | None:
    """Parse a ``'metric,scorer'`` string. Returns ``None`` if not a metric key."""
    if "," not in key:
        return None
    left, _, scorer = key.partition(",")
    if left.endswith("_stderr"):
        return cls(metric=left[: -len("_stderr")], scorer=scorer, is_stderr=True)
    return cls(metric=left, scorer=scorer)
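
The round trip between `__str__` and `parse` can be checked with a stand-in that copies the two methods listed above. `MetricKeySketch` is a hypothetical name used so the example runs without the library installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricKeySketch:
    """Stand-in that mirrors the __str__ and parse listings above."""
    metric: str
    scorer: str
    is_stderr: bool = False

    def __str__(self) -> str:
        name = f"{self.metric}_stderr" if self.is_stderr else self.metric
        return f"{name},{self.scorer}"

    @classmethod
    def parse(cls, key: str) -> Optional["MetricKeySketch"]:
        if "," not in key:
            return None
        left, _, scorer = key.partition(",")
        if left.endswith("_stderr"):
            return cls(metric=left[: -len("_stderr")], scorer=scorer, is_stderr=True)
        return cls(metric=left, scorer=scorer)


stderr_key = MetricKeySketch("exact_match", "strict-match", is_stderr=True)
round_tripped = MetricKeySketch.parse(str(stderr_key))
not_a_key = MetricKeySketch.parse("alias")
```

One consequence of the suffix check is that a metric whose own name ends in `_stderr` is parsed as a stderr key for the shorter name.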

ScoredDoc `dataclass` ¶

ScoredDoc(doc_id: int, reference: Reference, scores: dict[str, list[float]])

Immutable per-document raw scoring result.

Created by score_doc() / score_instances(). Contains per-repeat values that haven't been reduced yet. After reduction, a ReducedDoc is produced — ScoredDoc itself is never mutated.

Attributes¶

doc_id `instance-attribute` ¶

doc_id: int

reference `instance-attribute` ¶

reference: Reference

scores `instance-attribute` ¶

scores: dict[str, list[float]]


ChoiceMatchScorer `dataclass` ¶

ChoiceMatchScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Bases: GenScorer



Scorer for free-form generation scored by exact match against choices.

Attributes¶

default_metric_cfg `class-attribute` ¶

default_metric_cfg: list[dict[str, Any]] = _EXACT_MATCH_METRIC


FirstTokenScorer `dataclass` ¶

FirstTokenScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Bases: GenScorer



Scorer that strips whitespace before matching (single-token extraction).

Attributes¶

default_filter_cfg `class-attribute` ¶

default_filter_cfg: list[dict[str, Any]] = [{'function': 'remove_whitespace'}]

default_metric_cfg `class-attribute` ¶

default_metric_cfg: list[dict[str, Any]] = _EXACT_MATCH_METRIC


RegexExtractionScorer `dataclass` ¶

RegexExtractionScorer(*, name: str, filter: FilterEnsemble, metrics: list[Metric] | None = None, context: dict[str, Any] = dict())

Bases: GenScorer



Scorer that applies regex extraction.

Attributes¶

default_filter_cfg `class-attribute` ¶

default_filter_cfg: list[dict[str, Any]] = [{'function': 'regex'}]

default_metric_cfg `class-attribute` ¶

default_metric_cfg: list[dict[str, Any]] = _EXACT_MATCH_METRIC

Functions¶

build_scorer ¶

build_scorer(cfg: _ScorerCfg | None = None, output_type: str | None = None, scorer_type: str | ScorerConfig | None = None) -> Scorer

Construct the appropriate scorer subclass.

cfg is a _ScorerCfg (normalised pipeline config from TaskConfig.filter_list).

scorer_type can be:

str: scorer name, resolved from the scorer registry (e.g. "first_token" → FirstTokenScorer).
ScorerConfig dict: {"type": "scorer_name", "kwargs": {...}}. In the source below, constructor arguments are read from the "kwargs" key via scorer_type.get("kwargs", {}); other top-level keys are not forwarded.
None: fall back to GenScorer / LLScorer based on output_type.

Metrics are resolved inside Scorer.from_dict with a 3-tier precedence: cfg > scorer class default > DEFAULT_METRIC_REGISTRY.
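
The scorer_type handling can be sketched as a small resolver that follows the source listing below. `resolve_scorer` is a hypothetical helper: it returns a label instead of a class so it can run without the scorer registry.

```python
def resolve_scorer(scorer_type, output_type):
    """Return (scorer label, constructor kwargs) for the given inputs."""
    scorer_kwargs = {}
    scorer_name = None

    if isinstance(scorer_type, dict):
        scorer_name = scorer_type["type"]
        # Constructor arguments come from the "kwargs" key in the listing.
        scorer_kwargs = scorer_type.get("kwargs", {})
    elif isinstance(scorer_type, str):
        scorer_name = scorer_type

    if scorer_name is not None:
        return f"registry:{scorer_name}", scorer_kwargs
    if output_type == "generate_until":
        return "GenScorer", scorer_kwargs
    if output_type in ("loglikelihood", "loglikelihood_rolling", "multiple_choice"):
        return "LLScorer", scorer_kwargs
    raise ValueError(f"Cannot infer scorer for output_type={output_type!r}")


by_name = resolve_scorer("first_token", "generate_until")
by_dict = resolve_scorer({"type": "code_exec", "kwargs": {"timeout": 5}}, None)
by_output = resolve_scorer(None, "multiple_choice")
```

An explicit scorer_type always wins over output_type, so a registered scorer can be used for any output type.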

Source code in lm_eval/scorers/_base.py

def build_scorer(
    cfg: _ScorerCfg | None = None,
    output_type: str | None = None,
    scorer_type: str | ScorerConfig | None = None,
) -> Scorer:
    """Construct the appropriate scorer subclass.

    *cfg* is a [_ScorerCfg][_ScorerCfg] (normalised pipeline config from
    ``TaskConfig.filter_list``).

    *scorer_type* can be:

    * **str** — scorer name, resolved from the scorer registry
      (e.g. ``"first_token"`` → [FirstTokenScorer][lm_eval.scorers.extraction.FirstTokenScorer]).
    * [ScorerConfig][lm_eval.config.task.ScorerConfig] **dict** —
      ``{"type": "scorer_name", ...kwargs}`` where extra keys are
      forwarded to the scorer constructor as kwargs.
    * **None** — fall back to [GenScorer][GenScorer] / [LLScorer][LLScorer]
      based on *output_type*.

    Metrics are resolved inside ``Scorer.from_dict`` with a 3-tier
    precedence: cfg > scorer class default > DEFAULT_METRIC_REGISTRY.
    """
    scorer_kwargs: dict[str, Any] = {}
    scorer_name: str | None = None

    if isinstance(scorer_type, dict):
        scorer_name = scorer_type["type"]
        scorer_kwargs = scorer_type.get("kwargs", {})
    elif isinstance(scorer_type, str):
        scorer_name = scorer_type

    if scorer_name is not None:
        from lm_eval.api.registry import get_scorer

        cls = get_scorer(scorer_name)
    elif output_type == "generate_until":
        cls = GenScorer
    elif output_type in (
        "loglikelihood",
        "loglikelihood_rolling",
        "multiple_choice",
    ):
        cls = LLScorer
    else:
        raise ValueError(
            f"Cannot infer scorer for output_type={output_type!r}. "
            f"Pass an explicit scorer_type or use a known output_type."
        )

    if cfg is None:
        cfg = {"name": scorer_name or "none", "filter": [], "metric_list": []}
    return cls.from_dict(cfg, output_type=output_type, **scorer_kwargs)
