Metrics¶
The metrics system provides per-sample scoring, corpus-level aggregation, and reduction functions for handling repeated evaluations.
Built-in Metrics¶
Loglikelihood Metrics¶
acc
¶
acc(references: int | list[int], predictions: LLResults, multiple_targets=False) -> int
Accuracy.
For multiple-choice (multiple lls): 1 if argmax(lls) matches gold. For a single loglikelihood (one ll): 1 if the continuation was decoded greedily.
Source code in lm_eval/api/metrics/ll.py
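The two cases described for acc can be sketched in plain Python. This is an illustration only, not the library's code: the function names `argmax_accuracy` and `greedy_accuracy` are hypothetical, and the inputs are bare lists and booleans rather than an LLResults object.

```python
from typing import Sequence


def argmax_accuracy(lls: Sequence[float], gold: int) -> int:
    """Multiple-choice case: 1 if the highest log-likelihood is the gold choice."""
    # max() returns the first maximal index, so ties resolve to the earliest choice.
    best = max(range(len(lls)), key=lambda i: lls[i])
    return int(best == gold)


def greedy_accuracy(is_greedy: bool) -> int:
    """Single-loglikelihood case: 1 if the continuation was decoded greedily."""
    return int(is_greedy)


print(argmax_accuracy([-2.3, -0.4, -1.7], gold=1))  # 1
print(argmax_accuracy([-2.3, -0.4, -1.7], gold=0))  # 0
```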
acc_norm
¶
acc_norm(references: int | list[int], predictions: LLResults, multiple_targets=False) -> int
Character-length-normalised accuracy: picks the choice with the highest ll / char_len.
Source code in lm_eval/api/metrics/ll.py
acc_bytes
¶
acc_bytes(references: int | list[int], predictions: LLResults, multiple_targets=False) -> int
Byte-length-normalised accuracy: picks the choice with the highest ll / byte_len.
Source code in lm_eval/api/metrics/ll.py
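Length normalisation, as used by acc_norm and acc_bytes, can change which choice wins. The sketch below assumes that the lengths are taken from the choice strings and that bytes means the UTF-8 encoding; `normalised_argmax` is a hypothetical helper, not part of lm_eval.

```python
from typing import Sequence


def normalised_argmax(lls: Sequence[float], choices: Sequence[str], unit: str = "char") -> int:
    """Index of the choice with the highest ll / length (choices must be non-empty strings)."""

    def length(text: str) -> int:
        # Assumption: byte length means the UTF-8 encoded length.
        return len(text.encode("utf-8")) if unit == "byte" else len(text)

    scores = [ll / length(choice) for ll, choice in zip(lls, choices)]
    return max(range(len(scores)), key=scores.__getitem__)


# The raw argmax would pick index 0 (-3.0 > -6.5); per-character scores are -1.0 and -0.5.
print(normalised_argmax([-3.0, -6.5], ["yes", "certainly not"]))  # 1

# Characters and bytes disagree once a choice contains non-ASCII text.
print(normalised_argmax([-3.0, -2.5], ["n\u00e9", "no"], unit="char"))  # 1
print(normalised_argmax([-3.0, -2.5], ["n\u00e9", "no"], unit="byte"))  # 0
```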
acc_mutual_info_fn
¶
acc_mutual_info_fn(references: int | list[int], predictions: LLResults, multiple_targets=False) -> int
Mutual-information-weighted accuracy: picks the choice with the highest ll - ll_unconditional.
Source code in lm_eval/api/metrics/ll.py
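The selection rule ll - ll_unconditional can be illustrated as follows. The sketch assumes the unconditional log-likelihoods are supplied as a list aligned with the choices; `mutual_info_argmax` is a hypothetical name.

```python
from typing import Sequence


def mutual_info_argmax(lls: Sequence[float], lls_unconditional: Sequence[float]) -> int:
    """Index of the choice whose conditional ll most exceeds its unconditional ll."""
    scores = [cond - uncond for cond, uncond in zip(lls, lls_unconditional)]
    return max(range(len(scores)), key=scores.__getitem__)


# Choice 0 has the higher conditional ll, but choice 1 gains far more from the context:
# the scores are [-0.5, 2.0].
print(mutual_info_argmax([-1.0, -2.0], [-0.5, -4.0]))  # 1
```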
exact_match_mc
¶
exact_match_mc(references: int | list[int], predictions: LLResults) -> int
1 if the gold completion was decoded greedily (every token was argmax), else 0.
Source code in lm_eval/api/metrics/ll.py
bpb
¶
bpb(references: int, predictions: LLResults) -> float
Bits-per-byte of the gold completion: -ll[gold] / byte_len[gold] * NAT_TO_BIT.
Lower is better: the value is the number of bits the model needs per byte of the correct answer.
Source code in lm_eval/api/metrics/ll.py
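A worked instance of the bpb formula is shown below. It assumes NAT_TO_BIT equals 1 / ln(2), the standard nats-to-bits factor, and that byte_len is the UTF-8 length of the gold completion; `bits_per_byte_sample` is a hypothetical helper.

```python
import math

# Assumption: NAT_TO_BIT is the usual nats-to-bits conversion factor.
NAT_TO_BIT = 1.0 / math.log(2)


def bits_per_byte_sample(ll_gold: float, gold_text: str) -> float:
    """-ll[gold] / byte_len[gold] * NAT_TO_BIT for a single sample."""
    n_bytes = len(gold_text.encode("utf-8"))
    return -ll_gold / n_bytes * NAT_TO_BIT


# A 4-byte completion whose log-likelihood is -8 * ln(2) nats costs 8 bits in total,
# which is approximately 2 bits per byte.
print(bits_per_byte_sample(-8 * math.log(2), "abcd"))
```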
logprob_fn
¶
logprob_fn(references: int, predictions: LLResults) -> float
Raw log-probability of the gold completion (in nats).
Source code in lm_eval/api/metrics/ll.py
brier_score
¶
brier_score(references: int, predictions: LLResults) -> float
Per-sample Brier score: sum of squared errors between softmax probs and one-hot gold.
Source code in lm_eval/api/metrics/ll.py
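The per-sample Brier computation described above can be sketched with the standard library alone. The helper names are hypothetical, and the softmax is taken directly over the per-choice log-likelihoods.

```python
import math
from typing import List, Sequence


def softmax(lls: Sequence[float]) -> List[float]:
    """Numerically stable softmax over log-likelihoods."""
    shift = max(lls)
    exps = [math.exp(x - shift) for x in lls]
    total = sum(exps)
    return [e / total for e in exps]


def brier_sample(lls: Sequence[float], gold: int) -> float:
    """Sum of squared errors between softmax probabilities and the one-hot gold vector."""
    probs = softmax(lls)
    return sum((p - (1.0 if i == gold else 0.0)) ** 2 for i, p in enumerate(probs))


# Two equally likely choices: (0.5 - 1)^2 + (0.5 - 0)^2 = 0.5
print(brier_sample([0.0, 0.0], gold=0))  # 0.5
```

With two choices the score ranges from 0 (all probability on the gold choice) to 2 (all probability on the wrong choice).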
Generation Metrics¶
exact_match_fn
¶
exact_match_fn(references: list[str] | list[list[str]], predictions: list[str], multiple_targets: bool = False, **kwargs) -> dict[str, list[int]]
Source code in lm_eval/api/metrics/generation.py
Aggregation Functions¶
mean
¶
median
¶
nanmean
¶
weighted_mean
¶
perplexity
¶
weighted_perplexity
¶
bits_per_byte
¶
Corpus-Level Metrics¶
Metrics that must operate across the entire corpus rather than per-sample.
Perplexity
¶
Bases: CorpusMetric['LLResults', float]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.Perplexity
Corpus-level perplexity for loglikelihood tasks.
Per-document: extracts the gold log-likelihood.
Aggregation: exp(-mean(lls)) across all documents.
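The aggregation step exp(-mean(lls)) looks like this on a list of gold log-likelihoods. This is a minimal sketch with a hypothetical function name, not the class's implementation.

```python
import math
from typing import Sequence


def corpus_perplexity(gold_lls: Sequence[float]) -> float:
    """exp(-mean(lls)) over the gold log-likelihood of every document."""
    return math.exp(-sum(gold_lls) / len(gold_lls))


# Gold probabilities 0.5 and 0.25 give a geometric-mean probability of sqrt(0.125),
# so the perplexity is its reciprocal, sqrt(8), about 2.83.
print(corpus_perplexity([math.log(0.5), math.log(0.25)]))
```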
WordPerplexity
¶
Bases: CorpusMetric['LLResults', tuple[float, int]]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.WordPerplexity
Corpus-level word perplexity for rolling loglikelihood tasks.
Computes the exponentiated average negative log-likelihood per word across all documents, weighted by word count.
Lower scores are better.
BytePerplexity
¶
Bases: CorpusMetric['LLResults', tuple[float, int]]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.BytePerplexity
Corpus-level byte perplexity for rolling loglikelihood tasks.
Computes the exponentiated average negative log-likelihood per byte across all documents, weighted by byte count.
Lower scores are better.
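Word and byte perplexity share the same weighted form: sum the log-likelihoods, sum the unit counts, and exponentiate the negated ratio. The sketch assumes each document contributes a (log_likelihood, unit_count) pair, which matches the tuple[float, int] intermediate type listed for these classes; `weighted_perplexity_sketch` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def weighted_perplexity_sketch(items: Sequence[Tuple[float, int]]) -> float:
    """exp(-total_ll / total_units), where units are words or bytes."""
    total_ll = sum(ll for ll, _ in items)
    total_units = sum(n for _, n in items)
    return math.exp(-total_ll / total_units)


# Two documents with 3 and 1 units, each unit costing ln(2) nats: perplexity is approximately 2.
docs = [(-3 * math.log(2), 3), (-1 * math.log(2), 1)]
print(weighted_perplexity_sketch(docs))
```

Dividing the summed log-likelihood by the summed count means long documents weigh more than short ones, which is not the same as averaging per-document perplexities.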
BitsPerByte
¶
Bases: CorpusMetric['LLResults', tuple[float, int]]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.BitsPerByte
Corpus-level bits-per-byte for rolling loglikelihood tasks.
Converts the average negative log-likelihood per byte into bits by dividing by log(2), weighted by byte count across all documents.
Lower scores are better.
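The conversion to bits can be written out as below, again assuming each document contributes a (log_likelihood, byte_count) pair. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def corpus_bits_per_byte(items: Sequence[Tuple[float, int]]) -> float:
    """Byte-weighted average negative log-likelihood, converted from nats to bits."""
    total_ll = sum(ll for ll, _ in items)
    total_bytes = sum(n for _, n in items)
    return -total_ll / total_bytes / math.log(2)


# 12 bits of total cost spread over 6 bytes: approximately 2 bits per byte.
docs = [(-8 * math.log(2), 4), (-4 * math.log(2), 2)]
print(corpus_bits_per_byte(docs))
```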
Bleu
¶
Bases: _SacrebleuCorpusMetric
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.corpus._SacrebleuCorpusMetric --> lm_eval.api.metrics.Bleu
BLEU score for generated text.
BLEU (Bilingual Evaluation Understudy) compares n-grams in the candidate translation against n-grams in the reference text and counts the matches.
Higher is better.
Chrf
¶
Bases: _SacrebleuCorpusMetric
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.corpus._SacrebleuCorpusMetric --> lm_eval.api.metrics.Chrf
chrF++ score for generated text.
chrF++ is based on character n-gram precision and recall enhanced with word n-grams.
Higher is better.
Ter
¶
Bases: _SacrebleuCorpusMetric
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.corpus._SacrebleuCorpusMetric --> lm_eval.api.metrics.Ter
Translation Error Rate for generated text.
Measures the number of edits required to change a system output into one of the references.
Lower is better.
F1
¶
Bases: CorpusMetric['LLResults', tuple[int, int]]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.F1
F1 score for multiple choice tasks.
Computes the maximum F1 score between gold labels and predicted labels (argmax of log-likelihoods).
Higher is better.
MCC
¶
Bases: CorpusMetric['LLResults', tuple[int, int]]
Inheritance: lm_eval.api.metrics.corpus.CorpusMetric --> lm_eval.api.metrics.MCC
Matthews Correlation Coefficient for multiple choice tasks.
Computes MCC between gold labels and predicted labels (argmax of log-likelihoods).
Higher is better.
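MCC cannot be averaged per sample because it depends on the full confusion matrix, which is why it is a corpus-level metric. The sketch below covers only the binary case and assumes that the tuple[int, int] intermediate is a (gold, predicted) label pair; `binary_mcc` is a hypothetical helper, and the library's handling of more than two classes is not shown.

```python
import math
from typing import Sequence, Tuple


def binary_mcc(items: Sequence[Tuple[int, int]]) -> float:
    """Matthews correlation for (gold, predicted) pairs with labels in {0, 1}."""
    tp = sum(1 for g, p in items if g == 1 and p == 1)
    tn = sum(1 for g, p in items if g == 0 and p == 0)
    fp = sum(1 for g, p in items if g == 0 and p == 1)
    fn = sum(1 for g, p in items if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


pairs = [(1, 1), (1, 0), (0, 0), (0, 0)]
print(binary_mcc(pairs))  # 2 / sqrt(12), about 0.577
```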
Stderr Functions¶
stderr_for_metric
¶
stderr_for_metric(metric: Callable[[Sequence[T]], float], bootstrap_iters: int) -> Callable[[Sequence[T]], float] | None
Return a function that estimates the standard error of metric(xs).
- mean has a closed-form SE (sample_stddev / sqrt(n)).
- All other aggregations use bootstrap_stderr with bootstrap_iters draws.
- Returns None when bootstrap_iters <= 0.
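The closed-form case for the mean is short enough to write out. This sketch assumes sample_stddev is the (n - 1)-denominator sample standard deviation, which is what `statistics.stdev` computes; the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence


def mean_standard_error(xs: Sequence[float]) -> float:
    """sample_stddev / sqrt(n); needs at least two values."""
    return statistics.stdev(xs) / math.sqrt(len(xs))


# Four 0/1 accuracy scores: the sample variance is 1/3, so SE = sqrt(1/3) / 2, about 0.289.
print(mean_standard_error([1, 0, 1, 0]))
```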
Source code in lm_eval/api/metrics/stderr.py
bootstrap_stderr
¶
Bootstrap estimate of the standard error of the statistic f(xs), using up to iters resamples drawn in chunks of at most 1000.
Runs in parallel unless LMEVAL_DISABLE_MULTIPROC is set.
Source code in lm_eval/api/metrics/stderr.py
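The idea behind a bootstrap standard error can be shown in a few lines. This is a serial sketch under stated assumptions, not the library's implementation: it omits the chunking and multiprocessing mentioned above, the `seed` parameter is added only for reproducibility, and the spread of the resampled statistics is measured with the sample standard deviation.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_standard_error(
    f: Callable[[Sequence[float]], float],
    xs: Sequence[float],
    iters: int,
    seed: int = 0,
) -> float:
    """Standard deviation of f over `iters` resamples of xs drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iters):
        # Each resample has the same size as the original data.
        resample = rng.choices(xs, k=len(xs))
        estimates.append(f(resample))
    return statistics.stdev(estimates)  # requires iters >= 2


scores = [1, 2, 3, 4, 5, 6, 7, 8]
print(bootstrap_standard_error(statistics.median, scores, iters=200))
```

Because only f is evaluated on each resample, the same routine works for any aggregation, including those without a closed-form standard error.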
mean_stderr
¶
Types¶
The core metric wrapper. Each metric defines a per-sample function, an aggregation strategy, and optionally a reduction for repeated samples.
Metric
dataclass
¶
Metric(name: str, fn: MetricFn[_T], kwargs: Mapping[str, Any] = dict(), aggregation: AggregationFn[_K] | None = None, higher_is_better: bool = True, output_type: set[str] = (lambda: {'multiple_choice'})(), reduction: ReductionFn[_T, _K] | None = take_first)
Bases: Generic[_T, _K]
Encapsulates information about a single metric.
This is the canonical representation for metrics used throughout lm_eval.
| CLASS TYPE PARAMETER | DESCRIPTION |
|---|---|
| _T | Per-sample result type from fn. |
| _K | Reduced type after collapsing repeats via reduction. |
Type chain: fn(...) -> _T, reduction(...) -> _K, aggregation(Sequence[_K]) -> float.
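The type chain can be traced with three plain functions. The sketch does not use the Metric class itself: `score`, `take_first_repeat`, and `mean_of` are hypothetical stand-ins for fn, reduction, and aggregation, and the reduction is assumed to receive the list of per-repeat scores for one document.

```python
from typing import List, Sequence, Tuple


def score(reference: str, prediction: str) -> int:
    """Stand-in for fn: per-sample exact match, producing _T."""
    return int(reference.strip() == prediction.strip())


def take_first_repeat(scores: Sequence[int]) -> int:
    """Stand-in for reduction: keep the first repeat, producing _K."""
    return scores[0]


def mean_of(values: Sequence[int]) -> float:
    """Stand-in for aggregation: Sequence[_K] -> float."""
    return sum(values) / len(values)


# Each document has one reference and two repeated generations.
docs: List[Tuple[str, List[str]]] = [("4", ["4", "5"]), ("7", ["8", "7"]), ("9", ["9", "9"])]

per_doc = [take_first_repeat([score(ref, p) for p in preds]) for ref, preds in docs]
result = mean_of(per_doc)

print(per_doc)  # [1, 0, 1]
print(result)   # 2/3, about 0.667
```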
Attributes¶
output_type
class-attribute
instance-attribute
¶
Functions¶
__post_init__
¶
Source code in lm_eval/api/metrics/metric.py
from_dict
classmethod
¶
from_dict(cfg: dict[str, Any] | MetricConfig, output_type: str | None = None) -> Metric[Any, Any]
Source code in lm_eval/api/metrics/metric.py
compute
¶
aggregate
¶
Aggregate a list of metric values into a single score.
Source code in lm_eval/api/metrics/metric.py
MetricFn
¶
Bases: Protocol[_T]
Callable that computes a per-sample metric value.
AggregationFn
¶
Bases: Protocol[_K]
Callable that aggregates per-document values into a corpus-level float.
ReductionFn
¶
Bases: Protocol[_T, _K]
Callable that reduces per-repeat scores into one value per document.
CorpusMetric
¶
Bases: ABC, Generic[_R, _T]
Base class for corpus-level metrics.
Corpus-level metrics are computed across multiple samples and typically require aggregation of intermediate results.
Data flow
__call__(references, predictions: _R) -> _T # per document intermediate result
aggregation(list[_T]) -> float # corpus level
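The two-stage data flow can be mimicked with a small abstract base class. Everything here is an illustration and does not subclass the real CorpusMetric: `SketchCorpusMetric` and `MicroAccuracy` are hypothetical, and the per-document inputs are plain lists of strings.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, Tuple, TypeVar

R = TypeVar("R")  # per-document prediction type
T = TypeVar("T")  # per-document intermediate result


class SketchCorpusMetric(ABC, Generic[R, T]):
    """Hypothetical mirror of the __call__ / aggregation split."""

    @abstractmethod
    def __call__(self, references, predictions: R) -> T:
        """Per-document intermediate result."""

    @abstractmethod
    def aggregation(self, items: List[T]) -> float:
        """Corpus-level score from all intermediate results."""


class MicroAccuracy(SketchCorpusMetric[List[str], Tuple[int, int]]):
    def __call__(self, references: List[str], predictions: List[str]) -> Tuple[int, int]:
        correct = sum(r == p for r, p in zip(references, predictions))
        return correct, len(references)

    def aggregation(self, items: List[Tuple[int, int]]) -> float:
        # Pool the counts across documents before dividing.
        return sum(c for c, _ in items) / sum(n for _, n in items)


metric = MicroAccuracy()
items = [metric(["a", "b"], ["a", "x"]), metric(["c"], ["c"])]
print(items)                      # [(1, 2), (1, 1)]
print(metric.aggregation(items))  # 2/3, about 0.667
```

Returning counts rather than a per-document ratio is what makes the final score a corpus-level quantity: the documents are pooled before the division.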
Functions¶
__call__
abstractmethod
¶
aggregation
abstractmethod
¶
reduce
¶
Collapse multiple repeats of a sample into one value. Corpus metrics only support repeat=1.
Source code in lm_eval/api/metrics/corpus.py
LLResults
dataclass
¶
LLResults(results: list[Any], targets: int | list[int] | str | list[str], ctx: str = '', choices: Sequence[str] = list(), lls_mutual_info: NDArray[float64] = _empty_array(), metadata: dict[str, Any] = dict(), *, lls: NDArray[float64], is_greedy: Sequence[bool])
Per-doc bundle of log-likelihoods, greedy flags, and choices for loglikelihood tasks.
Built via from_instances from all LLInstances sharing a doc_id,
and passed as predictions to metrics in LLScorer.