Migration Guide: v0.4 → v0.5¶

This guide covers the key changes in v0.5 of the LM Evaluation Harness and how to update your code.

Summary of changes¶

Area	What changed
Task loading	`TaskManager.load()` replaces `get_task_dict()` / `load_task_or_group()`
Groups	`Group` dataclass replaces `ConfigurableGroup`
Scoring	New `Scorer` pipeline replaces direct metric processing
Metrics	`Metric[_T, _K]` generic dataclass, modular metric system
Formats	New declarative prompt format system (additive, not breaking)
Instances	Generic `Instance[InputT, OutputT]` with typed aliases
Config	`TaskConfig` is now a proper dataclass with TypedDict configs

Task loading¶

Before (v0.4)¶

from lm_eval.tasks import get_task_dict, TaskManager

task_manager = TaskManager()
task_dict = get_task_dict(["hellaswag", "arc_easy"], task_manager)
# task_dict is a nested dict with ConfigurableGroup keys

After (v0.5)¶

from lm_eval.tasks import TaskManager

tm = TaskManager()
loaded = tm.load(["hellaswag", "arc_easy"])

loaded["tasks"]      # {"hellaswag": Task, "arc_easy": Task}
loaded["groups"]     # {}
loaded["group_map"]  # {}

Key differences:

load() returns a TaskDict TypedDict with flat tasks, groups, and group_map keys
Supports runtime overrides: tm.load(["arc_easy"], overrides={"arc_easy": {"num_fewshot": 5}})
Supports inline config dicts and the @format / :: path syntax
get_task_dict() and load_task_or_group() are deprecated but still work

Groups¶

Before (v0.4)¶

from lm_eval.api.group import ConfigurableGroup

# ConfigurableGroup with custom __eq__/__hash__ based on group_name
# Used as dict keys in nested task dicts

After (v0.5)¶

from lm_eval.api.group import Group

# Group is a @dataclass — simpler, no custom __eq__/__hash__
# Returned in TaskDict["groups"] as a flat dict, not as dict keys

ConfigurableGroup is deprecated. Group is a straightforward dataclass.

Scoring pipeline¶

Before (v0.4)¶

Metrics were processed directly in the evaluator via Task.process_results() and Task.aggregation(). Filter pipelines were managed separately.

After (v0.5)¶

The Scorer class encapsulates the full pipeline: filter → score → reduce → aggregate.

from lm_eval.scorers import Scorer, GenScorer, LLScorer, build_scorer

Scorer hierarchy:

Scorer — abstract base
GenScorer — for generate_until tasks (with 3 extensibility tiers)
LLScorer — for loglikelihood / multiple_choice tasks
Built-in scorers: ChoiceMatchScorer, FirstTokenScorer, RegexExtractionScorer

For YAML task authors: No changes needed for simple tasks. The task YAML fields (metric_list, filter_list) still work and are automatically routed through the scorer system.

For Python API users: If you were calling Task.process_results() directly, use the scorer pipeline instead.

Metrics¶

Before (v0.4)¶

# Monolithic lm_eval/api/metrics.py
from lm_eval.api.metrics import register_metric, register_aggregation

After (v0.5)¶

# Modular lm_eval/api/metrics/ package
from lm_eval.api.metrics import Metric, register_metric, register_aggregation

The Metric class is now a generic frozen dataclass:

@dataclass(frozen=True)
class Metric(Generic[_T, _K]):
    name: str
    fn: MetricFn[_T]
    kwargs: Mapping[str, Any] = field(default_factory=dict)
    aggregation: AggregationFn[_K] | None = None
    higher_is_better: bool = True
    output_type: set[str] = field(default_factory=lambda: {"multiple_choice"})
    reduction: ReductionFn[_T, _K] | None = take_first

Type chain: fn(...) → _T, reduction(...) → _K, aggregation(Sequence[_K]) → float.

The metrics module is now split into sub-modules: metric.py, aggregations.py, corpus.py, generation.py, ll.py, reduce.py, stderr.py, utils.py.

For most users: register_metric and register_aggregation still work the same way.

Formats (new feature)¶

The declarative prompt format system is entirely new — no migration needed, but it's one of the biggest quality-of-life improvements in v0.5. Formats let task authors skip Jinja entirely for common prompt patterns.

Before (v0.4) — manual Jinja for a standard MCQA task:

output_type: multiple_choice
doc_to_text: "Question: {{question}}\n{% for letter, choice in zip(['A','B','C','D'], choices) %}{{letter}}. {{choice}}\n{% endfor %}Answer:"
doc_to_target: "{{['A','B','C','D'][answer]}}"
doc_to_choice: "{{choices}}"

After (v0.5) — same result with a format:

doc_to_text: question
doc_to_target: answer
doc_to_choice: choices
formats: mcqa

The format auto-generates the Jinja templates, sets output_type, configures delimiters, and wires up scoring.

Runtime format selection — try different prompt styles without touching YAML:

lm-eval run --tasks my_task@mcqa --model hf --model_args pretrained=gpt2
lm-eval run --tasks my_task@generate --model hf --model_args pretrained=gpt2
lm-eval run --tasks my_task@cloze --model hf --model_args pretrained=gpt2

Built-in formats: mcqa, cloze, generate, cot. See Prompt Formats for the full guide.

Instance types¶

Before (v0.4)¶

from lm_eval.api.instance import Instance

# Instance with metadata as a tuple: (task_name, doc_id, repeats)
inst = Instance(
    request_type="loglikelihood",
    doc=doc,
    arguments=("context", "continuation"),
    metadata=(task_name, doc_id, repeats),
)

After (v0.5)¶

from lm_eval.api.instance import Instance, LLInstance, GenInstance

# Instance is generic with explicit fields
inst = Instance(
    request_type="loglikelihood",
    doc=doc,
    arguments=("context", "continuation"),
    task_name=task_name,
    doc_id=doc_id,
    repeats=repeats,
    target=gold_answer,
)

Key changes:

Instance[InputT, OutputT] is now generic
task_name, doc_id, repeats are explicit fields (not packed in a metadata tuple)
New target field for gold references
New additional_args field for multimodal support
metadata is now a dict[str, Any] (not a tuple)
Type aliases: LLInstance, GenInstance
Backward compatibility: if you pass a tuple as metadata, __post_init__ unpacks it

Type aliases¶

New type aliases in lm_eval.api._types:

Type	Definition
`Doc`	`dict[str, Any]`
`DataSplit`	`datasets.Dataset \\| Sequence[Doc]`
`Dataset`	`Mapping[str, DataSplit] \\| datasets.DatasetDict`
`Context`	`str \\| list[dict[str, str]]`
`LLArgs`	`tuple[str, str]`
`LLOutput`	`tuple[float, bool]`
`GenArgs`	`tuple[Context, GenKwargs]`
`Completion`	`str`
`Reference`	`str \\| list[str] \\| int \\| list[int] \\| None`

TaskConfig¶

TaskConfig is now a proper dataclass (in lm_eval.config.task) with typed config dictionaries:

MetricConfig — TypedDict for metric entries in metric_list
FilterStep — TypedDict for filter steps in filter_list
ScorerConfig — TypedDict for scorer configuration
FilterPipeline — A named pipeline with filters and optional per-pipeline metrics

What still works from v0.4¶

YAML task configs are fully backward compatible
register_metric, register_aggregation, register_filter, register_model decorators work the same
simple_evaluate() API is unchanged
--apply_chat_template, --fewshot_as_multiturn, --system_instruction work the same
All existing task YAMLs in lm_eval/tasks/ continue to work without modification