Migration Guide: v0.4 → v0.5
This guide covers the key changes in v0.5 of the LM Evaluation Harness and how to update your code.
Summary of changes
| Area | What changed |
|---|---|
| Task loading | TaskManager.load() replaces get_task_dict() / load_task_or_group() |
| Groups | Group dataclass replaces ConfigurableGroup |
| Scoring | New Scorer pipeline replaces direct metric processing |
| Metrics | Metric[_T, _K] generic dataclass, modular metric system |
| Formats | New declarative prompt format system (additive, not breaking) |
| Instances | Generic Instance[InputT, OutputT] with typed aliases |
| Config | TaskConfig is now a proper dataclass with TypedDict configs |
Task loading
Before (v0.4)
from lm_eval.tasks import get_task_dict, TaskManager
task_manager = TaskManager()
task_dict = get_task_dict(["hellaswag", "arc_easy"], task_manager)
# task_dict is a nested dict with ConfigurableGroup keys
After (v0.5)
from lm_eval.tasks import TaskManager
tm = TaskManager()
loaded = tm.load(["hellaswag", "arc_easy"])
loaded["tasks"] # {"hellaswag": Task, "arc_easy": Task}
loaded["groups"] # {}
loaded["group_map"] # {}
Key differences:
- load() returns a TaskDict TypedDict with flat tasks, groups, and group_map keys
- Supports runtime overrides: tm.load(["arc_easy"], overrides={"arc_easy": {"num_fewshot": 5}})
- Supports inline config dicts and the @format/::path syntax
- get_task_dict() and load_task_or_group() are deprecated but still work
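As a rough illustration of why the flat shape is easier to consume, the sketch below defines a local stand-in for the returned structure and lists task names without any recursion. The TaskDict class here is not imported from the harness, its value types are simplified, and the meaning given to group_map (group name to member task names) is an assumption.

```python
from typing import Any, TypedDict


class TaskDict(TypedDict):
    """Local stand-in for the structure returned by load(); value types are simplified."""

    tasks: dict[str, Any]
    groups: dict[str, Any]
    group_map: dict[str, list[str]]  # assumption: group name -> member task names


def task_names(loaded: TaskDict) -> list[str]:
    """Return task names in sorted order; no walk through nested group keys is needed."""
    return sorted(loaded["tasks"])


# Mirrors the shape shown above for tm.load(["hellaswag", "arc_easy"]).
# object() placeholders stand in for real Task instances.
loaded: TaskDict = {
    "tasks": {"hellaswag": object(), "arc_easy": object()},
    "groups": {},
    "group_map": {},
}
print(task_names(loaded))
```

Code that previously recursed through a nested dict to collect tasks can usually be reduced to a single lookup of this kind.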
Groups
Before (v0.4)
from lm_eval.api.group import ConfigurableGroup
# ConfigurableGroup with custom __eq__/__hash__ based on group_name
# Used as dict keys in nested task dicts
After (v0.5)
from lm_eval.api.group import Group
# Group is a @dataclass — simpler, no custom __eq__/__hash__
# Returned in TaskDict["groups"] as a flat dict, not as dict keys
ConfigurableGroup is deprecated. Group is a straightforward dataclass.
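One practical consequence can be shown with a small stand-in class. The sketch assumes Group uses default dataclass options, and its field names (name, members) are hypothetical, since this guide does not list the real fields. A default dataclass generates __eq__ and is therefore unhashable, which fits groups being stored as values in a flat dict keyed by name.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in; the real Group's fields are not shown in this guide."""

    name: str
    members: list[str] = field(default_factory=list)


g = Group(name="my_group", members=["arc_easy"])

# A plain @dataclass defines __eq__, which sets __hash__ to None,
# so instances cannot be used as dict keys.
try:
    {g: "value"}
    hashable = True
except TypeError:
    hashable = False

# A flat mapping keyed by the group's name works regardless of hashability.
groups = {g.name: g}
print(hashable, list(groups))
```

If your v0.4 code looked up results with a ConfigurableGroup object as the key, the equivalent lookup under this layout is by group name.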
Scoring pipeline
Before (v0.4)
Metrics were processed directly in the evaluator via Task.process_results() and Task.aggregation(). Filter pipelines were managed separately.
After (v0.5)
The Scorer class encapsulates the full pipeline: filter → score → reduce → aggregate.
Scorer hierarchy:
- Scorer: abstract base
- GenScorer: for generate_until tasks (with 3 extensibility tiers)
- LLScorer: for loglikelihood / multiple_choice tasks
- Built-in scorers: ChoiceMatchScorer, FirstTokenScorer, RegexExtractionScorer
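The four stages (filter, score, reduce, aggregate) can be sketched in plain Python to show what each one consumes and produces. None of the function names below come from the harness, and the real Scorer classes may divide the work differently. This is only a conceptual model of the data flow.

```python
from statistics import mean
from typing import Sequence


def filter_step(responses: Sequence[str]) -> list[str]:
    """Filter: normalize raw model responses (here: strip whitespace)."""
    return [r.strip() for r in responses]


def score_step(responses: Sequence[str], target: str) -> list[float]:
    """Score: exact match of each response against the gold target."""
    return [1.0 if r == target else 0.0 for r in responses]


def reduce_step(scores: Sequence[float]) -> float:
    """Reduce: collapse repeated samples for one document (here: take the first)."""
    return scores[0]


def aggregate_step(per_doc: Sequence[float]) -> float:
    """Aggregate: combine per-document values into a single number."""
    return mean(per_doc)


# Two documents, each with two sampled responses.
docs = [
    {"responses": [" Paris ", "Lyon"], "target": "Paris"},
    {"responses": ["Berlin", "Bonn"], "target": "Bonn"},
]
per_doc = [
    reduce_step(score_step(filter_step(d["responses"]), d["target"]))
    for d in docs
]
result = aggregate_step(per_doc)
print(per_doc, result)
```

Filter, score, and reduce operate per document, while aggregate runs once over the whole task.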
For YAML task authors: No changes needed for simple tasks. The task YAML fields (metric_list, filter_list) still work and are automatically routed through the scorer system.
For Python API users: If you were calling Task.process_results() directly, use the scorer pipeline instead.
Metrics
Before (v0.4)
# Monolithic lm_eval/api/metrics.py
from lm_eval.api.metrics import register_metric, register_aggregation
After (v0.5)
# Modular lm_eval/api/metrics/ package
from lm_eval.api.metrics import Metric, register_metric, register_aggregation
The Metric class is now a generic frozen dataclass:
@dataclass(frozen=True)
class Metric(Generic[_T, _K]):
    name: str
    fn: MetricFn[_T]
    kwargs: Mapping[str, Any] = field(default_factory=dict)
    aggregation: AggregationFn[_K] | None = None
    higher_is_better: bool = True
    output_type: set[str] = field(default_factory=lambda: {"multiple_choice"})
    reduction: ReductionFn[_T, _K] | None = take_first
Type chain: fn(...) → _T, reduction(...) → _K, aggregation(Sequence[_K]) → float.
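To make the type chain concrete, here is a self-contained sketch with a simplified local Metric that keeps only some of the fields listed above. The call conventions are assumptions: the guide does not show how the harness invokes fn or reduction, so the sketch passes a prediction and a reference to fn, and a sequence of per-sample values to reduction.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Generic, Mapping, Optional, Sequence, TypeVar

_T = TypeVar("_T")
_K = TypeVar("_K")


def take_first(values: Sequence[_T]) -> _T:
    """Stand-in reduction: keep the first per-sample value."""
    return values[0]


@dataclass(frozen=True)
class Metric(Generic[_T, _K]):
    """Simplified local stand-in with a subset of the fields listed above."""

    name: str
    fn: Callable[..., _T]
    kwargs: Mapping[str, Any] = field(default_factory=dict)
    aggregation: Optional[Callable[[Sequence[_K]], float]] = None
    reduction: Optional[Callable[[Sequence[_T]], _K]] = take_first


def exact_match(prediction: str, reference: str) -> bool:
    return prediction == reference


def fraction_true(items: Sequence[bool]) -> float:
    return sum(items) / len(items)


em: Metric[bool, bool] = Metric(
    name="exact_match", fn=exact_match, aggregation=fraction_true
)

# Two documents, two (prediction, reference) samples each:
# fn -> _T per sample, reduction -> _K per document, aggregation -> float.
samples = [[("a", "a"), ("b", "a")], [("x", "y"), ("y", "y")]]
per_doc = [em.reduction([em.fn(p, r) for p, r in doc]) for doc in samples]
score = em.aggregation(per_doc)
print(per_doc, score)
```

In this example _T and _K are both bool. A metric whose reduction averages several samples per document would instead have _T = bool and _K = float.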
The metrics module is now split into sub-modules: metric.py, aggregations.py, corpus.py, generation.py, ll.py, reduce.py, stderr.py, utils.py.
For most users: register_metric and register_aggregation still work the same way.
Formats (new feature)
The declarative prompt format system is entirely new — no migration needed, but it's one of the biggest quality-of-life improvements in v0.5. Formats let task authors skip Jinja entirely for common prompt patterns.
Before (v0.4) — manual Jinja for a standard MCQA task:
output_type: multiple_choice
doc_to_text: "Question: {{question}}\n{% for letter, choice in zip(['A','B','C','D'], choices) %}{{letter}}. {{choice}}\n{% endfor %}Answer:"
doc_to_target: "{{['A','B','C','D'][answer]}}"
doc_to_choice: "{{choices}}"
After (v0.5) — same result with a format:
The format auto-generates the Jinja templates, sets output_type, configures delimiters, and wires up scoring.
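For reference, the prompt and target produced by the v0.4 doc_to_text and doc_to_target templates above can be reproduced in a few lines of plain Python. This is only an illustration of the rendered strings an MCQA-style prompt yields. It does not use the format system's own code, and the helper name render_mcqa is hypothetical.

```python
LETTERS = ["A", "B", "C", "D"]


def render_mcqa(doc: dict) -> tuple[str, str]:
    """Return (prompt, target) matching the v0.4 Jinja templates shown above."""
    lines = [f"Question: {doc['question']}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, doc["choices"])]
    prompt = "\n".join(lines) + "\nAnswer:"
    target = LETTERS[doc["answer"]]
    return prompt, target


doc = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
prompt, target = render_mcqa(doc)
print(prompt)
print(target)
```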
Runtime format selection — try different prompt styles without touching YAML:
lm-eval run --tasks my_task@mcqa --model hf --model_args pretrained=gpt2
lm-eval run --tasks my_task@generate --model hf --model_args pretrained=gpt2
lm-eval run --tasks my_task@cloze --model hf --model_args pretrained=gpt2
Built-in formats: mcqa, cloze, generate, cot. See Prompt Formats for the full guide.
Instance types
Before (v0.4)
from lm_eval.api.instance import Instance
# Instance with metadata as a tuple: (task_name, doc_id, repeats)
inst = Instance(
    request_type="loglikelihood",
    doc=doc,
    arguments=("context", "continuation"),
    metadata=(task_name, doc_id, repeats),
)
After (v0.5)
from lm_eval.api.instance import Instance, LLInstance, GenInstance
# Instance is generic with explicit fields
inst = Instance(
    request_type="loglikelihood",
    doc=doc,
    arguments=("context", "continuation"),
    task_name=task_name,
    doc_id=doc_id,
    repeats=repeats,
    target=gold_answer,
)
Key changes:
- Instance[InputT, OutputT] is now generic
- task_name, doc_id, repeats are explicit fields (not packed in a metadata tuple)
- New target field for gold references
- New additional_args field for multimodal support
- metadata is now a dict[str, Any] (not a tuple)
- Type aliases: LLInstance, GenInstance
- Backward compatibility: if you pass a tuple as metadata, __post_init__ unpacks it
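The backward-compatibility behaviour can be pictured with a small stand-in dataclass. This is a sketch and not the harness's implementation: it keeps only a few fields, and what happens to metadata after unpacking (here it becomes an empty dict) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Generic, Optional, TypeVar, Union

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


@dataclass
class Instance(Generic[InputT, OutputT]):
    """Simplified local stand-in; only the fields needed for the illustration."""

    request_type: str
    arguments: InputT
    task_name: Optional[str] = None
    doc_id: Optional[int] = None
    repeats: int = 1
    metadata: Union[dict, tuple] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Legacy call style: metadata=(task_name, doc_id, repeats).
        if isinstance(self.metadata, tuple):
            self.task_name, self.doc_id, self.repeats = self.metadata
            # Assumption: the legacy tuple is replaced by an empty dict.
            self.metadata = {}


legacy = Instance(
    request_type="loglikelihood",
    arguments=("context", "continuation"),
    metadata=("arc_easy", 7, 1),
)
modern = Instance(
    request_type="loglikelihood",
    arguments=("context", "continuation"),
    task_name="arc_easy",
    doc_id=7,
    repeats=1,
)
print(legacy == modern)
```

Under these assumptions both call styles produce equal objects, so existing constructors can be migrated one call site at a time.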
Type aliases
New type aliases in lm_eval.api._types:
| Type | Definition |
|---|---|
| Doc | dict[str, Any] |
| DataSplit | datasets.Dataset \| Sequence[Doc] |
| Dataset | Mapping[str, DataSplit] \| datasets.DatasetDict |
| Context | str \| list[dict[str, str]] |
| LLArgs | tuple[str, str] |
| LLOutput | tuple[float, bool] |
| GenArgs | tuple[Context, GenKwargs] |
| Completion | str |
| Reference | str \| list[str] \| int \| list[int] \| None |
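A short sketch of how such aliases read in user code follows. To stay self-contained it redefines locally only the aliases that need no third-party imports. The helper pick_choice is hypothetical, and the sketch reads the float in LLOutput as a log-likelihood.

```python
from typing import Union

# Local copies of a few aliases from the table above.
Context = Union[str, list[dict[str, str]]]
LLArgs = tuple[str, str]
LLOutput = tuple[float, bool]
Reference = Union[str, list[str], int, list[int], None]


def pick_choice(outputs: list[LLOutput]) -> int:
    """Return the index of the continuation with the highest log-likelihood."""
    return max(range(len(outputs)), key=lambda i: outputs[i][0])


# One multiple-choice question scored as two loglikelihood requests.
requests: list[LLArgs] = [("Q: 2+2=? A:", " 3"), ("Q: 2+2=? A:", " 4")]
outputs: list[LLOutput] = [(-4.2, False), (-0.3, True)]
gold: Reference = 1

predicted = pick_choice(outputs)
print(predicted == gold)
```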
TaskConfig
TaskConfig is now a proper dataclass (in lm_eval.config.task) with typed config dictionaries:
- MetricConfig: TypedDict for metric entries in metric_list
- FilterStep: TypedDict for filter steps in filter_list
- ScorerConfig: TypedDict for scorer configuration
- FilterPipeline: A named pipeline with filters and optional per-pipeline metrics
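As a sketch of what typed config dictionaries give you, the stand-ins below model one metric_list entry and one filter step. The keys are taken from the YAML fields commonly used in v0.4 task files (metric, aggregation, higher_is_better, function). The actual v0.5 TypedDict definitions may include more keys or differ, so treat these classes as illustrative only.

```python
from typing import TypedDict


class MetricConfig(TypedDict, total=False):
    """Illustrative stand-in for one metric_list entry."""

    metric: str
    aggregation: str
    higher_is_better: bool


class FilterStep(TypedDict, total=False):
    """Illustrative stand-in for one step inside a filter pipeline."""

    function: str


# TypedDicts are plain dicts at runtime; the benefit is static checking of keys
# and value types by tools such as mypy or pyright.
metric_list: list[MetricConfig] = [
    {"metric": "acc", "aggregation": "mean", "higher_is_better": True},
]
steps: list[FilterStep] = [{"function": "take_first"}]

print(type(metric_list[0]).__name__, metric_list[0]["metric"], steps[0]["function"])
```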
What still works from v0.4
- YAML task configs are fully backward compatible
- register_metric, register_aggregation, register_filter, register_model decorators work the same
- simple_evaluate() API is unchanged
- --apply_chat_template, --fewshot_as_multiturn, --system_instruction work the same
- All existing task YAMLs in lm_eval/tasks/ continue to work without modification