TaskConfig¶
All YAML fields available when defining an evaluation task. This includes the Jinja2 templating API for prompt construction (doc_to_text, doc_to_target, description, etc.).
TaskConfig
dataclass
¶
TaskConfig(*, task: str, task_alias: str | None = None, formats: str | dict[str, str] | None = None, output_type: _OutputType = 'generate_until', tag: list[str] = list(), custom_dataset: Callable[..., Dataset] | None = None, dataset_path: str | None = None, dataset_name: str | None = None, dataset_kwargs: dict[str, str | int | float] = dict(), training_split: str | None = None, validation_split: str | None = None, test_split: str | None = None, fewshot_split: str | None = None, process_docs: Callable[..., list[dict[str, Any]]] | None = None, description: str = '', doc_to_text: str | Callable[[Doc], str | list[str]] | None = None, doc_to_choice: str | Callable[[Doc], list[str]] | list[str] | None = None, doc_to_target: str | Callable[[Doc], str | int | list[int] | list[str]] | None = None, gen_prefix: str | None = None, doc_to_image: Callable[[Doc], Any] | str | None = None, doc_to_audio: Callable[[Doc], Any] | str | None = None, process_results: Callable[[dict[str, Any], Sequence[LLOutput] | Sequence[Completion]], dict[str, list[Any]]] | None = None, target_delimiter: str = ' ', fewshot_delimiter: str = '\n\n', fewshot_config: dict[str, Any] | FewshotConfig | None = None, num_fewshot: int | None = None, generation_kwargs: GenKwargs = dict(), metric_list: list[MetricConfig] = list(), filter_list: list[FilterPipeline] = list(), scorer: str | ScorerConfig | None = None, repeats: int = 1, unsafe_code: bool = False, use_prompt: str | None = None, multiple_inputs: bool = False, multiple_targets: bool = False, should_decontaminate: bool = False, doc_to_decontamination_query: str | None = None, metadata: dict[str, Any] = dict(), _formats_selection: str | None = None, _qualified_name: str | None = None)
Configuration for a single evaluation task.
Maps 1:1 with the YAML task config files under lm_eval/tasks/.
Every key in a task YAML corresponds to a field here.
Example
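The example that accompanied this heading is not preserved here. As an illustrative sketch only, the dict below mirrors what a minimal multiple-choice task YAML might load into. It uses field names and sample values quoted in the attribute descriptions on this page. The particular combination, the split names other than "train", and the column names are assumptions, not a copy of a shipped task file.

```python
# Hypothetical task config, written as the Python dict a task YAML would load into.
# Keys are TaskConfig fields; values reuse the examples quoted on this page.
task_config = {
    "task": "arc_easy",
    "dataset_path": "allenai/ai2_arc",
    "dataset_name": "ARC-Easy",
    "output_type": "multiple_choice",
    "training_split": "train",
    "validation_split": "validation",  # assumed split name
    "test_split": "test",              # assumed split name
    "doc_to_text": "{{question}}",
    "doc_to_choice": "{{choices.text}}",
    "doc_to_target": "{{answer}}",
    "metric_list": [{"metric": "acc", "higher_is_better": True}],
    "metadata": {"version": 1.0},
}
```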
Attributes¶
task
instance-attribute
¶
Unique task identifier used for registration and CLI selection
(e.g. --tasks arc_easy). Append @<formats> to select a prompt
format at runtime (e.g. "arc_easy@cloze").
task_alias
class-attribute
instance-attribute
¶
Optional display name shown in result tables instead of task.
formats
class-attribute
instance-attribute
¶
Prompt formats to apply. Can be a registered formats name (e.g. "cloze",
"mcq"), or an inline dict of FormatConfig fields. When set, the format's
template overrides doc_to_text, doc_to_target, output_type, etc.
If None and task contains @, the suffix is used as the formats name.
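The fallback from the task name to the formats name can be sketched as below. `resolve_formats` is a hypothetical helper written for illustration and is not part of lm_eval. Whether the library strips the suffix from the stored task name is not stated here, so returning the base name separately is an assumption.

```python
from __future__ import annotations


def resolve_formats(task: str, formats: str | dict | None = None):
    """Hypothetical helper: return (base_task, formats) per the rule above."""
    base, sep, suffix = task.partition("@")
    if formats is None and sep:
        # No explicit formats given: the suffix after "@" names the format.
        formats = suffix
    return base, formats
```

For instance, `resolve_formats("arc_easy@cloze")` yields the base name `"arc_easy"` with formats `"cloze"`, while an explicitly passed `formats` value is left untouched.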
output_type
class-attribute
instance-attribute
¶
The type of model request to construct for each document.
"generate_until": open-ended text generation (default).
"loglikelihood": score the likelihood of a target string.
"loglikelihood_rolling": score a full sequence without a context split.
"multiple_choice": rank answer choices by loglikelihood.
tag
class-attribute
instance-attribute
¶
Tags for categorizing this task. Users can select all tasks sharing
a tag via --tasks <tag_name> (e.g. tag: [math, reasoning] lets
users run --tasks math to include this task).
Distinct from explicit group configs (see GroupConfig).
custom_dataset
class-attribute
instance-attribute
¶
A callable that returns a HuggingFace DatasetDict. Should accept
arbitrary kwargs. Typically set in YAML via !function utils.xxx.
Use this when you need custom loading logic instead of datasets.load_dataset.
At runtime, receives metadata (from this config) and model_args
(if using evaluate) as keyword arguments.
dataset_path
class-attribute
instance-attribute
¶
HuggingFace dataset path passed to datasets.load_dataset().
Can be a Hub identifier (e.g. "allenai/ai2_arc") or a local path.
dataset_name
class-attribute
instance-attribute
¶
HuggingFace dataset config/subset name (e.g. "ARC-Easy").
dataset_kwargs
class-attribute
instance-attribute
¶
Extra keyword arguments forwarded to datasets.load_dataset()
(e.g. {"data_dir": "path/to/data"} or {"data_files": "data.json"}).
training_split
class-attribute
instance-attribute
¶
Name of the training split in the dataset (e.g. "train").
validation_split
class-attribute
instance-attribute
¶
Name of the validation split. Used as the evaluation split when
test_split is not set.
test_split
class-attribute
instance-attribute
¶
Name of the test split. When set, this is the primary split evaluated.
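The split precedence described for validation_split and test_split amounts to a simple fallback. The helper below is a hypothetical illustration of that rule, not the library's own function.

```python
from __future__ import annotations


def pick_eval_split(test_split: str | None, validation_split: str | None) -> str | None:
    """Hypothetical helper: prefer the test split, fall back to validation."""
    return test_split if test_split is not None else validation_split
```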
fewshot_split
class-attribute
instance-attribute
¶
Name of the split from which few-shot examples are drawn. Passed as
the default for fewshot_config.split; overridden if fewshot_config
explicitly sets its own split.
process_docs
class-attribute
instance-attribute
¶
A callable applied to a dataset split before evaluation. Use this
to filter, transform, or resample documents (e.g. renaming columns,
expanding multi-answer rows). Typically set in YAML via !function utils.xxx.
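A minimal sketch of such a callable, assuming the split can be iterated as plain dicts. The column names `query` and `answers` are hypothetical, and in practice the argument may be a HuggingFace dataset object rather than a list.

```python
from __future__ import annotations

from typing import Any, Iterable


def process_docs(docs: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Rename a column and expand multi-answer rows into one row per answer."""
    out = []
    for doc in docs:
        for answer in doc["answers"]:
            # "query" is renamed to "question"; each answer gets its own row.
            out.append({"question": doc["query"], "answer": answer})
    return out
```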
description
class-attribute
instance-attribute
¶
A Jinja2 template or plain string prepended to every prompt. Useful for task-level instructions, e.g. "The following are questions (with answers) about {{subject}}.\n".
When a chat template is applied, this is combined with system_instruction and sent as the system message.
doc_to_text
class-attribute
instance-attribute
¶
Converts a document dict into the prompt text shown to the model.
Can be a Jinja2 template string (e.g. "{{question}}"), a column name,
or a callable. For loglikelihood tasks this is the context preceding
the target.
doc_to_choice
class-attribute
instance-attribute
¶
Defines the set of answer choices for multiple-choice tasks.
Can be a Jinja2 template (e.g. "{{choices.text}}"), a column name,
a static list of strings, or a callable returning a list of strings.
doc_to_target
class-attribute
instance-attribute
¶
The gold-standard target for each document.
Can be a column name, a Jinja2 template, or a callable. For multiple-choice
tasks this is typically the integer index into doc_to_choice (e.g. "{{answer}}").
For generation tasks, it is the expected answer string.
gen_prefix
class-attribute
instance-attribute
¶
A string or Jinja2 template appended after the prompt (and choices, if any) but before
the model generates or the target is scored. With a chat template, this
is appended after the <|assistant|> token; without one it is appended
to the end of the prompt. Useful for answer cues like "The answer is".
doc_to_image
class-attribute
instance-attribute
¶
Extracts an image from the document for multimodal models. Can be a column name or a callable returning image data.
doc_to_audio
class-attribute
instance-attribute
¶
Extracts audio from the document for multimodal models. Can be a column name or a callable returning audio data.
process_results
class-attribute
instance-attribute
¶
process_results: Callable[[dict[str, Any], Sequence[LLOutput] | Sequence[Completion]], dict[str, list[Any]]] | None = None
Custom post-processing of model outputs for metric computation.
Receives (doc, results) and returns a dict mapping metric names to
lists of values. Typically set in YAML via !function utils.xxx.
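As a sketch of the (doc, results) contract, the function below scores each output by exact match. It treats the results as plain strings for simplicity, which is an assumption: the signature above indicates they are LLOutput or Completion objects. The `answer` column is also hypothetical.

```python
from __future__ import annotations

from typing import Any, Sequence


def process_results(doc: dict[str, Any], results: Sequence[str]) -> dict[str, list[Any]]:
    """Score each generated string against the gold answer (hypothetical column)."""
    gold = str(doc["answer"]).strip()
    # One value per result, keyed by the metric name it belongs to.
    return {"exact_match": [float(r.strip() == gold) for r in results]}
```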
target_delimiter
class-attribute
instance-attribute
¶
String inserted between the input (prompt/choices) and the target output for each example (both few-shot and the test document).
fewshot_delimiter
class-attribute
instance-attribute
¶
String inserted between consecutive few-shot examples.
Also used as the default stop sequence (until) for generation.
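How the two delimiters combine can be sketched as follows. `build_prompt` is a hypothetical, simplified assembly: it leaves out description, gen_prefix, and chat templates, and the real prompt construction may differ in detail.

```python
def build_prompt(fewshot, test_text, target_delimiter=" ", fewshot_delimiter="\n\n"):
    """Hypothetical assembly; fewshot is a list of (text, target) pairs."""
    # Each few-shot example is its input, the target delimiter, then its target.
    shots = [text + target_delimiter + target for text, target in fewshot]
    # Examples and the final test input are separated by the few-shot delimiter.
    return fewshot_delimiter.join(shots + [test_text])
```

With the default delimiters, one example `("Q: 1+1?", "2")` followed by the test input `"Q: 2+2?"` produces the example and its target on one line, a blank line, then the test input.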
fewshot_config
class-attribute
instance-attribute
¶
fewshot_config: dict[str, Any] | FewshotConfig | None = None
Advanced few-shot configuration. Accepts a dict or FewshotConfig
to override how few-shot examples are sampled and formatted (e.g.
separate doc_to_text for examples, custom sampler, fixed indices).
When None or a dict, it is converted to FewshotConfig in __post_init__.
num_fewshot
class-attribute
instance-attribute
¶
Number of few-shot examples to prepend to each prompt. When None,
the value is determined at runtime (typically by CLI --num_fewshot).
generation_kwargs
class-attribute
instance-attribute
¶
Keyword arguments for text generation (e.g. temperature, until,
max_gen_toks, do_sample). Only relevant when output_type
is "generate_until". If empty, greedy defaults are applied.
metric_list
class-attribute
instance-attribute
¶
metric_list: list[MetricConfig] = field(default_factory=list)
List of metrics to compute on model outputs. Each entry specifies
a metric name, optional aggregation function, and whether higher is
better (e.g. [{"metric": "exact_match", "higher_is_better": true}]).
filter_list
class-attribute
instance-attribute
¶
filter_list: list[FilterPipeline] = field(default_factory=list)
List of named filter pipelines applied to model outputs before scoring.
Each pipeline is a sequence of filter steps (e.g. regex extraction,
stripping) and can carry its own metric_list. Pipelines run
independently on the same model outputs, allowing multiple scoring
strategies from a single evaluation run (e.g. "strict-match"
and "maj@64" on GSM8k).
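The idea of independent pipelines over the same outputs can be illustrated with a small stand-alone sketch. The step functions, the pipeline names, and the "[no match]" placeholder are all hypothetical; they imitate regex extraction and majority voting without using the library's registered filters.

```python
import re
from collections import Counter


def regex_step(outputs, pattern):
    """Keep the first regex capture from each output, or a placeholder."""
    found = [re.search(pattern, o) for o in outputs]
    return [m.group(1) if m else "[no match]" for m in found]


def majority_step(outputs):
    """Collapse a list of answers to the single most common one."""
    return [Counter(outputs).most_common(1)[0][0]]


# Each pipeline is an ordered list of steps; a step's output feeds the next.
pipelines = {
    "strict-match": [lambda o: regex_step(o, r"The answer is (\d+)")],
    "majority": [lambda o: regex_step(o, r"(\d+)"), majority_step],
}


def run_pipelines(outputs):
    results = {}
    for name, steps in pipelines.items():
        current = list(outputs)  # every pipeline starts from the same raw outputs
        for step in steps:
            current = step(current)
        results[name] = current
    return results
```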
scorer
class-attribute
instance-attribute
¶
scorer: str | ScorerConfig | None = None
A registered scorer name or inline scorer config. When set, scoring is delegated to this scorer instead of the default metric pipeline.
Accepts a string (e.g. "first_token") which is normalised to
{"type": "first_token"} in __post_init__, or a full
ScorerConfig dict with extra kwargs forwarded to the scorer
constructor.
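The string-to-dict normalisation can be sketched as below. `normalise_scorer` is a hypothetical stand-in for the behaviour described for __post_init__, not the library's code.

```python
def normalise_scorer(scorer):
    """Hypothetical stand-in for the normalisation described above."""
    if scorer is None:
        return None
    if isinstance(scorer, str):
        # A bare name becomes a config dict with only the type set.
        return {"type": scorer}
    return dict(scorer)
```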
repeats
class-attribute
instance-attribute
¶
Number of times to repeat each instance. Only used for generation tasks. Useful for sampling diversity (e.g. pass@k, self-consistency).
unsafe_code
class-attribute
instance-attribute
¶
Set to True to enable execution of untrusted code (e.g. for code-execution benchmarks). Must be explicitly opted in.
use_prompt
class-attribute
instance-attribute
¶
Name of a registered prompt template to apply (e.g.
"promptsource:GPT-3 Style"). When set, overrides doc_to_text,
doc_to_target, and doc_to_choice.
multiple_inputs
class-attribute
instance-attribute
¶
Only for multiple_choice tasks. When True, doc_to_text returns
a list of strings (one per choice) and doc_to_choice returns a one-element list.
Each choice produces a different context scored via
loglikelihood (e.g. Winogrande, where each option fills a blank).
multiple_targets
class-attribute
instance-attribute
¶
When True, doc_to_target may return a list of acceptable answers.
Scoring considers any match a success.
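A minimal sketch of "any match is a success", assuming plain string equality after stripping whitespace. The comparison actually used depends on the configured metric, so `any_match` is illustrative only.

```python
def any_match(prediction, targets):
    """True if the prediction equals any acceptable answer."""
    if isinstance(targets, str):
        targets = [targets]  # a single target is treated as a one-item list
    return any(prediction.strip() == t.strip() for t in targets)
```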
should_decontaminate
class-attribute
instance-attribute
¶
Whether to run decontamination checks against training data.
doc_to_decontamination_query
class-attribute
instance-attribute
¶
Jinja2 template or callable that extracts the decontamination query
string from a document. Used when should_decontaminate is True.
Falls back to doc_to_text if left as None.
metadata
class-attribute
instance-attribute
¶
Metadata dict stored alongside results. Most tasks should include a
version key. The num_fewshot key overrides the displayed n-shot
column in result tables. Also passed to custom_dataset at runtime as keyword arguments.
Functions¶
__post_init__
¶
to_dict
¶
Returns the current config as a printable dictionary.
Fields that are None are omitted. Used for dumping results alongside the full task configuration.
Returns: a printable dictionary version of the TaskConfig object.
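The behaviour of dropping unset fields can be illustrated with a small stand-in dataclass. `MiniConfig` is hypothetical and carries only a few TaskConfig-like fields; the real method may apply further conversions that are not shown on this page.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class MiniConfig:
    """Hypothetical stand-in with a few TaskConfig-like fields."""

    task: str
    task_alias: str | None = None
    repeats: int = 1
    metadata: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Drop fields that are None so the dump only shows what is set.
        return {k: v for k, v in asdict(self).items() if v is not None}
```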
Source code in lm_eval/config/task.py
FewshotConfig
dataclass
¶
FewshotConfig(sampler: str = 'default', split: str | None = None, process_docs: Callable[..., list[dict[str, Any]]] | None = None, fewshot_indices: list[int] | None = None, samples: list[dict[str, Any]] | Callable[[], list[dict[str, Any]]] | None = None, doc_to_text: str | Callable[[Doc], str] | None = None, doc_to_choice: str | Callable[[Doc], list[str]] | list[str] | None = None, doc_to_target: str | Callable[[Doc], str | int] | None = None, gen_prefix: str | None = None, fewshot_delimiter: str | None = None, target_delimiter: str | None = None)
Configuration for few-shot example formatting.
These fields override the parent TaskConfig fields when formatting few-shot examples (as opposed to the test example).
Note: num_fewshot is runtime-dependent, so it is not included here.
Attributes¶
sampler
class-attribute
instance-attribute
¶
Sampling strategy for selecting few-shot examples (e.g. "default",
"first_n"). "default" samples randomly.
split
class-attribute
instance-attribute
¶
Dataset split to draw few-shot examples from. Inherited from
TaskConfig.fewshot_split if not set directly. Takes precedence
over samples when both are provided.
process_docs
class-attribute
instance-attribute
¶
Optional callable to transform the few-shot split before sampling.
Inherited from TaskConfig.process_docs if not set.
fewshot_indices
class-attribute
instance-attribute
¶
Explicit list of document indices to use as few-shot examples. When set, overrides random sampling with a fixed selection.
samples
class-attribute
instance-attribute
¶
Hardcoded few-shot examples as a list of dicts, or a callable
returning such a list. Used when examples don't come from a dataset
split. Ignored if split is also set.
doc_to_text
class-attribute
instance-attribute
¶
Override doc_to_text for formatting few-shot examples differently
from the test example. Inherited from TaskConfig.doc_to_text.
doc_to_choice
class-attribute
instance-attribute
¶
Override doc_to_choice for few-shot examples.
Inherited from TaskConfig.doc_to_choice.
doc_to_target
class-attribute
instance-attribute
¶
Override doc_to_target for few-shot examples.
Inherited from TaskConfig.doc_to_target.
gen_prefix
class-attribute
instance-attribute
¶
Override gen_prefix for few-shot examples.
Inherited from TaskConfig.gen_prefix.
fewshot_delimiter
class-attribute
instance-attribute
¶
Override the delimiter between few-shot examples.
Inherited from TaskConfig.fewshot_delimiter.
target_delimiter
class-attribute
instance-attribute
¶
Override the delimiter between prompt and target in few-shot examples.
Inherited from TaskConfig.target_delimiter.
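The inheritance rules listed for the attributes above can be summarised in a short sketch. `resolve_fewshot` is a hypothetical helper operating on plain dicts; the library performs this resolution on the dataclasses themselves, and its exact mechanics are not shown here.

```python
from __future__ import annotations

INHERITED = (
    "process_docs", "doc_to_text", "doc_to_choice", "doc_to_target",
    "gen_prefix", "fewshot_delimiter", "target_delimiter",
)


def resolve_fewshot(task_cfg: dict, fewshot_cfg: dict | None) -> dict:
    """Hypothetical merge: few-shot values win, unset ones come from the task."""
    resolved = dict(fewshot_cfg or {})
    for key in INHERITED:
        if resolved.get(key) is None:
            resolved[key] = task_cfg.get(key)
    # split inherits from the differently named TaskConfig.fewshot_split.
    if resolved.get("split") is None:
        resolved["split"] = task_cfg.get("fewshot_split")
    return resolved
```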
Functions¶
__post_init__
¶
get_docs
¶
from_dict
classmethod
¶
from_dict(cfg: Mapping[str, Any], *, fewshot_split: str | None = None, process_docs: Callable[[Iterable[dict[str, Any]]], list[Doc]] | None = None, fewshot_delimiter: str | None = None, target_delimiter: str | None = None, gen_prefix: str | None = None, doc_to_text: str | Callable[[Doc], str | list[str]] | None = None, doc_to_choice: str | Callable[[Doc], list[str]] | list[str] | None = None, doc_to_target: str | Callable[[Doc], str | int | list[int] | list[str]] | None = None, **overloads) -> FewshotConfig
MetricConfig
¶
Bases: TypedDict
Configuration for a single metric in metric_list.
Example
Attributes¶
metric
instance-attribute
¶
metric: Required[str | MetricFn]
Name of a registered metric (e.g. "acc", "exact_match",
"bleu") or a callable. See lm_eval/api/metrics.py for
built-in metrics.
aggregation
instance-attribute
¶
aggregation: str | AggregationFn | None
How per-document metric values are combined into a single score
(e.g. "mean", "median"). Can be a registered aggregation or a callable.
Defaults to the metric's registered aggregation if not set.
reduction
instance-attribute
¶
reduction: str | ReductionFn | None
How per-instance repeated values are reduced before aggregation
(e.g. when repeats > 1). Can be a registered reduction or a callable.
Defaults to the metric's registered reduction if not set.
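The order of the two stages, reduction within a document and then aggregation across documents, can be sketched as below. `score_task` is hypothetical, and `max` and `mean` are placeholders chosen for illustration; the defaults registered for any given metric are not stated on this page.

```python
from statistics import mean


def score_task(per_doc_repeats, reduction=max, aggregation=mean):
    """Reduce each document's repeated scores, then aggregate across documents."""
    # One inner list per document, holding one score per repeat.
    reduced = [reduction(values) for values in per_doc_repeats]
    return aggregation(reduced)
```

For two documents scored `[0.0, 1.0]` and `[0.0, 0.0]` over two repeats, taking the maximum per document gives `[1.0, 0.0]`, and the mean of those is `0.5`.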
higher_is_better
instance-attribute
¶
Whether a higher metric value indicates better performance. Defaults to the metric's registered value if not set.
kwargs
instance-attribute
¶
Extra keyword arguments forwarded to the metric function
(e.g. {"ignore_case": true} for exact_match). Extraneous fields are also
treated as kwargs.
FilterPipeline
¶
Bases: TypedDict
A named filter pipeline with optional per-pipeline metrics.
Mirrors each entry of filter_list in YAML task configs.
Example
Attributes¶
name
instance-attribute
¶
Identifier for this pipeline, used as a prefix in result keys
(e.g. "strict-match", "maj@64").
filter
instance-attribute
¶
filter: Required[list[FilterStep]]
Ordered sequence of filter steps applied to model outputs. Steps run in order; each step's output feeds into the next.
metric_list
instance-attribute
¶
metric_list: list[MetricConfig]
Optional per-pipeline metrics. When set, these metrics are scored
only on this pipeline's filtered outputs, instead of the task-level
metric_list.
FilterStep
¶
Bases: TypedDict
A single filter step in a pipeline.
Example
Attributes¶
function
instance-attribute
¶
Name of a registered filter (e.g. "regex", "custom",
"majority_vote"). Custom filters can be registered with
@register_filter.
kwargs
instance-attribute
¶
Keyword arguments passed to the filter function
(e.g. {"regex_pattern": "The answer is (\d+)"} for "regex",
or {"filter_fn": "!function utils.my_filter"} for "custom").
ScorerConfig
¶
Bases: TypedDict
Configuration for a registered scorer.
A scorer encapsulates the full filter → score → reduce → aggregate
pipeline. When scorer is set on a task, scoring is delegated to
the registered scorer class instead of the default metric pipeline.
Can be specified as a plain string (just the scorer name) or as a
dict with type and optional kwargs forwarded to the scorer
constructor.
Example
See lm_eval/scorers/ for built-in scorers. Custom scorers can
be registered with @register_scorer.
Attributes¶
type
instance-attribute
¶
Name of a registered scorer (e.g. "first_token", "regex",
"choice_match"). Resolved via lm_eval.api.registry.get_scorer.
kwargs
instance-attribute
¶
Extra keyword arguments forwarded to the scorer constructor.
Scorer subclasses declare these as dataclass fields
(e.g. judge_model on an AIJudgeScorer).