Task Configuration Reference

Complete reference for all TaskConfig fields. For a tutorial introduction, see Your First Task.

Task naming and registration

Field	Type	Default	Description
`task`	str	required	Task name. Must be unique. Used to invoke the task from CLI.
`task_alias`	str	`null`	Display name for the results table.
`tag`	str or list	`null`	Tag(s) for categorization. Enables batch selection via `--tasks tag_name`.

Dataset configuration

Field	Type	Default	Description
`dataset_path`	str	required	Hugging Face Hub dataset name, or a local path.
`dataset_name`	str	`null`	Dataset configuration name (the second argument to `datasets.load_dataset()`).
`dataset_kwargs`	dict	`null`	Extra kwargs passed to `datasets.load_dataset()` (e.g., `data_files`, `data_dir`).
`custom_dataset`	Callable	`null`	Function returning `dict[str, Dataset]`. Receives `metadata` and `model_args` at runtime.
`training_split`	str	`null`	Name of the training split.
`validation_split`	str	`null`	Name of the validation split.
`test_split`	str	`null`	Name of the test split (primary evaluation split).
`fewshot_split`	str	`null`	Split to draw few-shot examples from.
`process_docs`	Callable	`null`	Function to preprocess each dataset split before prompting. Use `!function utils.my_fn`.

Using local datasets

# JSON files
dataset_path: json
dataset_kwargs:
  data_files: /path/to/my/data.json

# Pre-split Arrow files
dataset_path: arrow
dataset_kwargs:
  data_files:
    train: /path/to/train/data.arrow
    validation: /path/to/validation/data.arrow

# Previously downloaded HF dataset (via save_to_disk)
dataset_path: hellaswag
dataset_kwargs:
  data_dir: hellaswag_local/

You can also set the `LM_EVAL_DATASET_DIR` environment variable as a fallback directory for local datasets.

Prompting and in-context formatting

Field	Type	Default	Description
`doc_to_text`	str or Callable	`null`	Jinja2 template, dataset column name, or `!function` producing the input prompt.
`doc_to_target`	str or Callable	`null`	Jinja2 template, column name, integer index, or `!function` producing the target.
`doc_to_choice`	str or Callable	`null`	Jinja2 template, column name, list, or `!function` producing answer choices (for `multiple_choice`).
`description`	str	`null`	Jinja2 template or string prepended before few-shot examples.
`use_prompt`	str	`null`	Promptsource template name (e.g., `"promptsource:GPT-3 Style"`). Overrides `doc_to_text`/`doc_to_target`/`doc_to_choice`.
`formats`	str or dict	`null`	Prompt format name or config. See Prompt Formats.
`target_delimiter`	str	`" "`	String inserted between `doc_to_text` and `doc_to_target`.
`fewshot_delimiter`	str	`"\n\n"`	String inserted between few-shot examples.
`gen_prefix`	str	`null`	String appended after the assistant token, or to the end of the prompt when no chat template is used.
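
When a Jinja2 template is not expressive enough, the `doc_to_*` fields can point at Python functions instead. The following is a minimal sketch, assuming each function receives a single document as a dict. The column names (`question`, `options`, `label`) are hypothetical and would need to match your dataset.

```python
# utils.py (hypothetical); each function takes one dataset row as a dict.

def doc_to_text(doc: dict) -> str:
    """Build the input prompt for one document."""
    return f"Question: {doc['question']}\nAnswer:"


def doc_to_choice(doc: dict) -> list[str]:
    """Return the answer choices (relevant for output_type: multiple_choice)."""
    return list(doc["options"])


def doc_to_target(doc: dict) -> int:
    """Return the index of the correct choice."""
    return int(doc["label"])
```

Each function would be referenced from the YAML as, for example, `doc_to_text: !function utils.doc_to_text`.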

Few-shot configuration

Field	Type	Default	Description
`num_fewshot`	int	`0`	Number of few-shot examples.
`fewshot_config`	dict	`null`	Advanced few-shot settings (see below).

`fewshot_config` fields (all optional; any field left unset inherits its value from the parent TaskConfig):

Field	Description
`sampler`	`"default"` (random) or `"first_n"`
`split`	Dataset split for examples (overrides `fewshot_split`)
`samples`	Hardcoded list of example dicts
`doc_to_text`	Override formatting for few-shot examples
`doc_to_target`	Override target formatting for few-shot examples
`doc_to_choice`	Override choices for few-shot examples
`gen_prefix`	Prefix for assistant responses in few-shot examples
`fewshot_delimiter`	Override delimiter between examples
`target_delimiter`	Override delimiter between question and answer
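
To see how `description`, `target_delimiter`, and `fewshot_delimiter` interact, the simplified sketch below assembles a few-shot prompt from already-rendered strings. It is an illustration only, not the harness's implementation; `build_prompt` and its arguments are hypothetical.

```python
def build_prompt(description, shots, query,
                 target_delimiter=" ", fewshot_delimiter="\n\n"):
    """Join (text, target) few-shot pairs and append the unanswered query.

    `shots` is a list of (text, target) string pairs; `query` is the
    rendered doc_to_text of the document being evaluated.
    """
    examples = [text + target_delimiter + target for text, target in shots]
    return (description or "") + fewshot_delimiter.join(examples + [query])


prompt = build_prompt(
    description="Answer the question.\n\n",
    shots=[("Q: 1+1?\nA:", "2"), ("Q: 2+2?\nA:", "4")],
    query="Q: 3+3?\nA:",
)
print(prompt)
```

The description comes first, each few-shot example is its text and target joined by `target_delimiter`, and `fewshot_delimiter` separates the examples from each other and from the final query.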

Scoring

Field	Type	Default	Description
`output_type`	str	`"generate_until"`	Model request type: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, or `multiple_choice`.
`metric_list`	list	`null`	List of `MetricConfig` entries. See Scoring & Metrics.
`filter_list`	list	`null`	List of filter pipelines. See Filters.
`scorer`	dict	`null`	Scorer configuration. See Scoring & Metrics.
`generation_kwargs`	dict	`null`	Generation arguments (e.g., `temperature`, `max_gen_toks`, `until`). Note: the CLI flag is `--gen_kwargs` but the YAML field is `generation_kwargs`.
`repeats`	int	`1`	Number of times each sample is run. Used for self-consistency or other sampling-based evaluation.
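
With `repeats` greater than 1, each sample yields several generations that have to be reduced to a single answer before scoring. One common reduction for self-consistency is a majority vote. The sketch below shows the idea in plain Python; the function is hypothetical and is not the harness's filter code.

```python
from collections import Counter


def majority_vote(responses: list[str]) -> str:
    """Return the most frequent response; ties go to the earliest seen."""
    counts = Counter(responses)
    return counts.most_common(1)[0][0]


# Five sampled answers for one document (as with repeats: 5)
answers = ["18", "20", "18", "18", "17"]
print(majority_vote(answers))
```

In practice the extracted answers, not the raw generations, would be voted on, which is why self-consistency setups usually pair `repeats` with an answer-extraction step in `filter_list`.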

Other

Field	Type	Default	Description
`batch_size`	int	`1`	Batch size for this task.
`should_decontaminate`	bool	`false`	Whether to perform test set decontamination.
`doc_to_decontamination_query`	str	`null`	Query for decontamination (defaults to `doc_to_text` if not set).
`metadata`	dict	`null`	Arbitrary metadata. Special keys: `version` (task version), `num_fewshot` (override displayed n-shot). Also passed to `custom_dataset` if defined.

YAML features

`include`: template inheritance

Base your YAML on another file:

include: _default_template.yaml
task: mmlu_anatomy
dataset_name: anatomy

The included file provides shared fields, and your file overrides specific values. The include path is relative to the including file's directory unless an absolute path is given.
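
The override and path-resolution rules can be sketched in a few lines of Python. This is an illustration rather than the harness's loader: `resolve_include` and `merge_config` are hypothetical names, the configs are shown as already-parsed dicts, the `dataset_path` value is a placeholder, and a shallow (top-level) merge is assumed.

```python
from pathlib import Path


def resolve_include(including_file: str, include: str) -> Path:
    """Resolve an include path relative to the including file's directory."""
    path = Path(include)
    return path if path.is_absolute() else Path(including_file).parent / path


def merge_config(base: dict, child: dict) -> dict:
    """Shallow merge: fields in the including file override the included ones."""
    merged = {**base, **child}
    merged.pop("include", None)  # the directive itself is not a task field
    return merged


# Contents of the included template (placeholder values)
base = {"dataset_path": "my_org/my_dataset", "output_type": "multiple_choice"}
# Contents of the including file
child = {"include": "_default_template.yaml", "task": "mmlu_anatomy",
         "dataset_name": "anatomy"}
print(merge_config(base, child))
```

The merged result keeps `dataset_path` and `output_type` from the template and takes `task` and `dataset_name` from the including file.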

`!function`: embedded Python

Reference Python functions in your task directory:

doc_to_text: !function utils.my_doc_to_text
process_docs: !function utils.process_docs
metric_list:
  - metric: !function utils.my_metric
    aggregation: !function utils.my_aggregation

The Python file that defines the function must be in the same directory as the YAML file. Supported for: `doc_to_text`, `doc_to_target`, `doc_to_choice`, `process_docs`, and the `metric` and `aggregation` entries of `metric_list`.
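
To make the same-directory rule concrete, here is a rough sketch of how a reference such as `utils.my_doc_to_text` could be resolved: split it into a module name and a function name, then load `<module>.py` from the YAML file's directory. `load_function` is a hypothetical helper written for this illustration, not the harness's actual loader, and the temporary `utils.py` exists only for the demo.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_function(yaml_path: str, reference: str):
    """Load `module.function` from a .py file beside the YAML file."""
    module_name, function_name = reference.rsplit(".", 1)
    module_file = Path(yaml_path).parent / f"{module_name}.py"
    spec = importlib.util.spec_from_file_location(module_name, module_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script to define its functions
    return getattr(module, function_name)


# Demo: create a throwaway task directory containing utils.py
with tempfile.TemporaryDirectory() as task_dir:
    (Path(task_dir) / "utils.py").write_text(
        "def my_doc_to_text(doc):\n    return doc['question']\n"
    )
    fn = load_function(str(Path(task_dir) / "task.yaml"), "utils.my_doc_to_text")
    print(fn({"question": "What is 2 + 2?"}))
```

Because the module is looked up beside the YAML file, moving the YAML without its `utils.py` would make the lookup fail in this sketch.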

Python class-based tasks

For tasks that require full Python control:

task: squadv2
class: !function task.SQuAD2

Custom arguments can be passed to the class:

task: my_task
class: !function task.MyTaskClass
recipe: card=cards.my_card,template=templates.my_template

Versioning

metadata:
  version: 1.0

Bump the version for any breaking change to the task config. Document changes in the task's README:

- [Mar 2026] (PR #1234) Version 1.0 -> 2.0: Updated prompt format for consistency.

Good reference tasks

Type	Example
Multiple choice	`lm_eval/tasks/sciq/sciq.yaml`
Corpus perplexity	`lm_eval/tasks/wikitext/wikitext.yaml`
Generative	`lm_eval/tasks/gsm8k/gsm8k.yaml`
Complex filtering	`lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`
Using formats	Any task with `formats: mcqa`