Groups & Benchmarks

Groups organize related tasks into named collections that can report aggregate metrics. They suit multi-task benchmarks such as MMLU (57 subtasks) or SuperGLUE.

Tags vs. Groups

| Feature | Tags | Groups |
| --- | --- | --- |
| Purpose | Categorize tasks for batch selection | Organize tasks with aggregate scoring |
| Metrics | No aggregate metrics | Can define `aggregate_metric_list` |
| Display | Tasks listed individually | Group appears as a row with subtasks underneath |
| Nesting | Flat (tags can't contain tags) | Hierarchical (groups can contain groups) |
| Config | Set in task YAML: `tag: [reasoning]` | Separate group YAML or inline |

Use tags when you just want to run a set of tasks together. Use groups when you need aggregate scores or hierarchical reporting.

Basic group config

Create a YAML file with `group` and `task` keys:

```yaml
# lm_eval/tasks/nli/_nli.yaml
group: nli_tasks
task:
  - cb
  - anli_r1
  - rte
metadata:
  version: 1.0
```

This creates a group named `nli_tasks` containing three tasks. Running `lm-eval run --tasks nli_tasks` evaluates all three and shows them under a group header in the results table.

Aggregate metrics

To report a single score for the group across all subtasks, add `aggregate_metric_list`:

```yaml
group: nli_tasks
task:
  - cb
  - anli_r1
  - rte
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true   # micro-average (default)
metadata:
  version: 1.0
```

`aggregate_metric_list` fields

| Field | Type | Description |
| --- | --- | --- |
| `metric` | str | Name of the metric to aggregate (all subtasks must report this metric) |
| `aggregation` | str | Aggregation function (currently only `mean` is supported) |
| `weight_by_size` | bool | `true` for micro-average (weight by number of docs per subtask), `false` for macro-average (equal weight per subtask). Default: `true` |
| `filter_list` | str or list | Which filter pipeline(s) to match when aggregating (e.g., `"strict-match"`). Default: `"none"` |

Tip

Micro vs. macro averaging: MMLU uses micro-averaging. If one subject has 200 questions and another has 50, the larger subject contributes more to the final score, which is equivalent to concatenating all subtasks into one dataset. Set `weight_by_size: false` for macro-averaging (equal weight per subtask).
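The difference between the two settings is easiest to see with numbers. The sketch below is a plain-Python illustration of the arithmetic, not the library's implementation; `aggregate_mean` and the two subtask names and scores are hypothetical and chosen only to match the 200-versus-50 example above.

```python
from typing import Mapping, Tuple


def aggregate_mean(
    results: Mapping[str, Tuple[float, int]],
    weight_by_size: bool = True,
) -> float:
    """Mean of per-subtask scores.

    `results` maps a subtask name to (score, number of docs).
    weight_by_size=True gives the micro-average, False the macro-average.
    """
    if not results:
        raise ValueError("no subtask results to aggregate")
    if weight_by_size:
        total_docs = sum(n for _, n in results.values())
        return sum(score * n for score, n in results.values()) / total_docs
    return sum(score for score, _ in results.values()) / len(results)


# Hypothetical accuracies, made up for illustration only.
subtask_results = {
    "subject_a": (0.80, 200),  # 200 questions
    "subject_b": (0.50, 50),   # 50 questions
}

micro = aggregate_mean(subtask_results, weight_by_size=True)
macro = aggregate_mean(subtask_results, weight_by_size=False)
print(f"micro={micro:.3f} macro={macro:.3f}")  # prints "micro=0.740 macro=0.650"
```

The micro-average (0.74) sits closer to the larger subject's score, while the macro-average (0.65) treats both subjects equally.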

Overriding subtask config

Apply config overrides to subtasks within a group:

```yaml
group: my_benchmark
task:
  - task: mmlu
    num_fewshot: 5          # Override few-shot count for all MMLU subtasks
  - task: hellaswag
    num_fewshot: 0
```

When the subtask is itself a group (like `mmlu`), the override propagates to all its children.
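The propagation rule can be pictured as a recursive expansion. The following is a minimal sketch under stated assumptions: `GROUPS` is a hypothetical in-memory registry listing only two of MMLU's subtasks, and `expand_with_overrides` is an illustrative helper, not a function from the library.

```python
from typing import Dict, List

# Hypothetical registry: group name -> children. Only two MMLU subtasks
# are listed to keep the sketch short.
GROUPS: Dict[str, List[str]] = {
    "mmlu": ["mmlu_anatomy", "mmlu_abstract_algebra"],
}


def expand_with_overrides(entry: dict) -> Dict[str, dict]:
    """Return {leaf_task_name: overrides} for one entry of a group's task list."""
    name = entry["task"]
    overrides = {k: v for k, v in entry.items() if k != "task"}
    if name not in GROUPS:
        # A plain task: the overrides apply to it directly.
        return {name: overrides}
    expanded: Dict[str, dict] = {}
    for child in GROUPS[name]:
        # A group: every child inherits the overrides set on the group entry.
        expanded.update(expand_with_overrides({"task": child, **overrides}))
    return expanded


entries = [
    {"task": "mmlu", "num_fewshot": 5},
    {"task": "hellaswag", "num_fewshot": 0},
]

resolved: Dict[str, dict] = {}
for entry in entries:
    resolved.update(expand_with_overrides(entry))

for task_name, overrides in resolved.items():
    print(task_name, overrides)
```

In this sketch both MMLU subtasks end up with `num_fewshot: 5`, while `hellaswag` keeps its own `num_fewshot: 0`.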

Inline subtask definitions

Define new subtasks or subgroups directly inside the group config:

```yaml
group: nli_and_mmlu
task:
  - group: nli_tasks
    task:
      - cb
      - anli_r1
      - rte
    aggregate_metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true

  - task: mmlu
    num_fewshot: 2
```

This creates a nested structure: `nli_and_mmlu` contains the inline `nli_tasks` group and the existing `mmlu` group.
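To make the resulting hierarchy concrete, the sketch below walks the same config, written as the dict a YAML parser would plausibly produce, and prints it as an indented tree. The `walk` helper is hypothetical and only illustrates the nesting; `mmlu` is left unexpanded because its children are defined elsewhere.

```python
from typing import Iterator, Tuple, Union

Entry = Union[str, dict]

# The nested config from above as a Python dict
# (aggregate_metric_list omitted for brevity).
config = {
    "group": "nli_and_mmlu",
    "task": [
        {"group": "nli_tasks", "task": ["cb", "anli_r1", "rte"]},
        {"task": "mmlu", "num_fewshot": 2},
    ],
}


def walk(entry: Entry, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) for every node in a nested group config."""
    if isinstance(entry, str):
        # A bare task name.
        yield depth, entry
    elif "group" in entry:
        # An inline group: emit it, then recurse into its task list.
        yield depth, entry["group"]
        for child in entry["task"]:
            yield from walk(child, depth + 1)
    else:
        # A reference to an existing task or group, with overrides.
        yield depth, entry["task"]


tree = list(walk(config))
for depth, name in tree:
    print("  " * depth + name)
```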

Python class-based subtasks

For tasks implemented as Python classes, use `!function`:

```yaml
group: scrolls
task:
  - task: scrolls_qasper
    class: !function task.Qasper
  - task: scrolls_quality
    class: !function task.QuALITY
  - task: scrolls_narrativeqa
    class: !function task.NarrativeQA
```

The `::` path syntax

Navigate nested groups using `::` on the CLI or in Python:

```bash
# Run only the anatomy subtask from MMLU
lm-eval run --tasks mmlu::mmlu_anatomy --model hf --model_args pretrained=gpt2

# Navigate deeper nesting
lm-eval run --tasks my_benchmark::nli_tasks::cb --model hf --model_args pretrained=gpt2
```

In Python:

```python
from lm_eval.tasks import TaskManager

tm = TaskManager()
loaded = tm.load(["mmlu::mmlu_anatomy"])
```
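One way to read a `::` path is as a walk down the group hierarchy, one component per level. The sketch below illustrates that reading with a hypothetical in-memory `TREE` and `resolve` helper; it makes no claim about how `TaskManager` resolves paths internally.

```python
from typing import List

# Hypothetical hierarchy: dicts are groups, None marks a leaf task.
TREE = {
    "my_benchmark": {
        "nli_tasks": {"cb": None, "anli_r1": None, "rte": None},
    },
    "mmlu": {"mmlu_anatomy": None},
}


def resolve(path: str, tree: dict = TREE) -> List[str]:
    """Split a `::` path and check that each component is a child of the previous one."""
    parts = path.split("::")
    node = tree
    for i, part in enumerate(parts):
        if not isinstance(node, dict) or part not in node:
            parent = "::".join(parts[:i]) or "<root>"
            raise KeyError(f"{part!r} not found under {parent}")
        node = node[part]
    return parts


print(resolve("mmlu::mmlu_anatomy"))
print(resolve("my_benchmark::nli_tasks::cb"))
```

A path whose components are not nested in that order, such as `mmlu::cb` in this hypothetical tree, raises `KeyError` instead of silently matching a task elsewhere.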

Formats in groups

Apply prompt formats to tasks within groups with the `@` suffix on the task name (`subtask_a@mcqa` below) or with a `formats` key (`subtask_b` below):

```yaml
group: my_benchmark
task:
  - task: subtask_a@mcqa
    dataset_path: ...
    doc_to_text: question
    doc_to_target: answer
    doc_to_choice: choices

  - task: subtask_b
    formats: generate
    dataset_path: ...
    doc_to_text: question
    doc_to_target: answer
    doc_to_choice: choices
```

Display names

Use `group_alias` and `task_alias` to provide cleaner display names:

```yaml
group: mmlu
group_alias: "MMLU"
task:
  - task: mmlu_abstract_algebra
    task_alias: "Abstract Algebra"
  - task: mmlu_anatomy
    task_alias: "Anatomy"
```

This keeps the registered names unique while showing cleaner names in the results table.
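The split between registered names and display names can be sketched as a simple fallback: show the alias when one is set, otherwise the registered name. The `display_rows` helper and the `" - "` row prefix below are hypothetical and do not reproduce the actual results-table layout.

```python
from typing import List

# The config from above as a Python dict.
config = {
    "group": "mmlu",
    "group_alias": "MMLU",
    "task": [
        {"task": "mmlu_abstract_algebra", "task_alias": "Abstract Algebra"},
        {"task": "mmlu_anatomy", "task_alias": "Anatomy"},
    ],
}


def display_rows(cfg: dict) -> List[str]:
    """Return display labels: the group first, then one row per subtask."""
    rows = [cfg.get("group_alias", cfg["group"])]
    for entry in cfg["task"]:
        if isinstance(entry, str):
            # Bare task name, no alias available.
            rows.append(" - " + entry)
        else:
            # Fall back to the registered name when no alias is set.
            rows.append(" - " + entry.get("task_alias", entry["task"]))
    return rows


rows = display_rows(config)
print("\n".join(rows))
```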

GroupConfig reference

| Field | Type | Description |
| --- | --- | --- |
| `group` | str | Group name (used on CLI to invoke the group) |
| `group_alias` | str | Display name for the results table |
| `task` | list | List of task names, group names, or inline task/group configs |
| `aggregate_metric_list` | list | Metrics to aggregate across subtasks (see fields above) |
| `metadata` | dict | Arbitrary metadata. Use `version` for versioning, `num_fewshot` to override the displayed n-shot value |