Groups & Benchmarks¶
Groups let you organize related tasks into named collections with aggregate metrics — essential for benchmarks like MMLU (57 subtasks) or SuperGLUE.
Tags vs. Groups¶
| Feature | Tags | Groups |
|---|---|---|
| Purpose | Categorize tasks for batch selection | Organize tasks with aggregate scoring |
| Metrics | No aggregate metrics | Can define aggregate_metric_list |
| Display | Tasks listed individually | Group appears as a row with subtasks underneath |
| Nesting | Flat (tags can't contain tags) | Hierarchical (groups can contain groups) |
| Config | Set in task YAML: tag: [reasoning] | Separate group YAML or inline |
Use tags when you just want to run a set of tasks together. Use groups when you need aggregate scores or hierarchical reporting.
Basic group config¶
Create a YAML file with group and task keys:
group: nli_tasks
task:
  - cb
  - anli_r1
  - rte
This creates a group named nli_tasks containing three tasks. Running lm-eval run --tasks nli_tasks evaluates all three and shows them under a group header in the results table.
Aggregate metrics¶
To report a single score for the group across all subtasks, add aggregate_metric_list:
group: nli_tasks
task:
  - cb
  - anli_r1
  - rte
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true # micro-average (default)
metadata:
  version: 1.0
aggregate_metric_list fields¶
| Field | Type | Description |
|---|---|---|
| metric | str | Name of the metric to aggregate (all subtasks must report this metric) |
| aggregation | str | Aggregation function (currently only mean is supported) |
| weight_by_size | bool | true for micro-average (weight by number of docs per subtask), false for macro-average (equal weight per subtask). Default: true |
| filter_list | str or list | Which filter pipeline(s) to match when aggregating (e.g., "strict-match"). Default: "none" |
Tip
Micro vs. macro averaging: MMLU uses micro-averaging — if one subject has 200 questions and another has 50, the larger subject contributes more to the final score. This is equivalent to concatenating all subtasks into one dataset. Set weight_by_size: false for macro-averaging (equal weight per subtask).
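The difference between the two settings can be checked with a few lines of arithmetic. The sketch below is illustrative only: the subject names, sizes, and accuracies are made up, and aggregate_mean is a hypothetical helper, not a function from the library.

```python
# Hypothetical per-subtask results: (number of docs, accuracy).
subtask_results = {
    "subject_a": (200, 0.60),
    "subject_b": (50, 0.90),
}


def aggregate_mean(results, weight_by_size=True):
    """Mean of per-subtask scores: size-weighted (micro) or unweighted (macro)."""
    if weight_by_size:
        total_docs = sum(n for n, _ in results.values())
        return sum(n * score for n, score in results.values()) / total_docs
    return sum(score for _, score in results.values()) / len(results)


micro = aggregate_mean(subtask_results, weight_by_size=True)
macro = aggregate_mean(subtask_results, weight_by_size=False)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

With these numbers the larger subject pulls the micro-average toward its own score (165 correct out of 250 docs), while the macro-average sits halfway between the two subtask scores.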
Overriding subtask config¶
Apply config overrides to subtasks within a group:
group: my_benchmark
task:
  - task: mmlu
    num_fewshot: 5 # Override few-shot count for all MMLU subtasks
  - task: hellaswag
    num_fewshot: 0
When the subtask is itself a group (like mmlu), the override propagates to all its children.
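One way to picture this propagation is a recursive walk over the group tree. The following is a minimal sketch under assumed conventions: the dict layout (a group holds a list under "task", a leaf holds a string) and the apply_overrides helper are hypothetical and do not reflect the library's internal representation.

```python
# Hypothetical layout: a group has a list under "task"; a leaf has a string.
mmlu = {
    "group": "mmlu",
    "task": [
        {"task": "mmlu_anatomy", "num_fewshot": 0},
        {"task": "mmlu_astronomy", "num_fewshot": 0},
    ],
}


def apply_overrides(node, overrides):
    """Return a copy of node with overrides applied to every leaf beneath it."""
    if isinstance(node.get("task"), list):
        # Group: recurse into children, leave the group's own keys untouched.
        children = [apply_overrides(child, overrides) for child in node["task"]]
        return {**node, "task": children}
    # Leaf task: override values win over the task's own settings.
    return {**node, **overrides}


overridden = apply_overrides(mmlu, {"num_fewshot": 5})
print([(t["task"], t["num_fewshot"]) for t in overridden["task"]])
```

The helper returns new dicts rather than mutating its input, so the original group definition keeps its own defaults.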
Inline subtask definitions¶
Define new subtasks, including nested groups, directly inside the group config:
group: nli_and_mmlu
task:
  - group: nli_tasks
    task:
      - cb
      - anli_r1
      - rte
    aggregate_metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true
  - task: mmlu
    num_fewshot: 2
This creates a nested structure: nli_and_mmlu contains the inline nli_tasks group and the existing mmlu group.
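To make the nesting concrete, the sketch below mirrors that config as a plain Python dict and prints it as an indented outline. It is only an illustration: the dict is a hand-written stand-in for the parsed YAML, and outline is a hypothetical helper, not part of the library.

```python
# Hand-written stand-in for the parsed group config above.
nli_and_mmlu = {
    "group": "nli_and_mmlu",
    "task": [
        {"group": "nli_tasks", "task": ["cb", "anli_r1", "rte"]},
        {"task": "mmlu", "num_fewshot": 2},
    ],
}


def outline(node, depth=0):
    """Yield one indented line per group or task, parents before children."""
    indent = "  " * depth
    if isinstance(node, str):
        yield f"{indent}- {node}"
    elif "group" in node:
        yield f"{indent}{node['group']}"
        for child in node["task"]:
            yield from outline(child, depth + 1)
    else:
        # A task (or existing group) referenced by name; it is not expanded here.
        yield f"{indent}- {node['task']}"


lines = list(outline(nli_and_mmlu))
print("\n".join(lines))
```

The inline nli_tasks group is expanded because its children are defined in place, whereas mmlu appears as a single entry because it is only referenced by name.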
Python class-based subtasks¶
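For tasks implemented as Python classes, point the class key at the class with !function (here, classes defined in a task.py module next to the YAML file):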
group: scrolls
task:
  - task: scrolls_qasper
    class: !function task.Qasper
  - task: scrolls_quality
    class: !function task.QuALITY
  - task: scrolls_narrativeqa
    class: !function task.NarrativeQA
The :: path syntax¶
Navigate nested groups using :: on the CLI or in Python:
# Run only the anatomy subtask from MMLU
lm-eval run --tasks mmlu::mmlu_anatomy --model hf --model_args pretrained=gpt2
# Navigate deeper nesting
lm-eval run --tasks my_benchmark::nli_tasks::cb --model hf --model_args pretrained=gpt2
In Python:
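The traversal itself can be pictured in plain Python. This is a minimal sketch of how a :: path might be resolved against a nested group tree; the tree layout and the select and leaves helpers are hypothetical, and the assumption that a path ending at a group selects all of its leaf tasks is mine rather than documented behavior.

```python
# Hypothetical group tree: group name -> children; leaf tasks map to None.
tree = {
    "my_benchmark": {
        "nli_tasks": {"cb": None, "anli_r1": None, "rte": None},
        "hellaswag": None,
    },
}


def leaves(name, node):
    """Return the leaf task names at or below the given node."""
    if node is None:
        return [name]
    found = []
    for child_name, child in node.items():
        found.extend(leaves(child_name, child))
    return found


def select(tree, path):
    """Follow a '::'-separated path from the root and list the tasks it selects."""
    parts = path.split("::")
    node = tree
    for part in parts:
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"{part!r} not found while resolving {path!r}")
        node = node[part]
    return leaves(parts[-1], node)


print(select(tree, "my_benchmark::nli_tasks::cb"))
print(select(tree, "my_benchmark::nli_tasks"))
```

Each path segment narrows the search to one child of the previous group, which is why a misspelled or misplaced segment fails early with a KeyError in this sketch.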
Formats in groups¶
Apply prompt formats to tasks within groups using the @ suffix:
group: my_benchmark
task:
  - task: subtask_a@mcqa
    dataset_path: ...
    doc_to_text: question
    doc_to_target: answer
    doc_to_choice: choices
  - task: subtask_b
    formats: generate
    dataset_path: ...
    doc_to_text: question
    doc_to_target: answer
    doc_to_choice: choices
Display names¶
Use group_alias and task_alias to provide cleaner display names:
group: mmlu
group_alias: "MMLU"
task:
  - task: mmlu_abstract_algebra
    task_alias: "Abstract Algebra"
  - task: mmlu_anatomy
    task_alias: "Anatomy"
This keeps the registered names unique while showing cleaner names in the results table.
GroupConfig reference¶
| Field | Type | Description |
|---|---|---|
| group | str | Group name (used on CLI to invoke the group) |
| group_alias | str | Display name for the results table |
| task | list | List of task names, group names, or inline task/group configs |
| aggregate_metric_list | list | Metrics to aggregate across subtasks (see fields above) |
| metadata | dict | Arbitrary metadata. Use version for versioning, num_fewshot to override the displayed n-shot value |