Group Configuration¶

Fields for defining task groups, hierarchical organization, and aggregate scoring behavior.

group ¶

Attributes¶

eval_logger `module-attribute` ¶

eval_logger = getLogger(__name__)

Classes¶

AggMetricConfig `dataclass` ¶

AggMetricConfig(metric: str, filter_list: list[str] | None = None, aggregation: str | Callable = 'mean', weight_by_size: bool = True)

Configuration for how to aggregate a metric across a group's children.

Maps to the entries in aggregate_metric_list in a group YAML file.

Example

aggregate_metric_list:
  - metric: acc
    filter_list: ["none"]
    aggregation: mean
    weight_by_size: true

Attributes¶

metric `instance-attribute` ¶

metric: str

Name of the metric to aggregate across subtasks (e.g. "acc", "exact_match"). All children must report a metric with this name.

filter_list `class-attribute` `instance-attribute` ¶

filter_list: list[str] | None = None

Filter pipeline names to aggregate over (e.g. ["none"], ["strict-match"]). If None, filters are auto-discovered from child task results. A bare string is normalized to a single-element list.

aggregation `class-attribute` `instance-attribute` ¶

aggregation: str | Callable = 'mean'

Aggregation function used to combine per-subtask metrics. "mean" is currently the only built-in string value; a custom callable may be passed instead. Any other value raises ValueError when the config is constructed.

weight_by_size `class-attribute` `instance-attribute` ¶

weight_by_size: bool = True

If True (default), micro-average: weight each subtask's metric by its sample count. If False, macro-average: each subtask contributes equally regardless of size.

Functions¶

__post_init__ ¶

__post_init__()

Source code in lm_eval/config/group.py

def __post_init__(self):
    if self.aggregation != "mean" and not callable(self.aggregation):
        raise ValueError(
            f"Currently, 'mean' is the only pre-defined aggregation. Got '{self.aggregation}'."
        )
    # Handle filter_list: None means auto-discover, string becomes list
    if self.filter_list is None:
        pass  # Keep as None for auto-discovery
    elif isinstance(self.filter_list, str):
        self.filter_list = [self.filter_list]
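The effect of this validation can be tried without installing lm_eval. The sketch below uses a stand-in dataclass, `AggMetricSketch`, that copies the logic listed above; the class name is hypothetical, and the real class may differ in details not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class AggMetricSketch:
    """Stand-in mirroring the __post_init__ shown above; not the lm_eval class."""

    metric: str
    filter_list: Optional[list] = None
    aggregation: Union[str, Callable] = "mean"
    weight_by_size: bool = True

    def __post_init__(self):
        # Only the string "mean" or a callable is accepted.
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
                f"Currently, 'mean' is the only pre-defined aggregation. Got '{self.aggregation}'."
            )
        # A bare string becomes a one-element list; None is kept for auto-discovery.
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]


# A bare string for filter_list is wrapped in a list.
cfg = AggMetricSketch(metric="acc", filter_list="strict-match")

# An unsupported aggregation name is rejected at construction time.
try:
    AggMetricSketch(metric="acc", aggregation="median")
except ValueError as err:
    rejected = str(err)
```

After this runs, `cfg.filter_list` is a single-element list and `rejected` holds the error message for the unsupported name.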

GroupConfig `dataclass` ¶

GroupConfig(group: str, group_alias: str | None = None, task: str | list[str | dict[str, str | dict[str, str]]] | None = None, include: str | dict[str, Any] | None = None, aggregate_metric_list: list[AggMetricConfig] | list[dict] | None = None, metadata: dict[str, Any] | None = None)

Typed representation of a group YAML configuration.

This is the authoritative schema for group YAML files. Raw dicts parsed from YAML are passed through this dataclass so that loose input types (a single string, a bare dict, etc.) are normalized into canonical forms.

Example

group: mmlu
group_alias: MMLU
task:
  - mmlu_anatomy
  - mmlu_biology
  - group: mmlu_chemistry
    task:
      - mmlu_elementary_chemistry
  - task: some_other_task
aggregate_metric_list:
  - metric: acc
    filter_list: ["none"]
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0

Attributes¶

group `instance-attribute` ¶

group: str

Unique identifier for the group, used for CLI selection (e.g. --tasks mmlu).

group_alias `class-attribute` `instance-attribute` ¶

group_alias: str | None = None

Optional display name shown in result tables instead of group.

task `class-attribute` `instance-attribute` ¶

task: str | list[str | dict[str, str | dict[str, str]]] | None = None

Child task and/or group references. Can be a single name, a list of names, or a list of dicts for inline overrides and nested groups. A bare string is normalized to a single-element list.

include `class-attribute` `instance-attribute` ¶

include: str | dict[str, Any] | None = None

Task-level defaults applied to every child in this group.

Can be a path (str) to a YAML file with task fields, or an inline dict of key-value pairs. When a path is given it is resolved relative to the group YAML file's directory.

Example (path):

group: my_bench
include: shared_defaults.yaml
task:
  - task_a
  - task_b

Example (inline):

group: my_bench
include:
  num_fewshot: 5
  doc_to_text: "{{question}}"
task:
  - task_a
  - task_b
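The path rule described above (resolve relative to the group YAML file's directory) can be sketched with `pathlib`. The helper `resolve_include` and the file locations are hypothetical; how lm_eval performs the resolution internally is not shown on this page.

```python
from pathlib import Path


def resolve_include(include, group_yaml_path):
    """Resolve a string include against the group file's directory.

    Inline dicts and None are returned unchanged (illustrative helper).
    """
    if include is None or isinstance(include, dict):
        return include
    # Joining onto the parent directory keeps relative includes next to the
    # group file; pathlib leaves an absolute include path as it is.
    return Path(group_yaml_path).parent / include


# Hypothetical layout: the group file lives in tasks/my_bench/.
resolved = resolve_include("shared_defaults.yaml", "tasks/my_bench/my_bench.yaml")
inline = resolve_include({"num_fewshot": 5}, "tasks/my_bench/my_bench.yaml")
```

Here `resolved` points at `tasks/my_bench/shared_defaults.yaml`, while the inline dict passes through untouched.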

aggregate_metric_list `class-attribute` `instance-attribute` ¶

aggregate_metric_list: list[AggMetricConfig] | list[dict] | None = None

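Metrics to aggregate across child tasks. If omitted, the group appears as a header row with no aggregate score, and a warning is logged when the config is constructed. Accepts a single AggMetricConfig, a single dict, or a list of either; a single item is wrapped in a list and dicts are converted to AggMetricConfig.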

metadata `class-attribute` `instance-attribute` ¶

metadata: dict[str, Any] | None = None

Arbitrary metadata stored alongside results (e.g. {"version": 1.0}). The num_fewshot key overrides the displayed n-shot column for the group in result tables.

Functions¶

__post_init__ ¶

__post_init__()

Source code in lm_eval/config/group.py

def __post_init__(self):
    if isinstance(self.task, str):
        self.task = [self.task]
    if self.aggregate_metric_list is not None:
        if isinstance(self.aggregate_metric_list, (dict, AggMetricConfig)):
            self.aggregate_metric_list = [self.aggregate_metric_list]
        self.aggregate_metric_list = [
            AggMetricConfig(**item) if isinstance(item, dict) else item  # type:ignore[invalid-argument-type]
            for item in self.aggregate_metric_list
        ]
    else:
        eval_logger.warning(
            "[Group '%s'] has no `aggregate_metric_list` set; "
            "group-level aggregations will not be computed. "
            "To enable them, add an `aggregate_metric_list` to the group config.",
            self.group,
        )
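The normalization performed above can be exercised in isolation. The sketch below uses two stand-in dataclasses, `MetricSketch` and `GroupSketch` (hypothetical names, reduced to the fields needed here), and omits the logger warning for brevity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class MetricSketch:
    """Reduced stand-in for AggMetricConfig; not the lm_eval class."""

    metric: str
    aggregation: str = "mean"
    weight_by_size: bool = True


@dataclass
class GroupSketch:
    """Reduced stand-in for GroupConfig that copies the normalization steps."""

    group: str
    task: Union[str, list, None] = None
    aggregate_metric_list: Union[list, dict, MetricSketch, None] = None

    def __post_init__(self):
        # A bare task name becomes a one-element list.
        if isinstance(self.task, str):
            self.task = [self.task]
        if self.aggregate_metric_list is not None:
            # A single dict or config object is wrapped in a list first.
            if isinstance(self.aggregate_metric_list, (dict, MetricSketch)):
                self.aggregate_metric_list = [self.aggregate_metric_list]
            # Dict entries are then converted to config objects.
            self.aggregate_metric_list = [
                MetricSketch(**item) if isinstance(item, dict) else item
                for item in self.aggregate_metric_list
            ]


# Loose input: a bare task string and a single metric dict.
cfg = GroupSketch(group="mmlu", task="mmlu_anatomy", aggregate_metric_list={"metric": "acc"})
```

Both loose inputs end up in canonical form: `cfg.task` is a list of names and `cfg.aggregate_metric_list` is a list of config objects.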

to_dict ¶

to_dict(keep_callable: bool = False) -> dict[str, str]

Source code in lm_eval/config/group.py

def to_dict(self, keep_callable: bool = False) -> dict[str, str]:
    from lm_eval.config.utils import serialize_config

    return serialize_config(self, keep_callable=keep_callable)
