Evaluation Config¶

Configuration dataclass for managing evaluation settings. Can be instantiated directly or loaded from a YAML file.

evaluate_config ¶

Attributes¶

eval_logger `module-attribute` ¶

eval_logger = getLogger(__name__)

DICT_KEYS `module-attribute` ¶

DICT_KEYS = ['wandb_args', 'wandb_config_args', 'hf_hub_log_args', 'metadata', 'model_args', 'gen_kwargs']

Classes¶

EvaluatorConfig `dataclass` ¶

EvaluatorConfig(config: str | None = None, model: str = 'hf', model_args: dict = dict(), tasks: str | list[str] = list(), num_fewshot: int | None = None, repeats: int | None = None, batch_size: int = 1, max_batch_size: int | None = None, device: str | None = 'cuda:0', limit: float | None = None, samples: str | dict | None = None, use_cache: str | None = None, cache_requests: dict = dict(), check_integrity: bool = False, write_out: bool = False, log_samples: bool = False, output_path: str | None = None, predict_only: bool = False, system_instruction: str | None = None, apply_chat_template: bool | str = False, fewshot_as_multiturn: bool | None = None, show_config: bool = False, include_path: str | None = None, include_defaults: bool = True, gen_kwargs: dict = dict(), verbosity: str | None = None, wandb_args: dict = dict(), wandb_config_args: dict = dict(), hf_hub_log_args: dict = dict(), seed: list = (lambda: [0, 1234, 1234, 1234])(), trust_remote_code: bool = False, confirm_run_unsafe_code: bool = False, metadata: dict = dict())

Configuration for language model evaluation runs.

This dataclass contains all parameters for configuring model evaluations via simple_evaluate or the CLI. It supports initialization from:

CLI arguments (via from_cli)
YAML configuration files (via from_config)
Direct instantiation with keyword arguments

The configuration handles argument parsing, validation, and preprocessing to ensure that evaluation settings are properly structured and validated.

Example

# From CLI arguments
config = EvaluatorConfig.from_cli(args)

# From YAML file
config = EvaluatorConfig.from_config("eval_config.yaml")

# Direct instantiation
config = EvaluatorConfig(
    model="hf",
    model_args={"pretrained": "gpt2"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
)

Attributes¶

config `class-attribute` `instance-attribute` ¶

config: str | None = None

Path to a YAML config file. CLI args override values from the file.

model `class-attribute` `instance-attribute` ¶

model: str = 'hf'

Name of the model backend (e.g. "hf", "vllm", "openai").

model_args `class-attribute` `instance-attribute` ¶

model_args: dict = field(default_factory=dict)

Arguments for model initialization, passed to the model constructor.

tasks `class-attribute` `instance-attribute` ¶

tasks: str | list[str] = field(default_factory=list)

Task names to evaluate. Accepts a comma-separated string or a list.

num_fewshot `class-attribute` `instance-attribute` ¶

num_fewshot: int | None = None

Number of examples in few-shot context.

repeats `class-attribute` `instance-attribute` ¶

repeats: int | None = None

Number of repeats for each request (overrides task config).

batch_size `class-attribute` `instance-attribute` ¶

batch_size: int = 1

Batch size for evaluation.

max_batch_size `class-attribute` `instance-attribute` ¶

max_batch_size: int | None = None

Maximum batch size for auto batching.

device `class-attribute` `instance-attribute` ¶

device: str | None = 'cuda:0'

Device to use (e.g. "cuda", "cuda:0", "cpu").

limit `class-attribute` `instance-attribute` ¶

limit: float | None = None

Limit number of examples per task. Mutually exclusive with samples.

samples `class-attribute` `instance-attribute` ¶

samples: str | dict | None = None

Dict, JSON string, or path to a JSON file mapping task names to doc indices.
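
The three accepted input forms can be illustrated with a small normalizer. This is a minimal sketch, not the library's implementation: `resolve_samples` is a hypothetical helper, and the rule "a string starting with `{` is inline JSON, anything else is a file path" is an assumption made for the example.

```python
import json
from pathlib import Path


def resolve_samples(samples):
    """Hypothetical helper: return a dict of task name -> list of doc indices."""
    # None (not set) and ready-made dicts pass through unchanged.
    if samples is None or isinstance(samples, dict):
        return samples
    text = samples.strip()
    # Assumption: inline JSON objects start with "{"; other strings are paths.
    if not text.startswith("{"):
        text = Path(text).read_text()
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise TypeError("samples must map task names to doc indices")
    return parsed


inline = resolve_samples('{"hellaswag": [0, 5, 9]}')
print(inline)
```

With this shape, the same downstream code can consume `samples` regardless of how the user supplied it.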

use_cache `class-attribute` `instance-attribute` ¶

use_cache: str | None = None

Path to a SQLite DB file for caching model outputs.

cache_requests `class-attribute` `instance-attribute` ¶

cache_requests: dict = field(default_factory=dict)

Cache dataset requests. Values: true / "refresh" / "delete".

check_integrity `class-attribute` `instance-attribute` ¶

check_integrity: bool = False

Run the test suite for tasks.

write_out `class-attribute` `instance-attribute` ¶

write_out: bool = False

Print prompts for the first few documents.

log_samples `class-attribute` `instance-attribute` ¶

log_samples: bool = False

Save model outputs and inputs. Requires output_path.

output_path `class-attribute` `instance-attribute` ¶

output_path: str | None = None

Directory path where result metrics will be saved.

predict_only `class-attribute` `instance-attribute` ¶

predict_only: bool = False

Only save model outputs without evaluating metrics. Implies log_samples.
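
The relationships among `log_samples`, `predict_only`, and `output_path` described above can be sketched as a small check. The function name `check_output_flags` and the error message are hypothetical; the actual validation in the library may differ.

```python
def check_output_flags(log_samples, predict_only, output_path):
    """Hypothetical check mirroring the documented flag relationships."""
    # predict_only implies log_samples.
    log_samples = log_samples or predict_only
    # log_samples requires a location to write the samples to.
    if log_samples and not output_path:
        raise ValueError("log_samples requires output_path to be set")
    return log_samples


effective = check_output_flags(
    log_samples=False, predict_only=True, output_path="results/"
)
print(effective)
```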

system_instruction `class-attribute` `instance-attribute` ¶

system_instruction: str | None = None

Custom system instruction prepended to every prompt.

apply_chat_template `class-attribute` `instance-attribute` ¶

apply_chat_template: bool | str = False

Apply chat template to the prompt. Either True, or a string naming the tokenizer template.

fewshot_as_multiturn `class-attribute` `instance-attribute` ¶

fewshot_as_multiturn: bool | None = None

Use fewshot examples as multi-turn conversation. Defaults to True when apply_chat_template is set.
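
One way to read this default is as a tri-state flag: `None` means "not specified, derive it from `apply_chat_template`". The sketch below assumes that an unset value falls back to `False` when no chat template is applied; `resolve_fewshot_as_multiturn` is a hypothetical name, not part of the library.

```python
def resolve_fewshot_as_multiturn(fewshot_as_multiturn, apply_chat_template):
    """Hypothetical resolver for the tri-state fewshot_as_multiturn flag."""
    if fewshot_as_multiturn is None:
        # apply_chat_template may be True or a template name; both count as set.
        return bool(apply_chat_template)
    # An explicit True or False from the user is respected as-is.
    return fewshot_as_multiturn


resolved = resolve_fewshot_as_multiturn(None, "chatml")
print(resolved)
```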

show_config `class-attribute` `instance-attribute` ¶

show_config: bool = False

Show the full config at the end of evaluation.

include_path `class-attribute` `instance-attribute` ¶

include_path: str | None = None

Additional directory path for external tasks.

include_defaults `class-attribute` `instance-attribute` ¶

include_defaults: bool = True

Whether to include built-in tasks from lm_eval/tasks/.

gen_kwargs `class-attribute` `instance-attribute` ¶

gen_kwargs: dict = field(default_factory=dict)

Generation arguments passed to the model. Overrides task-level defaults.

verbosity `class-attribute` `instance-attribute` ¶

verbosity: str | None = None

Logging verbosity level.

wandb_args `class-attribute` `instance-attribute` ¶

wandb_args: dict = field(default_factory=dict)

Arguments for wandb.init.

wandb_config_args `class-attribute` `instance-attribute` ¶

wandb_config_args: dict = field(default_factory=dict)

Arguments for wandb.config.update.

hf_hub_log_args `class-attribute` `instance-attribute` ¶

hf_hub_log_args: dict = field(default_factory=dict)

Arguments for HF Hub logging.

seed `class-attribute` `instance-attribute` ¶

seed: list = field(default_factory=lambda: [0, 1234, 1234, 1234])

Seeds as [random, numpy, torch, fewshot].
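
Because the list is positional, it can help to label each entry before use. The following is a minimal sketch; `SEED_NAMES` and `label_seeds` are hypothetical names introduced for illustration, and only the standard-library `random` generator is actually seeded here.

```python
import random

# Order taken from the attribute description: [random, numpy, torch, fewshot].
SEED_NAMES = ("random", "numpy", "torch", "fewshot")


def label_seeds(seed):
    """Hypothetical helper: pair each positional seed with its purpose."""
    if len(seed) != len(SEED_NAMES):
        raise ValueError(f"expected {len(SEED_NAMES)} seeds, got {len(seed)}")
    return dict(zip(SEED_NAMES, seed))


seeds = label_seeds([0, 1234, 1234, 1234])
random.seed(seeds["random"])  # seed Python's built-in generator
print(seeds)
```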

trust_remote_code `class-attribute` `instance-attribute` ¶

trust_remote_code: bool = False

Trust remote code for HF datasets and models.

confirm_run_unsafe_code `class-attribute` `instance-attribute` ¶

confirm_run_unsafe_code: bool = False

Confirm understanding of unsafe code risks (for tasks that execute arbitrary Python).

metadata `class-attribute` `instance-attribute` ¶

metadata: dict = field(default_factory=dict)

Additional metadata for tasks that require it.

Functions¶

from_cli `classmethod` ¶

from_cli(namespace: Namespace) -> EvaluatorConfig

Build an EvaluatorConfig by merging configuration sources in a simple precedence order.

CLI args > YAML config > built-in defaults.

Source code in lm_eval/config/evaluate_config.py

@classmethod
def from_cli(cls, namespace: Namespace) -> EvaluatorConfig:
    """Build an EvaluatorConfig by merging configuration sources in a simple precedence order.

    CLI args > YAML config > built-in defaults.
    """
    # Start with built-in defaults
    config = asdict(cls())

    # Load and merge YAML config if provided
    if used_config := getattr(namespace, "config", None):
        config.update(cls.load_yaml_config(used_config))

    # Override with CLI args (skip None = "not provided", exclude non-config args)
    excluded_args = {"command", "func"}  # argparse internal args
    cli_args = {
        k: v
        for k, v in vars(namespace).items()
        if v is not None and k not in excluded_args
    }
    config.update(cli_args)

    # Create an instance and validate
    instance = cls(**config)._parse_dict_args()
    instance._configure()

    if used_config:
        cli_args.pop("config", None)
        if cli_args:
            eval_logger.info("CLI args %s will override yaml", cli_args)
        print(textwrap.dedent(f"""{instance}"""))

    return instance
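
The precedence rule (CLI args > YAML config > built-in defaults) can be reproduced in isolation with plain dictionaries. This is a minimal sketch under stated assumptions: `DemoConfig` is a hypothetical three-field stand-in for `EvaluatorConfig`, and the YAML and CLI values are passed in as already-parsed dicts.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class DemoConfig:
    """Hypothetical stand-in for EvaluatorConfig with three fields."""

    model: str = "hf"
    batch_size: int = 1
    tasks: list = field(default_factory=list)


def merge_with_precedence(yaml_values, cli_values):
    # Lowest priority: built-in defaults from the dataclass.
    merged = asdict(DemoConfig())
    # Middle priority: values loaded from the YAML file.
    merged.update(yaml_values)
    # Highest priority: CLI values, where None means "not provided".
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return DemoConfig(**merged)


cfg = merge_with_precedence(
    yaml_values={"batch_size": 8, "tasks": ["hellaswag"]},
    cli_values={"batch_size": 16, "model": None},
)
print(cfg)
```

Here `batch_size` comes from the CLI, `tasks` from the YAML values, and `model` keeps its default because the CLI value is `None`.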

from_config `classmethod` ¶

from_config(config_path: str | Path) -> EvaluatorConfig

Build an EvaluatorConfig from a YAML config file.

Merges with built-in defaults and validates.

Source code in lm_eval/config/evaluate_config.py

@classmethod
def from_config(cls, config_path: str | Path) -> EvaluatorConfig:
    """Build an EvaluatorConfig from a YAML config file.

    Merges with built-in defaults and validates.
    """
    # Load YAML config
    yaml_config = cls.load_yaml_config(config_path)
    return cls(**yaml_config)._configure()

load_yaml_config `staticmethod` ¶

load_yaml_config(config_path: str | Path) -> dict[str, Any]

Load and validate YAML config file.

Source code in lm_eval/config/evaluate_config.py

@staticmethod
def load_yaml_config(config_path: str | Path) -> dict[str, Any]:
    """Load and validate YAML config file."""
    _config_path = Path(config_path)
    if not _config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {_config_path.resolve()}")

    try:
        yaml_data = yaml.safe_load(_config_path.read_text())
    except yaml.YAMLError as e:
        raise ValueError(f"Invalid YAML in {_config_path}: {e}") from e
    except (OSError, UnicodeDecodeError) as e:
        raise ValueError(f"Could not read config file {_config_path}: {e}") from e

    if not isinstance(yaml_data, dict):
        raise TypeError(
            f"YAML root must be a mapping in {_config_path.resolve()}, got {type(yaml_data).__name__}"
        )

    return yaml_data

process_tasks ¶

process_tasks(metadata: dict | None = None) -> TaskManager

Process and validate tasks. The resolved task names are stored in self.tasks, and the TaskManager used to resolve them is returned.

Handles:

Task names (e.g., "hellaswag", "arc_easy")
Custom YAML config files (e.g., "/path/to/task.yaml")
Glob patterns (e.g., "/path/to/*.yaml")
Directories of YAML files

Source code in lm_eval/config/evaluate_config.py

def process_tasks(self, metadata: dict | None = None) -> TaskManager:
    """Process and validate tasks, store resolved names in self.tasks, return the TaskManager.

    Handles:
    - Task names (e.g., "hellaswag", "arc_easy")
    - Custom YAML config files (e.g., "/path/to/task.yaml")
    - Glob patterns (e.g., "/path/to/*.yaml")
    - Directories of YAML files
    """
    import itertools

    from lm_eval.tasks import TaskManager
    from lm_eval.tasks._yaml_loader import load_yaml

    # if metadata manually passed use that:
    self.metadata = metadata or self.metadata

    # Create a task manager with metadata
    task_manager = TaskManager(
        include_path=self.include_path,
        include_defaults=self.include_defaults,
        metadata=self.metadata or {},
    )

    # Normalize tasks to a list
    # We still allow tasks in the form task1,task2
    task_list = (
        [t.strip() for t in self.tasks.split(",")]
        if isinstance(self.tasks, str)
        else [t.strip() for task in self.tasks for t in task.split(",")]
    )

    # Handle directory input
    if len(task_list) == 1 and (yaml_path := Path(task_list[0])).is_dir():
        task_names = []
        for yaml_file in yaml_path.glob("*.yaml"):
            config = load_yaml(yaml_file, resolve_func=False)
            task_names.append(config)
        self.tasks = task_names
        return task_manager

    # Normalize YAML config paths to absolute paths
    task_list = [
        str(Path(task).absolute()) if task.endswith(".yaml") else task
        for task in task_list
    ]
    match_dict: dict[str, list] = {}

    # Match each task
    for task in task_list:
        if not task.endswith(".yaml"):
            # Standard task name - match via task manager
            matches = task_manager.match_tasks([task])
        else:
            # Custom config file(s) - support glob patterns
            matches = []
            for yaml_file in (task_path := Path(task)).parent.glob(task_path.name):
                config = load_yaml(yaml_file, resolve_func=False)
                matches.append(config)
        match_dict[task] = matches

    # Flatten and deduplicate results
    task_names = []
    for task in itertools.chain.from_iterable(match_dict.values()):
        if task not in task_names:
            task_names.append(task)

    # Check for missing tasks
    task_missing = [task for task, matches in match_dict.items() if not matches]
    if task_missing:
        missing = ", ".join(task_missing)
        raise ValueError(f"Tasks not found: {missing}")

    # Update tasks with resolved names
    self.tasks = task_names
    return task_manager
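
The normalization and deduplication steps can be tried without a TaskManager. The sketch below covers only the string handling: splitting comma-separated entries and removing duplicates while keeping first-seen order. `normalize_tasks` is a hypothetical helper, and skipping empty names is an addition not present in the source above.

```python
def normalize_tasks(tasks):
    """Split comma-separated entries and drop duplicates, keeping order."""
    # Accept either a single string or a list of strings.
    raw = [tasks] if isinstance(tasks, str) else list(tasks)
    names = []
    for entry in raw:
        for name in entry.split(","):
            name = name.strip()
            # Assumption for this sketch: ignore empty names from stray commas.
            if name and name not in names:
                names.append(name)
    return names


from_string = normalize_tasks("hellaswag, arc_easy")
from_list = normalize_tasks(["hellaswag,arc_easy", "arc_easy", "winogrande"])
print(from_string, from_list)
```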
