
Evaluation Config

Configuration dataclass for managing evaluation settings. Can be instantiated directly or loaded from a YAML file.

Source

evaluate_config

Attributes

eval_logger module-attribute

eval_logger = getLogger(__name__)

DICT_KEYS module-attribute

DICT_KEYS = ['wandb_args', 'wandb_config_args', 'hf_hub_log_args', 'metadata', 'model_args', 'gen_kwargs']

Classes

EvaluatorConfig dataclass

EvaluatorConfig(config: str | None = None, model: str = 'hf', model_args: dict = dict(), tasks: str | list[str] = list(), num_fewshot: int | None = None, repeats: int | None = None, batch_size: int = 1, max_batch_size: int | None = None, device: str | None = 'cuda:0', limit: float | None = None, samples: str | dict | None = None, use_cache: str | None = None, cache_requests: dict = dict(), check_integrity: bool = False, write_out: bool = False, log_samples: bool = False, output_path: str | None = None, predict_only: bool = False, system_instruction: str | None = None, apply_chat_template: bool | str = False, fewshot_as_multiturn: bool | None = None, show_config: bool = False, include_path: str | None = None, include_defaults: bool = True, gen_kwargs: dict = dict(), verbosity: str | None = None, wandb_args: dict = dict(), wandb_config_args: dict = dict(), hf_hub_log_args: dict = dict(), seed: list = (lambda: [0, 1234, 1234, 1234])(), trust_remote_code: bool = False, confirm_run_unsafe_code: bool = False, metadata: dict = dict())

Configuration for language model evaluation runs.

This dataclass contains all parameters for configuring model evaluations via simple_evaluate or the CLI. It supports initialization from:

  • CLI arguments (via from_cli)
  • YAML configuration files (via from_config)
  • Direct instantiation with keyword arguments

The configuration handles argument parsing, validation, and preprocessing to ensure that settings are properly structured and validated.

Example
# From CLI arguments
config = EvaluatorConfig.from_cli(args)

# From YAML file
config = EvaluatorConfig.from_config("eval_config.yaml")

# Direct instantiation
config = EvaluatorConfig(
    model="hf",
    model_args={"pretrained": "gpt2"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
)
Attributes
config class-attribute instance-attribute
config: str | None = None

Path to a YAML config file. CLI args override values from the file.

model class-attribute instance-attribute
model: str = 'hf'

Name of the model backend (e.g. "hf", "vllm", "openai").

model_args class-attribute instance-attribute
model_args: dict = field(default_factory=dict)

Arguments for model initialization, passed to the model constructor.

tasks class-attribute instance-attribute
tasks: str | list[str] = field(default_factory=list)

Task names to evaluate. Accepts a comma-separated string or a list.
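The two accepted forms can be illustrated with a minimal sketch that mirrors the normalization step shown in process_tasks further down. The helper name normalize_tasks is hypothetical and is not part of the library.

```python
def normalize_tasks(tasks):
    """Flatten a comma-separated string, or a list of such strings, into task names.

    Hypothetical helper for illustration only; it mirrors the list
    normalization performed inside process_tasks.
    """
    if isinstance(tasks, str):
        tasks = [tasks]
    # Each list entry may itself contain several comma-separated names.
    return [name.strip() for item in tasks for name in item.split(",")]


from_string = normalize_tasks("hellaswag, arc_easy")
from_list = normalize_tasks(["hellaswag", "arc_easy,piqa"])
print(from_string)  # ['hellaswag', 'arc_easy']
print(from_list)    # ['hellaswag', 'arc_easy', 'piqa']
```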

num_fewshot class-attribute instance-attribute
num_fewshot: int | None = None

Number of examples in few-shot context.

repeats class-attribute instance-attribute
repeats: int | None = None

Number of repeats for each request (overrides task config).

batch_size class-attribute instance-attribute
batch_size: int = 1

Batch size for evaluation.

max_batch_size class-attribute instance-attribute
max_batch_size: int | None = None

Maximum batch size for auto batching.

device class-attribute instance-attribute
device: str | None = 'cuda:0'

Device to use (e.g. "cuda", "cuda:0", "cpu").

limit class-attribute instance-attribute
limit: float | None = None

Limit number of examples per task. Mutually exclusive with samples.

samples class-attribute instance-attribute
samples: str | dict | None = None

Dict, JSON string, or path to a JSON file mapping task names to doc indices.
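As a rough illustration of the three accepted forms, the sketch below turns each of them into a plain dict. The helper resolve_samples and the rule used to tell a JSON string from a file path are assumptions made for this example; the library's own parsing may differ.

```python
import json
from pathlib import Path


def resolve_samples(samples):
    """Return a dict mapping task names to doc indices, or None.

    Hypothetical helper: a string starting with "{" is treated as inline
    JSON, and any other string is treated as a path to a JSON file.
    """
    if samples is None or isinstance(samples, dict):
        return samples
    text = samples.strip()
    if text.startswith("{"):
        return json.loads(text)
    return json.loads(Path(text).read_text())


from_dict = resolve_samples({"hellaswag": [0, 1, 2]})
from_json = resolve_samples('{"hellaswag": [0, 1, 2], "arc_easy": [5]}')
print(from_json)  # {'hellaswag': [0, 1, 2], 'arc_easy': [5]}
```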

use_cache class-attribute instance-attribute
use_cache: str | None = None

Path to a SQLite DB file for caching model outputs.

cache_requests class-attribute instance-attribute
cache_requests: dict = field(default_factory=dict)

Cache dataset requests. Values: true / "refresh" / "delete".

check_integrity class-attribute instance-attribute
check_integrity: bool = False

Run the test suite for tasks.

write_out class-attribute instance-attribute
write_out: bool = False

Print prompts for the first few documents.

log_samples class-attribute instance-attribute
log_samples: bool = False

Save model outputs and inputs. Requires output_path.

output_path class-attribute instance-attribute
output_path: str | None = None

Directory path where result metrics will be saved.

predict_only class-attribute instance-attribute
predict_only: bool = False

Only save model outputs without evaluating metrics. Implies log_samples.

system_instruction class-attribute instance-attribute
system_instruction: str | None = None

Custom system instruction prepended to every prompt.

apply_chat_template class-attribute instance-attribute
apply_chat_template: bool | str = False

Apply chat template to the prompt. Either True, or a string naming the tokenizer template.

fewshot_as_multiturn class-attribute instance-attribute
fewshot_as_multiturn: bool | None = None

Use fewshot examples as multi-turn conversation. Defaults to True when apply_chat_template is set.

show_config class-attribute instance-attribute
show_config: bool = False

Show the full config at the end of evaluation.

include_path class-attribute instance-attribute
include_path: str | None = None

Additional directory path for external tasks.

include_defaults class-attribute instance-attribute
include_defaults: bool = True

Whether to include built-in tasks from lm_eval/tasks/.

gen_kwargs class-attribute instance-attribute
gen_kwargs: dict = field(default_factory=dict)

Generation arguments passed to the model. Overrides task-level defaults.

verbosity class-attribute instance-attribute
verbosity: str | None = None

Logging verbosity level.

wandb_args class-attribute instance-attribute
wandb_args: dict = field(default_factory=dict)

Arguments for wandb.init.

wandb_config_args class-attribute instance-attribute
wandb_config_args: dict = field(default_factory=dict)

Arguments for wandb.config.update.

hf_hub_log_args class-attribute instance-attribute
hf_hub_log_args: dict = field(default_factory=dict)

Arguments for HF Hub logging.

seed class-attribute instance-attribute
seed: list = field(default_factory=lambda: [0, 1234, 1234, 1234])

Seeds as [random, numpy, torch, fewshot].
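The positional layout can be made explicit by pairing each entry with its role. This is a small sketch using the default value; the label names are illustrative and come only from the description above.

```python
# Default seed list, in the documented order.
seed = [0, 1234, 1234, 1234]

# Illustrative labels for each position (not library identifiers).
roles = ("random", "numpy", "torch", "fewshot")
seed_by_role = dict(zip(roles, seed))

print(seed_by_role)
# {'random': 0, 'numpy': 1234, 'torch': 1234, 'fewshot': 1234}
```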

trust_remote_code class-attribute instance-attribute
trust_remote_code: bool = False

Trust remote code for HF datasets and models.

confirm_run_unsafe_code class-attribute instance-attribute
confirm_run_unsafe_code: bool = False

Confirm understanding of unsafe code risks (for tasks that execute arbitrary Python).

metadata class-attribute instance-attribute
metadata: dict = field(default_factory=dict)

Additional metadata for tasks that require it.

Functions
from_cli classmethod
from_cli(namespace: Namespace) -> EvaluatorConfig

Build an EvaluatorConfig by merging configuration sources with a simple precedence.

CLI args > YAML config > built-in defaults.
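The precedence rule amounts to successive dict updates, with CLI values of None treated as "not provided". The following is a minimal sketch with plain dicts; merge_config and the sample values are hypothetical and stand in for the dataclass defaults, the loaded YAML, and the argparse namespace.

```python
def merge_config(defaults, yaml_config, cli_args):
    """Merge three config layers: CLI args > YAML config > defaults.

    Hypothetical helper for illustration; CLI entries whose value is
    None are skipped so they do not clobber lower-priority values.
    """
    merged = dict(defaults)
    merged.update(yaml_config)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


defaults = {"model": "hf", "batch_size": 1, "device": "cuda:0"}
yaml_config = {"batch_size": 8, "device": "cpu"}
cli_args = {"batch_size": 16, "device": None}  # device not given on the CLI

merged = merge_config(defaults, yaml_config, cli_args)
print(merged)  # {'model': 'hf', 'batch_size': 16, 'device': 'cpu'}
```

One consequence of skipping None values is that a CLI argument cannot be used to reset a YAML setting back to None.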

Source code in lm_eval/config/evaluate_config.py
@classmethod
def from_cli(cls, namespace: Namespace) -> EvaluatorConfig:
    """Build an EvaluatorConfig by merging with a simple precedence.

    CLI args > YAML config > built-in defaults.
    """
    # Start with built-in defaults
    config = asdict(cls())

    # Load and merge YAML config if provided
    if used_config := getattr(namespace, "config", None):
        config.update(cls.load_yaml_config(used_config))

    # Override with CLI args (skip None = "not provided", exclude non-config args)
    excluded_args = {"command", "func"}  # argparse internal args
    cli_args = {
        k: v
        for k, v in vars(namespace).items()
        if v is not None and k not in excluded_args
    }
    config.update(cli_args)

    # Create an instance and validate
    instance = cls(**config)._parse_dict_args()
    instance._configure()

    if used_config:
        cli_args.pop("config", None)
        eval_logger.info(
            "CLI args %s will override yaml", cli_args
        ) if cli_args else None
        print(textwrap.dedent(f"""{instance}"""))

    return instance
from_config classmethod
from_config(config_path: str | Path) -> EvaluatorConfig

Build an EvaluatorConfig from a YAML config file.

Merges with built-in defaults and validates.

Source code in lm_eval/config/evaluate_config.py
@classmethod
def from_config(cls, config_path: str | Path) -> EvaluatorConfig:
    """Build an EvaluatorConfig from a YAML config file.

    Merges with built-in defaults and validates.
    """
    # Load YAML config
    yaml_config = cls.load_yaml_config(config_path)
    return cls(**yaml_config)._configure()
load_yaml_config staticmethod
load_yaml_config(config_path: str | Path) -> dict[str, Any]

Load and validate YAML config file.

Source code in lm_eval/config/evaluate_config.py
@staticmethod
def load_yaml_config(config_path: str | Path) -> dict[str, Any]:
    """Load and validate YAML config file."""
    _config_path = Path(config_path)
    if not _config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {_config_path.resolve()}")

    try:
        yaml_data = yaml.safe_load(_config_path.read_text())
    except yaml.YAMLError as e:
        raise ValueError(f"Invalid YAML in {_config_path}: {e}") from e
    except (OSError, UnicodeDecodeError) as e:
        raise ValueError(f"Could not read config file {_config_path}: {e}") from e

    if not isinstance(yaml_data, dict):
        raise TypeError(
            f"YAML root must be a mapping in {_config_path.resolve()}, got {type(yaml_data).__name__}"
        )

    return yaml_data
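The final type check matters more than it may appear: an empty YAML file parses to None and a file with a top-level list parses to a Python list, and both are rejected with TypeError. The sketch below isolates that check on already-parsed data; require_mapping is a hypothetical name and the file names are made up.

```python
def require_mapping(data, source):
    """Raise TypeError unless the parsed YAML root is a mapping.

    Hypothetical helper that isolates the root-type check performed
    at the end of load_yaml_config.
    """
    if not isinstance(data, dict):
        raise TypeError(
            f"YAML root must be a mapping in {source}, got {type(data).__name__}"
        )
    return data


ok = require_mapping({"model": "hf", "tasks": ["hellaswag"]}, "eval_config.yaml")

errors = []
for parsed in (None, ["hellaswag", "arc_easy"]):  # empty file, top-level list
    try:
        require_mapping(parsed, "bad_config.yaml")
    except TypeError as exc:
        errors.append(str(exc))

print(errors[0])  # YAML root must be a mapping in bad_config.yaml, got NoneType
print(errors[1])  # YAML root must be a mapping in bad_config.yaml, got list
```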
process_tasks
process_tasks(metadata: dict | None = None) -> TaskManager

Process and validate tasks. The resolved task names are stored on self.tasks, and the TaskManager used to resolve them is returned.

Handles:

  • Task names (e.g., "hellaswag", "arc_easy")
  • Custom YAML config files (e.g., "/path/to/task.yaml")
  • Glob patterns (e.g., "/path/to/*.yaml")
  • Directories of YAML files
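A single code path can cover both plain YAML paths and glob patterns by globbing the file-name part of the spec inside its parent directory, which is the approach taken in the source below. This is a self-contained sketch of that idea; expand_yaml_spec is a hypothetical name, and the results are sorted here only to make the output deterministic.

```python
import tempfile
from pathlib import Path


def expand_yaml_spec(spec):
    """Return the files matching a YAML path or glob pattern.

    Hypothetical helper: the last path component is used as the glob
    pattern, so a literal file name simply matches itself if it exists.
    """
    path = Path(spec)
    return sorted(path.parent.glob(path.name))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.yaml", "b.yaml", "notes.txt"):
        (Path(tmp) / name).write_text("task: demo\n")

    glob_names = [p.name for p in expand_yaml_spec(f"{tmp}/*.yaml")]
    single_names = [p.name for p in expand_yaml_spec(f"{tmp}/a.yaml")]
    missing_names = [p.name for p in expand_yaml_spec(f"{tmp}/missing.yaml")]

print(glob_names)     # ['a.yaml', 'b.yaml']
print(single_names)   # ['a.yaml']
print(missing_names)  # []
```

An empty match list is what leads to the "Tasks not found" error in process_tasks.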

Source code in lm_eval/config/evaluate_config.py
def process_tasks(self, metadata: dict | None = None) -> TaskManager:
    """Process and validate tasks, return resolved task names.

    Handles:
    - Task names (e.g., "hellaswag", "arc_easy")
    - Custom YAML config files (e.g., "/path/to/task.yaml")
    - Glob patterns (e.g., "/path/to/*.yaml")
    - Directories of YAML files
    """
    import itertools

    from lm_eval.tasks import TaskManager
    from lm_eval.tasks._yaml_loader import load_yaml

    # if metadata manually passed use that:
    self.metadata = metadata or self.metadata

    # Create a task manager with metadata
    task_manager = TaskManager(
        include_path=self.include_path,
        include_defaults=self.include_defaults,
        metadata=self.metadata or {},
    )

    # Normalize tasks to a list
    # We still allow tasks in the form task1,task2
    task_list = (
        [t.strip() for t in self.tasks.split(",")]
        if isinstance(self.tasks, str)
        else [t.strip() for task in self.tasks for t in task.split(",")]
    )

    # Handle directory input
    if len(task_list) == 1 and (yaml_path := Path(task_list[0])).is_dir():
        task_names = []
        for yaml_file in yaml_path.glob("*.yaml"):
            config = load_yaml(yaml_file, resolve_func=False)
            task_names.append(config)
        self.tasks = task_names
        return task_manager

    # Normalize paths and deduplicate
    task_list = [
        str(Path(task).absolute()) if task.endswith(".yaml") else task
        for task in task_list
    ]
    match_dict: dict[str, list] = {}

    # Match each task
    for task in task_list:
        if not task.endswith(".yaml"):
            # Standard task name - match via task manager
            matches = task_manager.match_tasks([task])
        else:
            # Custom config file(s) - support glob patterns
            matches = []
            for yaml_file in (task_path := Path(task)).parent.glob(task_path.name):
                config = load_yaml(yaml_file, resolve_func=False)
                matches.append(config)
        match_dict[task] = matches

    # Flatten and deduplicate results
    task_names = []
    for task in itertools.chain.from_iterable(match_dict.values()):
        if task not in task_names:
            task_names.append(task)

    # Check for missing tasks
    task_missing = [task for task, matches in match_dict.items() if not matches]
    if task_missing:
        missing = ", ".join(task_missing)
        raise ValueError(f"Tasks not found: {missing}")

    # Update tasks with resolved names
    self.tasks = task_names
    return task_manager

Functions