Python API

This guide covers programmatic usage of the evaluation harness in Python scripts and applications.

Overview

The library provides three main ways to run evaluations programmatically:

| Entry point | Use case |
| --- | --- |
| simple_evaluate() | Most common: accepts model name strings or LM objects |
| EvaluatorConfig | Config-based: load settings from YAML or a dataclass |
| evaluate() | Low-level: full control over task dictionaries |

Using simple_evaluate()

The simple_evaluate() function is the recommended entry point for most use cases.

Basic Usage

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2,dtype=float32",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
)

With a Pre-initialized Model

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model separately
lm = HFLM(pretrained="gpt2", batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
    num_fewshot=0,
)

With External Tasks

import lm_eval
from lm_eval.tasks import TaskManager

# Include custom task definitions
task_manager = TaskManager(include_path="/path/to/custom/tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["my_custom_task"],
    task_manager=task_manager,
)

Common Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model | str or LM | Model name (e.g., "hf", "vllm") or LM instance |
| model_args | str or dict | Model constructor arguments |
| tasks | list[str] | Task names to evaluate |
| num_fewshot | int | Number of few-shot examples |
| batch_size | int or str | Batch size or "auto" |
| device | str | Device (e.g., "cuda", "cpu", "mps") |
| limit | int or float | Limit examples per task (an int is a count; a float below 1 is a fraction of the dataset) |
| repeats | int | Number of repeated runs per sample (for self-consistency) |
| log_samples | bool | Save model inputs/outputs (default: True) |
| task_manager | TaskManager | Task manager to use, needed for external tasks |
| gen_kwargs | dict | Generation arguments |
| apply_chat_template | bool or str | Use chat template |
| system_instruction | str | System prompt |
| fewshot_as_multiturn | bool | Present few-shot examples as a multi-turn conversation |
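
The table lists model_args as either a string or a dict. The sketch below shows how the two forms relate, assuming the string form is a comma-separated list of key=value pairs as in the examples above. The helper model_args_to_string is hypothetical and not part of lm_eval, and it deliberately rejects values that contain a comma rather than guessing at an escaping rule.

```python
def model_args_to_string(model_args: dict) -> str:
    """Join a dict into the comma-separated key=value form used above.

    Hypothetical helper for illustration; not part of lm_eval.
    """
    parts = []
    for key, value in model_args.items():
        text = str(value)
        if "," in text:
            # A comma would be read as the start of the next argument.
            raise ValueError(f"cannot encode {key}={text!r} in string form")
        parts.append(f"{key}={text}")
    return ",".join(parts)


args = {"pretrained": "gpt2", "dtype": "float32"}
args_string = model_args_to_string(args)
print(args_string)  # pretrained=gpt2,dtype=float32
```

When the arguments already live in a dict, passing the dict directly is usually simpler than building the string.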

Return Value

simple_evaluate() returns a dictionary with:

{
    "results": {
        "task_name": {
            "metric_name,filter_name": value,
            "metric_name,filter_name_stderr": stderr_value,
        }
    },
    "configs": {...},      # Task configurations
    "versions": {...},     # Task versions
    "n-shot": {...},       # Few-shot counts
    "higher_is_better": {...},
    "n-samples": {...},
    "samples": {...},      # If log_samples=True
}
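
Because each metric key combines a metric name and a filter name, it is often convenient to flatten the "results" entry into rows before printing or exporting. The following is a minimal sketch, assuming the key layout shown above. flatten_results is a hypothetical helper, the numbers are placeholders rather than real scores, and any key containing "_stderr" is treated as a standard-error entry wherever the suffix appears.

```python
def flatten_results(results: dict) -> list[dict]:
    """Turn results["results"] into one row per (task, metric, filter).

    Hypothetical helper for illustration; not part of lm_eval.
    """
    rows = []
    for task_name, metrics in results["results"].items():
        for key, value in metrics.items():
            # Skip entries that are not "metric,filter" pairs, and skip stderr entries.
            if "," not in key or "_stderr" in key:
                continue
            metric, filter_name = key.split(",", 1)
            rows.append(
                {"task": task_name, "metric": metric, "filter": filter_name, "value": value}
            )
    return rows


# Placeholder dictionary shaped like the return value described above.
example = {
    "results": {
        "hellaswag": {
            "alias": "hellaswag",      # any entry without a comma is ignored
            "acc,none": 0.25,          # placeholder value
            "acc,none_stderr": 0.01,   # placeholder value
        }
    }
}

rows = flatten_results(example)
print(rows)
```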

Using TaskManager.load()

The TaskManager.load() method is the modern way to build task dictionaries. It supports name strings, config dicts, runtime overrides, and the @format / :: path syntax.

Basic Usage

from lm_eval.tasks import TaskManager

tm = TaskManager()
loaded = tm.load(["hellaswag", "arc_easy"])

loaded["tasks"]      # {"hellaswag": Task, "arc_easy": Task}
loaded["groups"]     # {} (no groups requested)
loaded["group_map"]  # {}

Loading Groups

When you request a group, load() expands it into its leaf tasks and tracks the group structure:

loaded = tm.load(["mmlu"])

loaded["tasks"]      # {"mmlu_abstract_algebra": Task, "mmlu_anatomy": Task, ...}
loaded["groups"]     # {"mmlu": Group}
loaded["group_map"]  # {"mmlu": ["mmlu_abstract_algebra", "mmlu_anatomy", ...]}

Runtime Overrides

Override task config fields at load time without modifying YAML files:

loaded = tm.load(
    ["mmlu", "arc_easy"],
    overrides={
        "arc_easy": {"num_fewshot": 5},
        "mmlu": {"num_fewshot": 3},
    },
)

Overrides apply to the named task or group. For groups, overrides propagate to all child tasks at evaluation time. Note that group_map itself only records each group's direct children — recursive propagation is handled by the evaluator.
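
Since group_map records only direct children, finding every leaf task under a nested group means walking the map yourself. The sketch below works on a plain dict shaped like group_map. The group and task names are hypothetical, leaf_tasks is not part of lm_eval, and the map is assumed to contain no cycles.

```python
def leaf_tasks(group_map: dict[str, list[str]], name: str) -> list[str]:
    """Return the leaf task names under `name`, expanding nested groups.

    Hypothetical helper for illustration; assumes group_map has no cycles.
    """
    if name not in group_map:
        return [name]  # not a group, so it is a leaf task
    leaves = []
    for child in group_map[name]:
        leaves.extend(leaf_tasks(group_map, child))
    return leaves


# Hypothetical nested structure: "suite" contains a sub-group and a task.
group_map = {
    "suite": ["math_group", "task_c"],
    "math_group": ["task_a", "task_b"],
}

print(leaf_tasks(group_map, "suite"))  # ['task_a', 'task_b', 'task_c']
```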

Inline Config Dicts

You can mix name strings with full config dicts:

loaded = tm.load([
    "hellaswag",                    # by name
    {                               # inline config
        "task": "my_task",
        "dataset_path": "my_org/my_data",
        "test_split": "test",
        "doc_to_text": "question",
        "doc_to_target": "answer",
    },
])

Return Type: TaskDict

TaskManager.load() returns a TaskDict TypedDict:

| Key | Type | Description |
| --- | --- | --- |
| tasks | dict[str, Task] | Flat mapping of every leaf task name to its Task object |
| groups | dict[str, Group] | Flat mapping of every group name to its Group object |
| group_map | dict[str, list[str]] | Each group's direct children (not recursive) |

Using EvaluatorConfig

The EvaluatorConfig class provides a structured way to manage evaluation settings.

From YAML File

from lm_eval.config.evaluate_config import EvaluatorConfig
import lm_eval

# Load configuration from YAML
config = EvaluatorConfig.from_config("eval_config.yaml")

# Process tasks
task_manager = config.process_tasks()

# Run evaluation
results = lm_eval.simple_evaluate(
    model=config.model,
    model_args=config.model_args,
    tasks=config.tasks,
    num_fewshot=config.num_fewshot,
    batch_size=config.batch_size,
    device=config.device,
    task_manager=task_manager,
    log_samples=config.log_samples,
    gen_kwargs=config.gen_kwargs,
    apply_chat_template=config.apply_chat_template,
    system_instruction=config.system_instruction,
)

Direct Instantiation

from lm_eval.config.evaluate_config import EvaluatorConfig

config = EvaluatorConfig(
    model="hf",
    model_args={"pretrained": "gpt2", "dtype": "float32"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    output_path="./results/",
    log_samples=True,
)

task_manager = config.process_tasks()

See the Configuration Guide for all available fields.


Using evaluate()

The evaluate() function provides lower-level control, accepting pre-built task dictionaries.

import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.models.huggingface import HFLM

# Initialize model
lm = HFLM(pretrained="gpt2", batch_size=16)

# Build task dictionary using TaskManager.load();
# include_path makes the custom task definitions discoverable
tm = TaskManager(include_path="/path/to/custom/tasks")
loaded = tm.load(["hellaswag", "my_custom_task"])

# Run evaluation
results = lm_eval.evaluate(
    lm=lm,
    task_dict=loaded,
    num_fewshot=5,
    limit=100,
)

Custom Models

To evaluate a custom model, create a subclass of lm_eval.api.model.LM:

from lm_eval.api.model import LM

class MyCustomLM(LM):
    def __init__(self, model, batch_size=1):
        super().__init__()
        self.model = model
        self._batch_size = batch_size

    def loglikelihood(self, requests):
        # Return list of (logprob, is_greedy) tuples
        ...

    def generate_until(self, requests):
        # Return list of generated strings
        ...

    def loglikelihood_rolling(self, requests):
        # Return list of log-likelihoods, one per request (no is_greedy flag)
        ...

    @property
    def batch_size(self):
        return self._batch_size

Then use it with simple_evaluate():

import lm_eval

my_model = load_my_model()  # placeholder for your own model-loading code
lm = MyCustomLM(model=my_model, batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
)

For detailed guidance on implementing custom models, see the Custom Model Backend guide.


Logging

Configure logging for debugging:

from lm_eval.utils import setup_logging

# Set log level
setup_logging("DEBUG")  # DEBUG, INFO, WARNING, ERROR

# Or use environment variable
import os
os.environ["LMEVAL_LOG_LEVEL"] = "DEBUG"

Examples

Batch Evaluation of Multiple Models

import lm_eval

models = [
    "gpt2",
    "gpt2-medium",
    "gpt2-large",
]

all_results = {}
for model_name in models:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_name}",
        tasks=["hellaswag"],
        batch_size="auto",
    )
    all_results[model_name] = results["results"]
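
Once all_results is populated, a small helper can rank the models on one metric. This is a sketch under stated assumptions: rank_models is hypothetical, the metric key "acc,none" is assumed for illustration, higher values are assumed to be better, and the scores are placeholders rather than real measurements.

```python
def rank_models(all_results: dict, task: str, metric_key: str) -> list[tuple[str, float]]:
    """Sort models by one metric on one task, best (highest) first.

    Hypothetical helper for illustration; models missing the metric are skipped.
    """
    scores = [
        (model_name, per_task[task][metric_key])
        for model_name, per_task in all_results.items()
        if metric_key in per_task.get(task, {})
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder data shaped like all_results from the loop above.
all_results = {
    "gpt2": {"hellaswag": {"acc,none": 0.1}},
    "gpt2-medium": {"hellaswag": {"acc,none": 0.3}},
    "gpt2-large": {"hellaswag": {"acc,none": 0.2}},
}

ranking = rank_models(all_results, task="hellaswag", metric_key="acc,none")
for model_name, score in ranking:
    print(f"{model_name}: {score}")
```

For metrics where lower is better, the "higher_is_better" entry of the full return value indicates the direction, and the sort order would need to be reversed.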

Save and Load Results

import json
import lm_eval
from lm_eval.utils import handle_non_serializable

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

# Save results
with open("results.json", "w") as f:
    json.dump(results, f, default=handle_non_serializable, indent=2)
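
Loading the saved file back needs only the standard json module. The sketch below is self-contained: it writes a small stand-in dictionary with placeholder values to a temporary directory, then reads it back. In practice you would open the results.json written above instead.

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by simple_evaluate(); values are placeholders.
results = {"results": {"hellaswag": {"acc,none": 0.25}}}

# Write to a temporary location so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Load results
with open(path) as f:
    loaded = json.load(f)

print(loaded["results"]["hellaswag"]["acc,none"])  # 0.25
```

Anything that was converted by handle_non_serializable during saving comes back as a plain JSON type (string, number, list, or dict), so the loaded dictionary may not be identical to the original in-memory object.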