LM Evaluation Harness

A unified framework for evaluating generative language models across a wide range of evaluation tasks.

Features

60+ benchmarks with hundreds of subtasks, including MMLU, HellaSwag, GSM8K, ARC, and more
Multiple backends — HuggingFace Transformers, vLLM, OpenAI-compatible APIs, and custom models
YAML-based configuration with Jinja2 templating and declarative prompt formats. Setting formats: mcqa auto-generates A/B/C/D prompts from simple field mappings, and --tasks my_task@generate switches to a different format at runtime
Reproducible evaluations with published prompts, versioning, and shareable configs
Extensible scoring — pluggable scorers, metrics, and filter pipelines

pip install "lm-eval[hf]"
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)
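
Once simple_evaluate returns, the scores usually need to be pulled out of the returned dictionary. The sketch below is one way to do that, assuming per-task metrics sit under a top-level "results" key that maps task names to metric dictionaries. The helper flatten_metrics, the metric names, and the placeholder values are hypothetical and used only for illustration; check the keys your installed version actually emits.

```python
from typing import Any, Dict, List, Tuple


def flatten_metrics(results: Dict[str, Any]) -> List[Tuple[str, str, float]]:
    """Return (task, metric, value) rows for every numeric metric."""
    rows: List[Tuple[str, str, float]] = []
    # Assumption: per-task metrics live under the top-level "results" key.
    for task, metrics in sorted(results.get("results", {}).items()):
        for name, value in sorted(metrics.items()):
            # Keep numeric entries only; bool is an int subclass, so exclude it.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, float(value)))
    return rows


# Placeholder dictionary with made-up values (not real scores),
# shaped like the assumed return value of simple_evaluate.
example = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc": 0.5, "acc_stderr": 0.01},
    }
}

for task, metric, value in flatten_metrics(example):
    print(f"{task}\t{metric}\t{value:.4f}")
```

Passing the real return value in place of example gives one row per task and metric, which is convenient for writing to a CSV file or comparing runs.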

See the Quickstart guide for a complete walkthrough.

I want to...	Start here
Get up and running	Quickstart
Run evaluations from CLI or Python	CLI Reference / Python API
Create or customize evaluation tasks	Your First Task
Use prompt formats to simplify task authoring	Prompt Formats
Add a model backend, scorer, or metric	Custom Model / Custom Scorers
Upgrade from v0.4	Migrating from v0.4
Browse the API reference	API Reference
Contribute to the project	Contributing