LM Evaluation Harness
A unified framework for evaluating generative language models across a wide range of evaluation tasks.
Features
- 60+ benchmarks with hundreds of subtasks, including MMLU, HellaSwag, GSM8K, ARC, and more
- Multiple backends — HuggingFace Transformers, vLLM, OpenAI-compatible APIs, and custom models
- YAML-based configuration with Jinja2 templating and declarative prompt formats. Use `formats: mcqa` to auto-generate A/B/C/D prompts from simple field mappings, or try different formats at runtime with `--tasks my_task@generate`
- Reproducible evaluations with published prompts, versioning, and shareable configs
- Extensible scoring — pluggable scorers, metrics, and filter pipelines
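To make the idea of generating A/B/C/D prompts from field mappings concrete, here is a minimal sketch in plain Python. It is not the harness's implementation: the function `format_mcqa`, the field names `question` and `choices`, and the exact prompt layout are all hypothetical and chosen only for illustration.

```python
from string import ascii_uppercase


def format_mcqa(doc, question_field="question", choices_field="choices"):
    """Render one document as a lettered multiple-choice prompt.

    Hypothetical helper: the field names and layout are assumptions,
    not the harness's actual `mcqa` format.
    """
    lines = [f"Question: {doc[question_field]}"]
    # Pair each answer choice with a letter: A, B, C, D, ...
    for letter, choice in zip(ascii_uppercase, doc[choices_field]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


doc = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"]}
prompt = format_mcqa(doc)
print(prompt)
```

The point of a declarative format is that a task author only names which dataset fields hold the question and the choices; the lettering and layout come from the format rather than from a hand-written template.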
Quick start
import lm_eval

# Evaluate GPT-2 through the HuggingFace backend on HellaSwag.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)
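Once an evaluation returns, the scores usually need to be pulled out of the returned object. The sketch below assumes the return value is a dictionary with a top-level `"results"` key that maps task names to metric dictionaries, as in v0.4; the layout may differ in the version documented here. The helper `summarize` is hypothetical, and the numbers in `example` are placeholders, not real scores.

```python
def summarize(results):
    """Flatten {"results": {task: {metric: value}}} into (task, metric, value) rows.

    Assumes a v0.4-style layout; only numeric entries are kept.
    """
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Skip non-numeric entries such as display aliases.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, metric, value))
    return rows


# Stand-in for the evaluation output; the values are placeholders.
example = {"results": {"hellaswag": {"acc": 0.5, "alias": "hellaswag"}}}
for task, metric, value in summarize(example):
    print(f"{task:<12} {metric:<8} {value:.4f}")
```

Passing the real return value in place of `example` should work unchanged if the assumed layout holds.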
See the Quickstart guide for a complete walkthrough.
Documentation
| I want to... | Start here |
|---|---|
| Get up and running | Quickstart |
| Run evaluations from the CLI or Python | CLI Reference / Python API |
| Create or customize evaluation tasks | Your First Task |
| Use prompt formats to simplify task authoring | Prompt Formats |
| Add a model backend, scorer, or metric | Custom Model / Custom Scorers |
| Upgrade from v0.4 | Migrating from v0.4 |
| Browse the API reference | API Reference |
| Contribute to the project | Contributing |