
LM Evaluation Harness

A unified framework for evaluating generative language models across a large number of evaluation tasks.

Features

  • 60+ benchmarks with hundreds of subtasks, including MMLU, HellaSwag, GSM8K, ARC, and more
  • Multiple backends — HuggingFace Transformers, vLLM, OpenAI-compatible APIs, and custom models
  • YAML-based configuration with Jinja2 templating and declarative prompt formats — use formats: mcqa to auto-generate A/B/C/D prompts from simple field mappings, or try different formats at runtime with --tasks my_task@generate
  • Reproducible evaluations with published prompts, versioning, and shareable configs
  • Extensible scoring — pluggable scorers, metrics, and filter pipelines
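To make the idea of a declarative multiple-choice format concrete, here is a minimal standalone sketch of what turning field mappings into an A/B/C/D prompt could look like. It is not the harness's implementation: the function `format_mcqa`, its parameter names, and the prompt layout are all hypothetical and chosen only for illustration.

```python
from string import ascii_uppercase


def format_mcqa(doc, question_field="question", choices_field="choices",
                answer_field="answer"):
    """Render a multiple-choice document as a lettered prompt and a target letter.

    Hypothetical helper: field names and layout are assumptions for this sketch,
    not the harness's actual `mcqa` format.
    """
    choices = doc[choices_field]
    if len(choices) > len(ascii_uppercase):
        raise ValueError("too many choices to assign single letters")

    lines = [f"Question: {doc[question_field]}"]
    # Pair each choice with a letter: A, B, C, ...
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")

    # The answer field is assumed to hold a zero-based index into the choices.
    target = ascii_uppercase[doc[answer_field]]
    return "\n".join(lines), target


doc = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "22"], "answer": 1}
prompt, target = format_mcqa(doc)
print(prompt)
print("target:", target)
```

The appeal of a declarative format is that the task author only supplies the mapping from dataset fields to question, choices, and answer, and the lettering and layout are handled in one place.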

Quick start

From the command line:

pip install "lm-eval[hf]"
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag

From Python:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

See the Quickstart guide for a complete walkthrough.
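Once the Python call above returns, the scores usually need to be pulled out of the returned object. The sketch below assumes the return value is a dict with a top-level "results" key that maps task names to metric dicts, as in v0.4; the helper `summarize` is hypothetical, and the metric key and number in the stand-in data are made up, so check the Python API page for the actual layout.

```python
def summarize(results):
    """Flatten {"results": {task: {metric: value}}} into sorted (task, metric, value) rows.

    Assumes the v0.4-style layout described above; non-numeric entries are skipped.
    """
    rows = []
    for task, metrics in results.get("results", {}).items():
        for name, value in metrics.items():
            # Keep only numeric scores; skip labels and other metadata.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, float(value)))
    return sorted(rows)


# Hypothetical stand-in for a real return value; the number is made up.
fake_results = {"results": {"hellaswag": {"acc,none": 0.5, "alias": "hellaswag"}}}

for task, name, value in summarize(fake_results):
    print(f"{task:<12} {name:<12} {value:.4f}")
```

With a real run, `fake_results` would be replaced by the `results` object from the quick start.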

Documentation

I want to...                                     Start here
Get up and running                               Quickstart
Run evaluations from CLI or Python               CLI Reference / Python API
Create or customize evaluation tasks             Your First Task
Use prompt formats to simplify task authoring    Prompt Formats
Add a model backend, scorer, or metric           Custom Model / Custom Scorers
Upgrade from v0.4                                Migrating from v0.4
Browse the API reference                         API Reference
Contribute to the project                        Contributing