Quickstart¶
Get from zero to your first evaluation results in under 5 minutes.
Install¶
This installs the harness plus the HuggingFace Transformers backend. See Installation for other backends and options.
Run your first evaluation¶
Evaluate GPT-2 on HellaSwag using the CLI:
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag --limit 100
Tip
--limit 100 evaluates on only 100 samples for a quick test. Remove it for a full evaluation run.
Or from Python:
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
    limit=100,
)
print(results["results"])
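To work with individual scores instead of printing the whole dictionary, the per-task entries can be flattened into rows. The sketch below is illustrative only. It assumes each task entry uses keys of the form `metric,filter` (for example `acc,none`) with a matching `metric_stderr,filter` key. It runs on a hand-written `sample` dictionary whose numbers are copied from the example results table in this guide, not on real output. `flatten_results` is a hypothetical helper and not part of lm_eval.

```python
def flatten_results(results_by_task):
    """Turn {task: {"metric,filter": value, ...}} into a list of row dicts."""
    rows = []
    for task, entries in results_by_task.items():
        for key, value in entries.items():
            # Skip entries that are not metrics (assumed: keys without a comma).
            if "," not in key:
                continue
            metric, filter_name = key.split(",", 1)
            # Stderr entries are attached to their metric below, not listed alone.
            if metric.endswith("_stderr"):
                continue
            stderr = entries.get(f"{metric}_stderr,{filter_name}")
            rows.append(
                {
                    "task": task,
                    "metric": metric,
                    "filter": filter_name,
                    "value": value,
                    "stderr": stderr,
                }
            )
    return rows


# Hand-written stand-in for results["results"] (assumed layout, example numbers).
sample = {
    "hellaswag": {
        "alias": "hellaswag",
        "acc,none": 0.2891,
        "acc_stderr,none": 0.0045,
        "acc_norm,none": 0.3108,
        "acc_norm_stderr,none": 0.0046,
    }
}

rows = flatten_results(sample)
for row in rows:
    print(f'{row["task"]:<10} {row["metric"]:<9} {row["value"]:.4f} ± {row["stderr"]:.4f}')
```

If the key layout assumption holds for your installed version, passing `results["results"]` in place of `sample` should yield one row per metric. Check that `stderr` is a number before formatting it, since the sketch returns `None` when no matching stderr key exists.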
Reading the output¶
After an evaluation completes, you'll see a results table like:
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|----------|------:|------|-----:|--------|-----:|---|-----:|
|hellaswag | 1|none | 0|acc |0.2891|± |0.0045|
| | |none | 0|acc_norm|0.3108|± |0.0046|
Key columns:
- Tasks: The evaluation task name
- Filter: Which output filter pipeline produced this result (e.g., none means no post-processing)
- n-shot: Number of few-shot examples used
- Metric: The scoring metric: acc is raw accuracy, acc_norm is length-normalized accuracy
- Value: The score (0–1 scale)
- Stderr: Standard error of the mean, computed via bootstrap
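As a rough guide to reading Value and Stderr together, a normal-approximation 95% interval is value ± 1.96 × stderr. The sketch below applies that rule to the two rows of the example table. `confidence_interval` is a hypothetical helper, and the 0.25 chance level assumes four answer choices per HellaSwag item.

```python
def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval: value ± z * stderr."""
    margin = z * stderr
    return value - margin, value + margin


CHANCE = 0.25  # assumption: four answer choices per item

# (metric, value, stderr) copied from the example results table.
table_rows = [("acc", 0.2891, 0.0045), ("acc_norm", 0.3108, 0.0046)]

for metric, value, stderr in table_rows:
    low, high = confidence_interval(value, stderr)
    above_chance = low > CHANCE
    print(f"{metric}: {value:.4f} (95% CI {low:.4f} to {high:.4f}), above chance: {above_chance}")
```

Because stderr shrinks as the sample count grows, a run with a small --limit will generally produce a wider interval than a full run.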
Try different prompt formats¶
Prompt formats let you change how prompts are assembled without editing YAML. Append @format_name to any task:
# Standard A/B/C/D multiple-choice
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag@mcqa --limit 100
# Cloze-style (no choice labels)
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag@cloze --limit 100
# Free generation with answer extraction
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag@generate --limit 100
Built-in formats: mcqa, cloze, generate, cot (chain-of-thought). See the Prompt Formats guide for details.
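To compare several formats in one pass, the task names can be generated instead of typed by hand. This sketch only builds and prints one command line per built-in format and does not run anything. `format_commands` is a hypothetical helper, and the sketch assumes cot is selected with the same @ suffix as the other formats.

```python
import shlex

BUILT_IN_FORMATS = ("mcqa", "cloze", "generate", "cot")


def format_commands(task, formats=BUILT_IN_FORMATS, model_args="pretrained=gpt2", limit=100):
    """Return one lm-eval command line per prompt format."""
    commands = []
    for fmt in formats:
        argv = [
            "lm-eval", "run",
            "--model", "hf",
            "--model_args", model_args,
            "--tasks", f"{task}@{fmt}",  # task name with the format suffix appended
            "--limit", str(limit),
        ]
        # shlex.join quotes arguments where needed so the line is shell-safe.
        commands.append(shlex.join(argv))
    return commands


for command in format_commands("hellaswag"):
    print(command)
```

Each printed line can be pasted into a shell, or split with `shlex.split` and passed to `subprocess.run` to execute the runs in sequence.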
Explore available tasks¶
List all built-in tasks:
List tasks matching a pattern:
Where next?¶
| I want to... | Go to |
|---|---|
| Run evaluations with different models, settings, and options | Running Evaluations |
| Create a new evaluation task or customize an existing one | Writing Tasks |
| Add a new model backend or extend the scoring pipeline | Extending the Framework |
| Understand the evaluation pipeline architecture | Concepts & Architecture |