CLI Reference
The lm-eval CLI is organized into subcommands:
| Command | Description |
|---|---|
| lm-eval run | Run evaluations on language models |
| lm-eval ls | List available tasks, groups, subtasks, or tags |
| lm-eval validate | Validate task configurations |
Run the library via the lm-eval entrypoint or python -m lm_eval.
Use -h or --help to see available options:
lm-eval -h # Show all subcommands
lm-eval run -h # Show options for run command
lm-eval ls -h # Show options for ls command
Legacy Compatibility: The original single-command interface still works. Running lm-eval --model hf --tasks hellaswag automatically inserts the run subcommand.
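The legacy behaviour can be pictured as a small rewrite of the argument list before parsing. The following is a minimal sketch, not the library's actual code: the helper name `insert_default_subcommand` and the exact rule for when `run` is inserted are assumptions made for illustration.

```python
# Hypothetical helper: illustrates the legacy rewrite, not lm-eval's real implementation.
SUBCOMMANDS = {"run", "ls", "validate"}  # the three subcommands documented here
HELP_FLAGS = {"-h", "--help"}


def insert_default_subcommand(argv: list[str]) -> list[str]:
    """Return argv with 'run' prepended when no subcommand was given."""
    if not argv:
        return argv  # bare `lm-eval`: nothing to rewrite
    first = argv[0]
    if first in SUBCOMMANDS or first in HELP_FLAGS:
        return argv  # subcommand already explicit, or top-level help requested
    return ["run", *argv]


print(insert_default_subcommand(["--model", "hf", "--tasks", "hellaswag"]))
# ['run', '--model', 'hf', '--tasks', 'hellaswag']
```

Under this assumption, `lm-eval -h` still shows the top-level help rather than the help for `run`.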
lm-eval run
Run evaluations on language models.
lm-eval run --model <model> --tasks <task> [options]
Examples
# Basic evaluation with HuggingFace model
lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag
# Multiple tasks with few-shot examples
lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5
# Custom generation parameters
lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95
# Use a YAML configuration file
lm-eval run --config my_config.yaml --tasks mmlu
Task selection
Tasks can be specified by name, group, tag, or path. You can also apply formats and navigate nested groups using special syntax:
# Apply a format at runtime with @
lm-eval run --tasks my_task@mcqa --model hf --model_args pretrained=gpt2
# Try different prompt formats on the same dataset
lm-eval run --tasks hellaswag@generate --model hf --model_args pretrained=gpt2
lm-eval run --tasks hellaswag@cloze --model hf --model_args pretrained=gpt2
# Address a specific subtask within a group with ::
lm-eval run --tasks mmlu::mmlu_anatomy --model hf --model_args pretrained=gpt2
The @format suffix tells lm-eval to apply a prompt format (e.g., mcqa, cloze, generate, cot) at runtime without modifying the task YAML.
The :: path syntax navigates nested groups: group::subgroup::task.
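To make the two suffixes concrete, here is a minimal sketch of how a task specifier could be split into its parts. `TaskSpec` and `parse_task_spec` are hypothetical names, and whether the real parser accepts `::` and `@` in the same specifier is an assumption, not something this reference states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskSpec:
    path: list[str]       # group::subgroup::task components, outermost first
    fmt: Optional[str]    # prompt format named after '@', or None if absent


def parse_task_spec(spec: str) -> TaskSpec:
    """Split a specifier such as 'mmlu::mmlu_anatomy' or 'hellaswag@cloze'."""
    name, sep, fmt = spec.partition("@")
    return TaskSpec(path=name.split("::"), fmt=fmt if sep else None)


print(parse_task_spec("mmlu::mmlu_anatomy"))
print(parse_task_spec("hellaswag@cloze"))
```

The first call yields a two-element path with no format; the second yields a one-element path with the format `cloze`.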
Model and Tasks
| Argument | Short | Description |
|---|---|---|
| --model | -M | Model type/provider name (default: hf). See supported models. |
| --model_args | -a | Model constructor arguments as key=val key2=val2 or key=val,key2=val2. |
| --tasks | -t | Space or comma-separated list of task names, groups, or tags. Supports @format suffix and :: path syntax. |
| --apply_chat_template | | Apply chat template to prompts. Use without argument for default template, or specify template name. |
| --limit | -L | Limit examples per task. Integer for count, float (0.0-1.0) for percentage. For testing only. |
| --use_cache | -c | Path prefix for SQLite cache of model responses (e.g., /path/to/cache_). |
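The two readings of --limit (a count or a fraction) can be sketched as follows. This is an illustration only: `resolve_limit` is a hypothetical helper, and both the rounding-up of fractional limits and the treatment of values of 1.0 or more as counts are assumptions.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], num_docs: int) -> int:
    """Return how many documents would be evaluated for a given --limit value."""
    if limit is None:
        return num_docs  # no limit: evaluate everything
    if isinstance(limit, float) and limit < 1.0:
        # Fractional limit: a share of the dataset (rounding up is an assumption).
        return min(num_docs, math.ceil(num_docs * limit))
    # Integer limit: an absolute count, capped at the dataset size.
    return min(num_docs, int(limit))


print(resolve_limit(10, 200))    # 10 documents
print(resolve_limit(0.5, 200))   # half of 200 documents
```

Because a limited run only sees part of each task, its scores are useful for smoke tests but should not be reported as results.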
Evaluation Settings
| Argument | Short | Description |
|---|---|---|
| --num_fewshot | -f | Number of few-shot examples in context. |
| --batch_size | -b | Batch size: integer, auto, or auto:N to auto-tune N times (default: 1). |
| --max_batch_size | | Maximum batch size when using --batch_size auto. |
| --device | | Device to use: cuda, cuda:0, cpu, mps (default: cuda). |
| --gen_kwargs | | Generation arguments as key=val key2=val2. Values parsed with ast.literal_eval. Example: temperature=0.8 'stop=["\n\n"]' |
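Since --gen_kwargs values go through ast.literal_eval, it helps to see which Python types come out. The sketch below is not the library's parser: `parse_key_value_args` is a hypothetical helper, the shlex step only imitates what the shell does to the quoted example, and the fallback to a plain string for non-literal values is an assumption.

```python
import ast
import shlex


def parse_key_value_args(arg_string: str) -> dict:
    """Parse 'key=val key2=val2' into a dict, evaluating values as Python literals."""
    parsed = {}
    for token in shlex.split(arg_string):          # honours shell-style quoting
        key, sep, raw = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=val, got {token!r}")
        try:
            parsed[key] = ast.literal_eval(raw)    # numbers, lists, True/False, None
        except (ValueError, SyntaxError):
            parsed[key] = raw                      # assumption: keep non-literals as strings
    return parsed


kwargs = parse_key_value_args(r"""temperature=0.8 top_p=0.95 'stop=["\n\n"]'""")
print(kwargs)
```

Here temperature and top_p become floats, and stop becomes a one-element list holding two newline characters. The single quotes matter on the command line: without them the shell would strip the double quotes before the value reaches the parser.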
Data and Output
| Argument | Short | Description |
|---|---|---|
| --output_path | -o | Output directory or JSON file for results. Required with --log_samples. |
| --log_samples | -s | Save all model inputs/outputs for post-hoc analysis. |
| --samples | -E | JSON mapping task names to sample indices, e.g., '{"task1": [0,1,2]}'. Incompatible with --limit. |
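The shape expected by --samples, and its conflict with --limit, can be checked before launching a run. A minimal sketch, assuming a hypothetical `parse_samples` helper; the specific validation rules and error messages are illustrative, not taken from the library.

```python
import json
from typing import Optional, Union


def parse_samples(samples_json: str,
                  limit: Optional[Union[int, float]] = None) -> dict[str, list[int]]:
    """Decode a --samples value and reject it when --limit is also set."""
    if limit is not None:
        raise ValueError("--samples and --limit cannot be combined")
    mapping = json.loads(samples_json)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object mapping task names to index lists")
    for task, indices in mapping.items():
        if not isinstance(indices, list) or not all(
            isinstance(i, int) and i >= 0 for i in indices
        ):
            raise ValueError(f"indices for {task!r} must be a list of non-negative integers")
    return mapping


print(parse_samples('{"task1": [0, 1, 2]}'))
```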
| Argument | Description |
|---|---|
| --cache_requests | Cache preprocessed prompts: true, refresh, or delete. Cached files are stored in lm_eval/caching/.cache or in the path set by the LM_HARNESS_CACHE_PATH env var. |
| --check_integrity | Run task test suite validation before evaluation. |
| Argument | Description |
|---|---|
| --system_instruction | Custom system instruction prepended to prompts. |
| --fewshot_as_multiturn | Format few-shot examples as multi-turn conversation. Auto-enabled with --apply_chat_template. Set to false to disable. |
Task Management
| Argument | Description |
|---|---|
| --include_path | Additional directory containing external task YAML files. |
Logging and Tracking
| Argument | Short | Description |
|---|---|---|
| --verbosity | -v | (Deprecated) Use LMEVAL_LOG_LEVEL env var instead. |
| --write_out | -w | Print prompts for first few documents (for debugging). |
| --show_config | | Display full task configuration after evaluation. |
| --wandb_args | | Weights & Biases arguments as key=val. E.g., project=my-project name=run-1. |
| --wandb_config_args | | Additional W&B config arguments. |
| --hf_hub_log_args | | HuggingFace Hub logging arguments. See HF Hub Logging. |
Advanced Options
| Argument | Short | Description |
|---|---|---|
| --predict_only | -x | Save predictions only, skip metric computation. Implies --log_samples. |
| --seed | | Random seeds as single integer or comma-separated list for python,numpy,torch,fewshot. Default: 0,1234,1234,1234. Use None to skip. Example: --seed 42 or --seed 0,None,8,52. |
| --trust_remote_code | | Allow executing remote code from HuggingFace Hub. |
| --confirm_run_unsafe_code | | Confirm understanding of risks for tasks executing arbitrary Python. |
| --metadata | | JSON string passed to TaskConfig. Required for some tasks like RULER. Example: --metadata '{"max_seq_length": 4096}'. |
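The --seed format packs four seeds into one string. The sketch below shows one way to read it; `parse_seed` is a hypothetical helper, and reusing a single integer for all four slots is an assumption about how a lone value is interpreted.

```python
from typing import Optional

SEED_NAMES = ("python", "numpy", "torch", "fewshot")  # slot order given for --seed


def parse_seed(value: str) -> dict[str, Optional[int]]:
    """Map a --seed string onto the four seed slots; 'None' leaves a slot unseeded."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) == 1:
        parts = parts * len(SEED_NAMES)  # assumption: one value is reused for every slot
    if len(parts) != len(SEED_NAMES):
        raise ValueError(f"expected 1 or {len(SEED_NAMES)} values, got {len(parts)}")
    return {
        name: None if part == "None" else int(part)
        for name, part in zip(SEED_NAMES, parts)
    }


print(parse_seed("0,None,8,52"))
# {'python': 0, 'numpy': None, 'torch': 8, 'fewshot': 52}
```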
Configuration File
| Argument | Short | Description |
|---|---|---|
| --config | -C | Path to YAML configuration file. CLI arguments override config file values. See Configuration Files. |
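The precedence rule (CLI over config file) amounts to overlaying one mapping on another. A minimal sketch with plain dictionaries: `merge_settings` is a hypothetical helper, and the config keys shown are assumed to mirror the CLI option names.

```python
from typing import Any


def merge_settings(config_values: dict[str, Any], cli_values: dict[str, Any]) -> dict[str, Any]:
    """Overlay CLI values on config-file values; None means 'not given on the CLI'."""
    merged = dict(config_values)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


# Values as they might be loaded from my_config.yaml (keys are assumptions).
config_values = {"model": "hf", "tasks": ["hellaswag"], "num_fewshot": 5}
# Values from the command line: only --tasks mmlu was passed.
cli_values = {"tasks": ["mmlu"], "num_fewshot": None}

print(merge_settings(config_values, cli_values))
# {'model': 'hf', 'tasks': ['mmlu'], 'num_fewshot': 5}
```

This matches the earlier example `lm-eval run --config my_config.yaml --tasks mmlu`, where the task list from the command line replaces whatever the file specifies.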
HuggingFace Hub Logging
The --hf_hub_log_args argument accepts these keys:
| Key | Description |
|---|---|
| hub_results_org | Organization name on HF Hub. Defaults to token owner. |
| details_repo_name | Repository name for detailed results. |
| results_repo_name | Repository name for aggregated results. |
| push_results_to_hub | True/False - push results to Hub. |
| push_samples_to_hub | True/False - push samples to Hub. Requires --log_samples. |
| public_repo | True/False - make repository public. |
| leaderboard_url | URL to associated leaderboard. |
| point_of_contact | Contact email for results dataset. |
| gated | True/False - gate the details dataset. |
lm-eval ls
List available tasks, groups, subtasks, or tags.
lm-eval ls [tasks|groups|subtasks|tags] [--include_path DIR] [--pattern GLOB]
Arguments
| Argument | Description |
|---|---|
| tasks | List all available tasks (groups, subtasks, and tags). |
| groups | List only task groups (e.g., mmlu, glue, superglue). |
| subtasks | List only individual subtasks (e.g., mmlu_anatomy, hellaswag). |
| tags | List task tags (e.g., reasoning, knowledge). |
| --include_path | Additional directory for external task definitions. |
| --pattern | Filter tasks matching a glob pattern (e.g., "mmlu*"). |
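Glob filtering of the kind --pattern describes can be reproduced with the standard library. This is a sketch of the matching semantics only: `filter_tasks` is a hypothetical helper, the task names are sample input, and whether lm-eval matches case-sensitively is an assumption.

```python
import fnmatch


def filter_tasks(task_names: list[str], pattern: str) -> list[str]:
    """Keep the task names that match a shell-style glob pattern (case-sensitive)."""
    return sorted(name for name in task_names if fnmatch.fnmatchcase(name, pattern))


names = ["arc_easy", "arc_challenge", "hellaswag", "mmlu_anatomy"]
print(filter_tasks(names, "arc*"))
# ['arc_challenge', 'arc_easy']
```

Quote the pattern on the command line so the shell passes it through instead of expanding it against local file names.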
Task Organization
- Groups: Collections of related tasks with aggregated metrics across subtasks (e.g., mmlu contains 57 subtasks)
- Subtasks: Individual evaluation tasks (e.g., mmlu_anatomy, hellaswag)
- Tags: Categories for filtering tasks without aggregated metrics (e.g., reasoning, language)
Examples
# List all tasks
lm-eval ls tasks
# List only task groups
lm-eval ls groups
# Filter tasks by pattern
lm-eval ls tasks --pattern "arc*"
# Include external tasks
lm-eval ls tasks --include_path /path/to/external/tasks
lm-eval validate
Validate task configurations before running evaluations.
lm-eval validate --tasks <task1,task2> [--include_path DIR]
Arguments
| Argument | Short | Description |
|---|---|---|
| --tasks | -t | (Required) Comma-separated list of task names to validate. |
| --include_path | | Additional directory for external task definitions. |
Validation Checks
The validate command performs:
- Task existence: Verifies all specified tasks are available
- Configuration syntax: Checks YAML/JSON configuration files
- Dataset access: Validates dataset paths and configurations
- Required fields: Ensures all mandatory task parameters are present
- Metric definitions: Verifies metric functions and aggregation methods
- Filter pipelines: Validates filter chains and their parameters
- Template rendering: Tests prompt templates with sample data
Examples
# Validate a single task
lm-eval validate --tasks hellaswag
# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag
# Validate a task group
lm-eval validate --tasks mmlu
# Validate external tasks
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
Environment Variables
| Variable | Description |
|---|---|
| LMEVAL_LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR). |
| LM_HARNESS_CACHE_PATH | Path for cached requests (default: lm_eval/caching/.cache). |
| LM_EVAL_DATASET_DIR | Local fallback directory for datasets. If set, checked before downloading from HuggingFace Hub. |
| HF_TOKEN | HuggingFace Hub token for private datasets/models. |
| TOKENIZERS_PARALLELISM | Set to false to avoid tokenizer warnings (auto-set by CLI). |
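When the CLI is driven from a script, these variables can be set per invocation rather than globally. The sketch below only builds the command and environment; `build_invocation` is a hypothetical helper, and it relies on the `python -m lm_eval` entrypoint mentioned at the top of this page.

```python
import os
import sys
from typing import Optional


def build_invocation(cli_args: list[str],
                     log_level: str = "INFO",
                     cache_path: Optional[str] = None) -> tuple[list[str], dict[str, str]]:
    """Return (command, env) for running the lm-eval CLI as a child process."""
    env = dict(os.environ)                  # copy, so the parent environment is untouched
    env["LMEVAL_LOG_LEVEL"] = log_level
    if cache_path is not None:
        env["LM_HARNESS_CACHE_PATH"] = cache_path
    command = [sys.executable, "-m", "lm_eval", *cli_args]
    return command, env


command, env = build_invocation(["ls", "tasks"], log_level="DEBUG")
print(command[1:])
# To actually run it (requires lm-eval to be installed):
#   import subprocess; subprocess.run(command, env=env, check=True)
```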