CLI Reference¶

The lm-eval CLI is organized into subcommands:

Command	Description
`lm-eval run`	Run evaluations on language models
`lm-eval ls`	List available tasks, groups, subtasks, or tags
`lm-eval validate`	Validate task configurations

Run the library via the lm-eval entrypoint or python -m lm_eval.

Use -h or --help to see available options:

lm-eval -h              # Show all subcommands
lm-eval run -h          # Show options for run command
lm-eval ls -h           # Show options for ls command

Legacy Compatibility: The original single-command interface still works. Running lm-eval --model hf --tasks hellaswag automatically inserts the run subcommand.
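
The legacy behaviour described above can be pictured as a small rewrite of the argument list before parsing. The following is a minimal sketch under stated assumptions, not lm-eval's actual implementation: the function name `insert_default_subcommand` and the exact rules (leave explicit subcommands and top-level help untouched, prepend `run` to anything else that starts with a flag) are hypothetical.

```python
# Hypothetical sketch: not lm-eval's actual implementation.
SUBCOMMANDS = {"run", "ls", "validate"}
HELP_FLAGS = {"-h", "--help"}


def insert_default_subcommand(argv):
    """Return a copy of argv with 'run' prepended for legacy-style calls."""
    if not argv:
        return list(argv)
    first = argv[0]
    # Explicit subcommands and top-level help requests are left untouched.
    if first in SUBCOMMANDS or first in HELP_FLAGS:
        return list(argv)
    # Any other leading flag is treated as the old single-command interface.
    if first.startswith("-"):
        return ["run", *argv]
    return list(argv)


legacy = ["--model", "hf", "--tasks", "hellaswag"]
print(insert_default_subcommand(legacy))
```

With this rule, `lm-eval --model hf --tasks hellaswag` and `lm-eval run --model hf --tasks hellaswag` reach the parser as the same argument list.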

`lm-eval run`¶

Run evaluations on language models.

lm-eval run --model <model> --tasks <task> [options]

Examples¶

# Basic evaluation with HuggingFace model
lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag

# Multiple tasks with few-shot examples
lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5

# Custom generation parameters
lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95

# Use a YAML configuration file
lm-eval run --config my_config.yaml --tasks mmlu
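
For scripted runs, the same commands can be assembled as an argument list instead of a hand-written shell string. This is a sketch only: the helper `build_run_command` is hypothetical, it uses only flags shown in the examples above, and it prints the command rather than executing it.

```python
import shlex
import sys


def build_run_command(model, model_args, tasks, num_fewshot=None):
    """Assemble an argv list for `python -m lm_eval run` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "lm_eval", "run", "--model", model]
    if model_args:
        # Space-separated key=val pairs, as in the examples above.
        cmd += ["--model_args", *[f"{k}={v}" for k, v in model_args.items()]]
    cmd += ["--tasks", *tasks]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


cmd = build_run_command(
    "hf", {"pretrained": "gpt2"}, ["arc_easy", "arc_challenge"], num_fewshot=5
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd[1:]))
```

Keeping the command as a list avoids quoting problems when an argument contains spaces or brackets.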

Task selection¶

Tasks can be specified by name, group, tag, or path. You can also apply formats and navigate nested groups using special syntax:

# Apply a format at runtime with @
lm-eval run --tasks my_task@mcqa --model hf --model_args pretrained=gpt2

# Try different prompt formats on the same dataset
lm-eval run --tasks hellaswag@generate --model hf --model_args pretrained=gpt2
lm-eval run --tasks hellaswag@cloze --model hf --model_args pretrained=gpt2

# Address a specific subtask within a group with ::
lm-eval run --tasks mmlu::mmlu_anatomy --model hf --model_args pretrained=gpt2

The @format suffix tells lm-eval to apply a prompt format (e.g., mcqa, cloze, generate, cot) at runtime without modifying the task YAML.

The :: path syntax navigates nested groups: group::subgroup::task.
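
To make the two pieces of syntax concrete, here is a minimal sketch of how a task specifier could be split into its parts. It is illustrative only: `TaskSpec` and `parse_task_spec` are hypothetical names, and the sketch assumes the `@format` suffix, when present, follows the full `::` path.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskSpec:
    path: List[str]      # components of group::subgroup::task
    fmt: Optional[str]   # prompt format after '@', or None


def parse_task_spec(spec: str) -> TaskSpec:
    """Split 'group::task@format' into path components and format."""
    name, sep, fmt = spec.partition("@")
    if sep and not fmt:
        raise ValueError(f"empty format in task spec: {spec!r}")
    return TaskSpec(path=name.split("::"), fmt=fmt if sep else None)


for spec in ("hellaswag@cloze", "mmlu::mmlu_anatomy", "hellaswag"):
    print(spec, "->", parse_task_spec(spec))
```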

Model and Tasks¶

Argument	Short	Description
`--model`	`-M`	Model type/provider name (default: `hf`). See supported models.
`--model_args`	`-a`	Model constructor arguments as `key=val key2=val2` or `key=val,key2=val2`.
`--tasks`	`-t`	Space or comma-separated list of task names, groups, or tags. Supports `@format` suffix and `::` path syntax.
`--apply_chat_template`		Apply chat template to prompts. Use without argument for default template, or specify template name.
`--limit`	`-L`	Limit examples per task. Integer for an absolute count, float between 0.0 and 1.0 for a fraction of the dataset. Intended for testing only.
`--use_cache`	`-c`	Path prefix for SQLite cache of model responses (e.g., `/path/to/cache_`).
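
The two forms of `--limit` can be illustrated with a short sketch. The semantics below are assumptions made for illustration: values below 1.0 are read as a fraction of the dataset and rounded up, anything else is read as a count, and the result is capped at the dataset size. The helper name `resolve_limit` is hypothetical.

```python
import math


def resolve_limit(limit, n_docs):
    """Translate a --limit value into a number of examples (assumed semantics)."""
    if limit is None:
        return n_docs
    if limit < 1.0:
        # Fractional form: share of the dataset, rounded up.
        count = math.ceil(n_docs * limit)
    else:
        # Integer form: absolute number of examples.
        count = int(limit)
    return min(count, n_docs)


print(resolve_limit(100, 10000))
print(resolve_limit(0.25, 200))
```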

Evaluation Settings¶

Argument	Short	Description
`--num_fewshot`	`-f`	Number of few-shot examples in context.
`--batch_size`	`-b`	Batch size: integer, `auto`, or `auto:N` to re-detect the largest fitting batch size N times during the run (default: 1).
`--max_batch_size`		Maximum batch size when using `--batch_size auto`.
`--device`		Device to use: `cuda`, `cuda:0`, `cpu`, `mps` (default: `cuda`).
`--gen_kwargs`		Generation arguments as `key=val key2=val2`. Values parsed with `ast.literal_eval`. Example: `temperature=0.8 'stop=["\n\n"]'`
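
Because `--gen_kwargs` values go through `ast.literal_eval`, numbers, booleans, and lists arrive as Python objects rather than strings. The sketch below shows the idea; the helper `parse_kwargs` is hypothetical, and the fallback to a plain string for values that are not valid literals is an assumption, not confirmed behaviour.

```python
import ast


def parse_kwargs(items):
    """Parse ['key=val', ...] into a dict, evaluating values as Python literals."""
    parsed = {}
    for item in items:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=val, got {item!r}")
        try:
            parsed[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Assumption: values that are not literals are kept as strings.
            parsed[key] = raw
    return parsed


# Once the shell has removed the quoting, the example above is two argv items.
print(parse_kwargs(["temperature=0.8", 'stop=["\\n\\n"]']))
```

The single quotes around `stop=["\n\n"]` on the command line matter: without them the shell would strip the inner double quotes before the value is parsed.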

Data and Output¶

Argument	Short	Description
`--output_path`	`-o`	Output directory or JSON file for results. Required with `--log_samples`.
`--log_samples`	`-s`	Save all model inputs/outputs for post-hoc analysis.
`--samples`	`-E`	JSON mapping task names to sample indices, e.g., `'{"task1": [0,1,2]}'`. Incompatible with `--limit`.
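
Writing the `--samples` JSON by hand is error-prone once several tasks are involved. A minimal sketch that builds the argument from a Python mapping follows; the helper name `samples_argument` and the task names are placeholders.

```python
import json
import shlex


def samples_argument(samples):
    """Return a shell-safe '--samples ...' fragment for a task -> indices mapping."""
    for task, indices in samples.items():
        if not all(isinstance(i, int) and i >= 0 for i in indices):
            raise ValueError(f"indices for {task!r} must be non-negative integers")
    # shlex.quote protects the braces, brackets, and double quotes from the shell.
    return "--samples " + shlex.quote(json.dumps(samples))


samples = {"task1": [0, 1, 2], "task2": [10, 42]}
print(samples_argument(samples))
```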

Caching and Performance¶

Argument	Description
`--cache_requests`	Cache preprocessed prompts: `true`, `refresh`, or `delete`. Cached files are stored in `lm_eval/caching/.cache` or the path set by the `LM_HARNESS_CACHE_PATH` env var.
`--check_integrity`	Run task test suite validation before evaluation.

Prompt Formatting¶

Argument	Description
`--system_instruction`	Custom system instruction prepended to prompts.
`--fewshot_as_multiturn`	Format few-shot examples as multi-turn conversation. Auto-enabled with `--apply_chat_template`. Set to `false` to disable.

Task Management¶

Argument	Description
`--include_path`	Additional directory containing external task YAML files.

Logging and Tracking¶

Argument	Short	Description
`--verbosity`	`-v`	(Deprecated) Use `LMEVAL_LOG_LEVEL` env var instead.
`--write_out`	`-w`	Print prompts for first few documents (for debugging).
`--show_config`		Display full task configuration after evaluation.
`--wandb_args`		Weights & Biases arguments as `key=val`. E.g., `project=my-project name=run-1`.
`--wandb_config_args`		Additional W&B config arguments.
`--hf_hub_log_args`		HuggingFace Hub logging arguments. See HF Hub Logging.

Advanced Options¶

Argument	Short	Description
`--predict_only`	`-x`	Save predictions only, skip metric computation. Implies `--log_samples`.
`--seed`		Random seeds as single integer or comma-separated list for `python,numpy,torch,fewshot`. Default: `0,1234,1234,1234`. Use `None` to skip. Example: `--seed 42` or `--seed 0,None,8,52`.
`--trust_remote_code`		Allow executing remote code from HuggingFace Hub.
`--confirm_run_unsafe_code`		Confirm understanding of risks for tasks executing arbitrary Python.
`--metadata`		JSON string passed to TaskConfig. Required for some tasks like RULER. Example: `--metadata '{"max_seq_length": 4096}'`.

Configuration File¶

Argument	Short	Description
`--config`	`-C`	Path to YAML configuration file. CLI arguments override config file values. See Configuration Files.
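
The precedence rule (CLI arguments override config file values) amounts to a dictionary merge in which only explicitly supplied CLI values win. A minimal sketch, assuming unset CLI options are represented as `None` and that config keys mirror the flag names; both are illustrative assumptions, and `merge_settings` is a hypothetical helper.

```python
def merge_settings(config, cli):
    """Overlay CLI values on config values; None means 'not given on the CLI'."""
    merged = dict(config)
    for key, value in cli.items():
        if value is not None:
            merged[key] = value
    return merged


config = {"model": "hf", "tasks": ["hellaswag"], "num_fewshot": 5}
cli = {"tasks": ["mmlu"], "num_fewshot": None}
print(merge_settings(config, cli))
```

Here `tasks` comes from the command line while `model` and `num_fewshot` keep their config file values, matching the `--config my_config.yaml --tasks mmlu` example earlier on this page.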

HuggingFace Hub Logging¶

The --hf_hub_log_args argument accepts these keys:

Key	Description
`hub_results_org`	Organization name on HF Hub. Defaults to token owner.
`details_repo_name`	Repository name for detailed results.
`results_repo_name`	Repository name for aggregated results.
`push_results_to_hub`	`True`/`False` - push results to Hub.
`push_samples_to_hub`	`True`/`False` - push samples to Hub. Requires `--log_samples`.
`public_repo`	`True`/`False` - make repository public.
`leaderboard_url`	URL to associated leaderboard.
`point_of_contact`	Contact email for results dataset.
`gated`	`True`/`False` - gate the details dataset.

`lm-eval ls`¶

List available tasks, groups, subtasks, or tags.

lm-eval ls [tasks|groups|subtasks|tags] [--include_path DIR]

Arguments¶

Argument	Description
`tasks`	List all available tasks (groups, subtasks, and tags).
`groups`	List only task groups (e.g., `mmlu`, `glue`, `superglue`).
`subtasks`	List only individual subtasks (e.g., `mmlu_anatomy`, `hellaswag`).
`tags`	List task tags (e.g., `reasoning`, `knowledge`).
`--include_path`	Additional directory for external task definitions.
`--pattern`	Filter tasks matching a glob pattern (e.g., `"mmlu*"`).
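
Glob patterns such as `"mmlu*"` follow shell-style wildcard rules. As a rough illustration (assuming `--pattern` behaves like Python's `fnmatch` module, which this page does not confirm), filtering a list of task names looks like this; `filter_tasks` is a hypothetical helper.

```python
import fnmatch


def filter_tasks(names, pattern):
    """Keep the names that match a shell-style glob pattern (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


names = ["arc_easy", "arc_challenge", "hellaswag", "mmlu_anatomy"]
print(filter_tasks(names, "arc*"))
```

Quote the pattern on the command line so the shell passes it through unexpanded.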

Task Organization¶

Groups: Collections of related tasks with aggregated metrics across subtasks (e.g., mmlu contains 57 subtasks)
Subtasks: Individual evaluation tasks (e.g., mmlu_anatomy, hellaswag)
Tags: Categories for filtering tasks without aggregated metrics (e.g., reasoning, language)

Examples¶

# List all tasks
lm-eval ls tasks

# List only task groups
lm-eval ls groups

# Filter tasks by pattern
lm-eval ls tasks --pattern "arc*"

# Include external tasks
lm-eval ls tasks --include_path /path/to/external/tasks

`lm-eval validate`¶

Validate task configurations before running evaluations.

lm-eval validate --tasks <task1,task2> [--include_path DIR]

Arguments¶

Argument	Short	Description
`--tasks`	`-t`	(Required) Comma-separated list of task names to validate.
`--include_path`		Additional directory for external task definitions.

Validation Checks¶

The validate command performs the following checks:

Task existence: Verifies all specified tasks are available
Configuration syntax: Checks YAML/JSON configuration files
Dataset access: Validates dataset paths and configurations
Required fields: Ensures all mandatory task parameters are present
Metric definitions: Verifies metric functions and aggregation methods
Filter pipelines: Validates filter chains and their parameters
Template rendering: Tests prompt templates with sample data

Examples¶

# Validate a single task
lm-eval validate --tasks hellaswag

# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag

# Validate a task group
lm-eval validate --tasks mmlu

# Validate external tasks
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks

Environment Variables¶

Variable	Description
`LMEVAL_LOG_LEVEL`	Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`).
`LM_HARNESS_CACHE_PATH`	Path for cached requests (default: `lm_eval/caching/.cache`).
`LM_EVAL_DATASET_DIR`	Local fallback directory for datasets. If set, checked before downloading from HuggingFace Hub.
`HF_TOKEN`	HuggingFace Hub token for private datasets/models.
`TOKENIZERS_PARALLELISM`	Set to `false` to avoid tokenizer warnings (auto-set by CLI).
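
When launching lm-eval from a script, these variables can be set on a copy of the environment and handed to the child process. The sketch below is illustrative: `harness_env` is a hypothetical helper and the cache path is a placeholder.

```python
import os


def harness_env(log_level="INFO", cache_path=None, base=None):
    """Build an environment mapping for a child lm-eval process."""
    env = dict(os.environ if base is None else base)
    env["LMEVAL_LOG_LEVEL"] = log_level
    if cache_path is not None:
        env["LM_HARNESS_CACHE_PATH"] = cache_path
    # Keep any value the caller already set; otherwise silence tokenizer warnings.
    env.setdefault("TOKENIZERS_PARALLELISM", "false")
    return env


env = harness_env("DEBUG", cache_path="/tmp/lm_eval_cache", base={})
print(env)
```

The resulting mapping can be passed as `env=` to `subprocess.run`, which leaves the parent process's environment unchanged.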
