Task Class¶
The Task class defines a single evaluation benchmark — how to load data, format prompts, and score responses.
Task
¶
Task(config: TaskConfig | dict[str, Any])
A task represents an entire benchmark, including its dataset, problems, answers, and evaluation methods.
See BoolQ for a simple example implementation.
A doc can be any python object that represents one instance of evaluation.
This is usually a dictionary e.g.
Source code in lm_eval/api/task/_task.py
Attributes¶
OUTPUT_TYPE
class-attribute
instance-attribute
¶
sampler
cached
property
¶
Lazily create the fewshot sampler (triggers dataset download on first access).
instances
property
¶
Dataset instances which will be evaluated.
Populated after calling task.build_all_requests().
Functions¶
from_config
classmethod
¶
from_config(config: TaskConfig | dict[str, Any])
Factory method to create the appropriate Task subclass based on output_type.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
TaskConfig instance or dict with task configuration
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
|
Instance of the appropriate Task subclass (GenerateTask, MultipleChoiceTask, etc.) |
Source code in lm_eval/api/task/_task.py
count_bytes
staticmethod
¶
count_words
staticmethod
¶
Downstream loglikelihood_rolling perplexity tasks with custom word boundaries should override this!
download
¶
Source code in lm_eval/api/task/_task.py
has_training_docs
¶
has_validation_docs
¶
has_test_docs
¶
training_docs
¶
validation_docs
¶
test_docs
¶
fewshot_docs
¶
Source code in lm_eval/api/task/_task.py
get_docs
¶
doc_iterator
¶
doc_iterator(*, rank: int = 0, limit: int | None = None, world_size: int = 1, samples: Sequence[int] | None = None) -> Iterator[tuple[int, Any]]
Source code in lm_eval/api/task/_task.py
fewshot_context
¶
fewshot_context(doc: dict, num_fewshot: int, system_instruction: str | None = None, apply_chat_template: bool = False, fewshot_as_multiturn: bool = False, chat_template: ChatTemplate | None = None, gen_prefix: str | None = None) -> str | list[str]
Build the full prompt context including system prompt, few-shot examples, and eval doc.
Constructs a complete prompt by:
1. Adding system instruction + task description (if provided)
2. Adding num_fewshot labeled examples from the fewshot split
3. Adding the evaluation document (without its answer)
Each component is built using build_qa_turn() and can be rendered as plain
text or formatted via a chat template.
| PARAMETER | DESCRIPTION |
|---|---|
doc
|
The evaluation document to build context for.
TYPE:
|
num_fewshot
|
Number of few-shot examples to include.
TYPE:
|
system_instruction
|
System instruction to prepend to the prompt.
TYPE:
|
apply_chat_template
|
If True, format output using the chat template.
TYPE:
|
fewshot_as_multiturn
|
If True, keep few-shot examples as separate user/assistant turns. If False, collapse into a single user message.
TYPE:
|
chat_template
|
Renders a list of message dicts to a string.
TYPE:
|
gen_prefix
|
Prefix to start the assistant's response (e.g., "Answer:").
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str | list[str]
|
str | list[str]: The formatted prompt string, or a list of strings for multiple-input tasks (e.g., Winogrande where each choice becomes a separate context). |
Source code in lm_eval/api/task/_task.py
369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 | |
construct_requests
abstractmethod
¶
construct_requests(doc: dict[str, Any], ctx: str | Sequence[str] | list[dict[str, Any]], *, doc_id: int, metadata: dict[str, Any] | None = None, apply_chat_template: bool = False, chat_template: ChatTemplate | None = None, **kwargs) -> list[GenInstance] | list[LLInstance] | None
Convert a doc and its prompt context into Instance objects for the LM.
Called by build_all_requests after fewshot_context has produced
the prompt. Each subclass maps the prompt into the request format its
output type requires (loglikelihood pairs, generation args, etc.).
| PARAMETER | DESCRIPTION |
|---|---|
doc
|
The evaluation document from the dataset split.
TYPE:
|
ctx
|
The prompt produced by
TYPE:
|
doc_id
|
Index of the document within the evaluation split.
TYPE:
|
metadata
|
Per-instance metadata forwarded to the Instance.
TYPE:
|
apply_chat_template
|
Whether a chat template was applied.
TYPE:
|
chat_template
|
The chat template callable, if any.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[GenInstance] | list[LLInstance] | None
|
A list of Instances to send to the LM, or None to skip this doc. |
Source code in lm_eval/api/task/_task.py
build_all_requests
¶
build_all_requests(*, limit: int | None = None, samples: Sequence[int] | None = None, rank: int = 0, world_size: int = 1, cache_requests: bool = False, rewrite_requests_cache: bool = False, system_instruction: str | None = None, apply_chat_template: bool = False, fewshot_as_multiturn: bool = False, chat_template: ChatTemplate | None = None, tokenizer_name: str = '') -> list[Instance]
Build all Instance objects for this task and store them in self._instances.
For each document in the evaluation split this method:
1. Builds the prompt via fewshot_context.
2. Converts it to Instance(s) via construct_requests.
3. Optionally loads/saves results from a request cache.
| PARAMETER | DESCRIPTION |
|---|---|
limit
|
Maximum number of documents to evaluate (None = all).
TYPE:
|
samples
|
Explicit list of document indices to evaluate.
TYPE:
|
rank
|
Worker rank for distributed evaluation.
TYPE:
|
world_size
|
Total number of workers.
TYPE:
|
cache_requests
|
Whether to load/save instances from cache.
TYPE:
|
rewrite_requests_cache
|
Force-rebuild the cache even if it exists.
TYPE:
|
system_instruction
|
System prompt prepended to every context.
TYPE:
|
apply_chat_template
|
Whether to render prompts through a chat template.
TYPE:
|
fewshot_as_multiturn
|
Keep few-shot examples as separate chat turns instead of collapsing them into a single user message.
TYPE:
|
chat_template
|
The chat template callable.
TYPE:
|
tokenizer_name
|
Included in the cache key to distinguish tokenizers.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[Instance]
|
Flat list of Instances, also stored in |
Source code in lm_eval/api/task/_task.py
641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 | |
doc_to_text
¶
doc_to_text(doc: Doc, doc_to_text: Callable[[Doc], str | list[str]] | str | None = None) -> str | list[str] | None
Source code in lm_eval/api/task/_task.py
doc_to_choice
¶
doc_to_choice(doc: Doc, doc_to_choice: Callable[[Doc], list[str]] | str | list[str] | None = None) -> list[str] | None
Source code in lm_eval/api/task/_task.py
doc_to_target
¶
doc_to_target(doc: Doc, doc_to_target: Callable[[Doc], str | int | list[int] | list[str]] | str | None = None) -> str | int | list[str] | list[int] | None
Source code in lm_eval/api/task/_task.py
doc_to_image
¶
doc_to_audio
¶
apply_filters
¶
Apply filter ensembles from each scorer to instances.
process_instances
¶
Apply filters, score instances, reduce — all stored on Scorers.
For each scorer, tries the legacy process_results path first
(YAML !function or Python subclass override). Falls through to
scorer.score_instances() only when process_results returns
None.
Source code in lm_eval/api/task/_task.py
process_results
¶
process_results(doc: dict[str, Any], results: Sequence[LLOutput] | Sequence[Completion]) -> dict[str, list[Any]] | None
Source code in lm_eval/api/task/_task.py
aggregate
¶
Aggregate all scorers' reduced results.
Returns (agg_dict, sample_len) where agg_dict has "metric,scorer" string keys. This is the only place where string keys are produced.
Legacy Python tasks that override aggregation() get their custom
functions forwarded to each scorer so that corpus-level metrics
(e.g. SQuAD v2, SCROLLS) are aggregated correctly instead of
falling back to mean.
Source code in lm_eval/api/task/_task.py
aggregation
¶
higher_is_better
¶
get_config
¶
set_config
¶
Set or update the configuration for a given key.
Source code in lm_eval/api/task/_task.py
override_metric
¶
Override the default metrics with a single named metric.
Rebuilds the scorer pipeline so that only metric_name is computed.
Used by the evaluator for predict_only mode (metric="bypass").
Source code in lm_eval/api/task/_task.py
set_repeats
¶
Override the default number of repeats this task.
Source code in lm_eval/api/task/_task.py
set_num_fewshot
¶
Override the default number of fewshot examples for this task.
Source code in lm_eval/api/task/_task.py
set_fewshot_seed
¶
Source code in lm_eval/api/task/_task.py
dump_config
¶
process_doc
staticmethod
¶
Process (detokenize, strip, replace, etc.) an individual document.
Override this to transform documents. Can be used in a map over a data split,
e.g. map(self._process_doc, self.dataset["validation"]).
| RETURNS | DESCRIPTION |
|---|---|
dict
|
The processed version of the specified |