Skip to content

Entry Points

The main functions for running evaluations programmatically.

Source

simple_evaluate

simple_evaluate(model: str | LM, model_args: str | dict[str, str | int | float] | None = None, tasks: list[str | dict[str, Any] | Task] | None = None, num_fewshot: int | None = None, repeats: int | None = None, batch_size: int | str | None = None, max_batch_size: int | None = None, device: str | None = None, use_cache: str | None = None, cache_requests: bool = False, rewrite_requests_cache: bool = False, delete_requests_cache: bool = False, limit: int | float | None = None, samples: dict[str, list[int]] | None = None, bootstrap_iters: int = 100000, check_integrity: bool = False, write_out: bool = False, log_samples: bool = True, evaluation_tracker: EvaluationTracker | None = None, system_instruction: str | None = None, apply_chat_template: bool | str = False, fewshot_as_multiturn: bool = True, gen_kwargs: str | dict[str, str | float | int] | None = None, task_manager: TaskManager | None = None, verbosity=None, predict_only: bool = False, random_seed: int = DEFAULT_RANDOM_SEED, numpy_random_seed: int = DEFAULT_OTHER_SEED, torch_random_seed: int = DEFAULT_OTHER_SEED, fewshot_random_seed: int = DEFAULT_OTHER_SEED, confirm_run_unsafe_code: bool = False, metadata: dict[str, Any] | None = None) -> EvalResults | None

High-level entry point for evaluation.

Handles model instantiation (from a name string or pre-initialized LM object), task loading via TaskManager, random seed setup, and per-task config overrides (num_fewshot, gen_kwargs, repeats). Delegates the actual inference and metric computation to evaluate, then attaches run metadata (git hash, environment info, tokenizer details) to the returned results.

PARAMETER DESCRIPTION
model

Name of model or LM object. See lm_eval.models.init.py for available aliases.

TYPE: str | LM

model_args

String or dict arguments for each model class, e.g., "pretrained=EleutherAI/pythia-1.3B,revision=main" or {"pretrained": "EleutherAI/pythia-1.3B"}. Ignored if model argument is a LM object.

TYPE: str | dict[str, str | int | float] | None DEFAULT: None

tasks

List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).name otherwise.

TYPE: list[str | dict | Task] DEFAULT: None

num_fewshot

Number of examples in few-shot context.

TYPE: int DEFAULT: None

repeats

Number of times to repeat each request. Overrides task-level config. Only effective for generative tasks.

TYPE: int | None DEFAULT: None

batch_size

Batch size for model.

TYPE: int | str | None DEFAULT: None

max_batch_size

Maximal batch size to try with automatic batch size detection.

TYPE: int | None DEFAULT: None

device

PyTorch device (e.g. "cpu" or "cuda:0") for running models.

TYPE: str | None DEFAULT: None

use_cache

A path to a sqlite db file for caching model responses. None if not caching.

TYPE: str | None DEFAULT: None

cache_requests

Speed up evaluation by caching the building of dataset requests (inputs). None if not caching.

TYPE: bool DEFAULT: False

rewrite_requests_cache

Rewrites all the request cache if set to True. None if not desired.

TYPE: bool DEFAULT: False

delete_requests_cache

Deletes all the request cache if set to True. None if not desired.

TYPE: bool DEFAULT: False

limit

Limit the number of examples per task (only use this for testing). If <1, the limit is a percentage of the total number of examples.

TYPE: int | float | None DEFAULT: None

samples

Dictionary indicating which examples should be tested in each task, e.g., {"mmlu_astronomy": [0, 3, 6], "mmlu_anatomy": [1, 4, 7, 10]}. Incompatible with limit.

TYPE: dict | None DEFAULT: None

bootstrap_iters

Number of iterations for bootstrap statistics, used when calculating stderrs. Set to 0 for no stderr calculations to be performed.

TYPE: int DEFAULT: 100000

check_integrity

Whether to run the relevant part of the test suite for the tasks.

TYPE: bool DEFAULT: False

write_out

If True, write out an example document and model input for checking task integrity.

TYPE: bool DEFAULT: False

log_samples

If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis.

TYPE: bool DEFAULT: True

evaluation_tracker

Tracker for logging experiment configuration and results.

TYPE: EvaluationTracker | None DEFAULT: None

system_instruction

System instruction to be applied to the prompt.

TYPE: str | None DEFAULT: None

apply_chat_template

Specifies whether to apply a chat template to the prompt. If set to True, the default chat template is applied. If set to a string, applies the specified chat template by name. Defaults to False (no chat template applied).

TYPE: bool | str DEFAULT: False

fewshot_as_multiturn

Whether to provide the fewshot examples as a multiturn conversation or a single user turn.

TYPE: bool DEFAULT: True

gen_kwargs

Arguments for model generation. Ignored for all tasks with loglikelihood output_type.

TYPE: dict | str | None DEFAULT: None

task_manager

Task manager instance to use.

TYPE: TaskManager | None DEFAULT: None

verbosity

Verbosity level for logging.

TYPE: str | None DEFAULT: None

predict_only

If True, only model outputs will be generated and returned. Metrics will not be evaluated.

TYPE: bool DEFAULT: False

random_seed

Random seed for python's random module. If set to None, the seed will not be set.

TYPE: int DEFAULT: DEFAULT_RANDOM_SEED

numpy_random_seed

Random seed for numpy. If set to None, the seed will not be set.

TYPE: int DEFAULT: DEFAULT_OTHER_SEED

torch_random_seed

Random seed for torch. If set to None, the seed will not be set.

TYPE: int DEFAULT: DEFAULT_OTHER_SEED

fewshot_random_seed

Random seed for fewshot sampler random generator. If set to None, the seed of the generator will be set to None.

TYPE: int DEFAULT: DEFAULT_OTHER_SEED

confirm_run_unsafe_code

Whether to confirm running tasks marked as unsafe (e.g., code execution tasks).

TYPE: bool DEFAULT: False

metadata

Additional metadata to be added to the task manager. Will get passed to the download function of the task.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
EvalResults | None

dict | None: Dictionary of results, or None if not on rank 0.

Source code in lm_eval/evaluator.py
@positional_deprecated
def simple_evaluate(
    model: str | LM,
    model_args: str | dict[str, str | int | float] | None = None,
    tasks: list[str | dict[str, Any] | Task] | None = None,
    num_fewshot: int | None = None,
    repeats: int | None = None,
    batch_size: int | str | None = None,
    max_batch_size: int | None = None,
    device: str | None = None,
    use_cache: str | None = None,
    cache_requests: bool = False,
    rewrite_requests_cache: bool = False,
    delete_requests_cache: bool = False,
    limit: int | float | None = None,  # noqa: PYI041
    samples: dict[str, list[int]] | None = None,
    bootstrap_iters: int = 100000,
    check_integrity: bool = False,
    write_out: bool = False,
    log_samples: bool = True,
    evaluation_tracker: EvaluationTracker | None = None,
    system_instruction: str | None = None,
    apply_chat_template: bool | str = False,
    fewshot_as_multiturn: bool = True,
    gen_kwargs: str | dict[str, str | float | int] | None = None,
    task_manager: TaskManager | None = None,
    verbosity=None,
    predict_only: bool = False,
    random_seed: int = DEFAULT_RANDOM_SEED,
    numpy_random_seed: int = DEFAULT_OTHER_SEED,
    torch_random_seed: int = DEFAULT_OTHER_SEED,
    fewshot_random_seed: int = DEFAULT_OTHER_SEED,
    confirm_run_unsafe_code: bool = False,
    metadata: dict[str, Any] | None = None,
) -> EvalResults | None:
    """High-level entry point for evaluation.

    Handles model instantiation (from a name string or pre-initialized LM object),
    task loading via TaskManager, random seed setup, and per-task config overrides
    (num_fewshot, gen_kwargs, repeats). Delegates the actual inference and metric
    computation to [evaluate][evaluate], then attaches run metadata (git hash,
    environment info, tokenizer details) to the returned results.

    Args:
        model (str | LM): Name of model or LM object. See
            lm_eval.models.__init__.py for available aliases.
        model_args: String or dict arguments for each model class, e.g.,
            "pretrained=EleutherAI/pythia-1.3B,revision=main" or {"pretrained": "EleutherAI/pythia-1.3B"}.
            Ignored if ``model`` argument is a LM object.
        tasks (list[str | dict | Task]): List of task names or Task objects.
            Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined
            and type(task).__name__ otherwise.
        num_fewshot (int): Number of examples in few-shot context.
        repeats (int | None): Number of times to repeat each request. Overrides
            task-level config. Only effective for generative tasks.
        batch_size (int | str | None): Batch size for model.
        max_batch_size (int | None): Maximal batch size to try with automatic
            batch size detection.
        device (str | None): PyTorch device (e.g. "cpu" or "cuda:0") for running
            models.
        use_cache (str | None): A path to a sqlite db file for caching model
            responses. `None` if not caching.
        cache_requests (bool): Speed up evaluation by caching the building of
            dataset requests (inputs). `None` if not caching.
        rewrite_requests_cache (bool): Rewrites all the request cache if set to
            `True`. `None` if not desired.
        delete_requests_cache (bool): Deletes all the request cache if set to
            `True`. `None` if not desired.
        limit (int | float | None): Limit the number of examples per task (only
            use this for testing). If <1, the limit is a percentage of the total
            number of examples.
        samples (dict | None): Dictionary indicating which examples should be
            tested in each task, e.g.,
            {"mmlu_astronomy": [0, 3, 6], "mmlu_anatomy": [1, 4, 7, 10]}.
            Incompatible with `limit`.
        bootstrap_iters (int): Number of iterations for bootstrap statistics, used
            when calculating stderrs. Set to 0 for no stderr calculations to be
            performed.
        check_integrity (bool): Whether to run the relevant part of the test suite
            for the tasks.
        write_out (bool): If True, write out an example document and model input
            for checking task integrity.
        log_samples (bool): If True, write out all model outputs and documents for
            per-sample measurement and post-hoc analysis.
        evaluation_tracker (EvaluationTracker | None): Tracker for logging
            experiment configuration and results.
        system_instruction (str | None): System instruction to be applied to the
            prompt.
        apply_chat_template (bool | str): Specifies whether to apply a chat
            template to the prompt. If set to True, the default chat template is
            applied. If set to a string, applies the specified chat template by
            name. Defaults to False (no chat template applied).
        fewshot_as_multiturn (bool): Whether to provide the fewshot examples as a
            multiturn conversation or a single user turn.
        gen_kwargs (dict | str | None): Arguments for model generation. Ignored
            for all tasks with loglikelihood output_type.
        task_manager (TaskManager | None): Task manager instance to use.
        verbosity (str | None): Verbosity level for logging.
        predict_only (bool): If True, only model outputs will be generated and
            returned. Metrics will not be evaluated.
        random_seed (int): Random seed for python's random module. If set to None,
            the seed will not be set.
        numpy_random_seed (int): Random seed for numpy. If set to None, the seed
            will not be set.
        torch_random_seed (int): Random seed for torch. If set to None, the seed
            will not be set.
        fewshot_random_seed (int): Random seed for fewshot sampler random generator.
            If set to None, the seed of the generator will be set to None.
        confirm_run_unsafe_code (bool): Whether to confirm running tasks marked
            as unsafe (e.g., code execution tasks).
        metadata (dict | None): Additional metadata to be added to the task
            manager. Will get passed to the download function of the task.

    Returns:
        dict | None: Dictionary of results, or None if not on rank 0.
    """
    if verbosity is not None:
        eval_logger.info("Setting verbosity through simple_evaluate is deprecated.")
    start_date = time.time()

    if limit is not None and samples is not None:
        raise ValueError(
            "Either 'limit' or 'samples' must be None, but both are not None."
        )

    _NEEDS_CHAT_TEMPLATE = ("inst", "chat")
    if (
        (
            isinstance(model_args, str)
            and any(kw in model_args.lower() for kw in _NEEDS_CHAT_TEMPLATE)
        )
        or (
            isinstance(model_args, dict)
            and any(
                any(kw in str(v).lower() for kw in _NEEDS_CHAT_TEMPLATE)
                for v in model_args.values()
            )
        )
    ) and not apply_chat_template:
        eval_logger.warning(
            wrap_text(
                """pretrained=%s appears to be an
                instruct or chat variant but chat template is not applied.
                Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).""",
            ),
            model_args.get("pretrained")
            if isinstance(model_args, dict)
            else model_args,
        )

    if delete_requests_cache:
        eval_logger.info("Deleting requests cache...")
        delete_cache()

    seed_message = []
    if random_seed is not None:
        # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
        seed_message.append(f"Setting random seed to {random_seed}")
        random.seed(random_seed)

    if numpy_random_seed is not None:
        seed_message.append(f"Setting numpy seed to {numpy_random_seed}")
        np.random.seed(numpy_random_seed)

    if torch_random_seed is not None:
        seed_message.append(f"Setting torch manual seed to {torch_random_seed}")
        set_torch_seed(torch_random_seed)

    if fewshot_random_seed is not None:
        seed_message.append(f"Setting fewshot manual seed to {fewshot_random_seed}")

    if seed_message:
        eval_logger.info(" | ".join(seed_message))

    if tasks is None:
        tasks = []
    if len(tasks) == 0:
        raise ValueError(
            "No tasks specified, or no tasks found. Please verify the task names."
        )

    if gen_kwargs:
        if isinstance(gen_kwargs, str):
            gen_kwargs = simple_parse_args_string(gen_kwargs)
        eval_logger.warning(
            "generation_kwargs: %s specified through cli, these settings will update set parameters in yaml tasks. "
            "Ensure 'do_sample=True' for non-greedy decoding!",
            gen_kwargs,
        )
        if not gen_kwargs:
            gen_kwargs = None

    if isinstance(model, str):
        if model_args is None:
            eval_logger.warning("model_args not specified. Using defaults.")
            model_args = ""

        if isinstance(model_args, dict):
            eval_logger.info(
                "Initializing %s model, with arguments: %s", model, model_args
            )
            lm = lm_eval.api.registry.get_model(model).create_from_arg_obj(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )

        else:
            eval_logger.info(
                wrap_text("Initializing %s model, with arguments: %s"),
                model,
                simple_parse_args_string(model_args),
            )
            lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )
    else:
        if not isinstance(model, lm_eval.api.model.LM):
            raise TypeError(
                f"The value of `model` passed to simple_evaluate() was of type {type(model)}, but is required to be a subclass of lm_eval.api.model.LM . "
                f"This may be because you are passing an initialized Hugging Face PreTrainedModel without having wrapped it in "
                f"`lm_eval.models.huggingface.HFLM(pretrained=my_model)` first."
            )
        eval_logger.info("Using pre-initialized model")
        lm = model

    if use_cache is not None:
        eval_logger.info(
            "Using cache at %s", use_cache + "_rank" + str(lm.rank) + ".db"
        )
        lm = lm_eval.api.model.CachingLM(
            lm,
            use_cache
            # each rank receives a different cache db.
            # necessary to avoid multiple writes to cache at once
            + "_rank"
            + str(lm.rank)
            + ".db",
        )

    if task_manager is None:
        metadata = (
            simple_parse_args_string(model_args)
            if isinstance(model_args, str)
            else model_args
            if isinstance(model_args, dict)
            else {}
        ) | (metadata or {})
        task_manager = TaskManager(metadata=metadata)

    # Load tasks - returns {"tasks":.., "groups":..}
    loaded = task_manager.load(tasks)

    # Log selected tasks with hierarchy
    _log_selected_tasks_(loaded["tasks"], loaded["groups"])

    # Apply config overrides to tasks
    for task_name, task_obj in loaded["tasks"].items():
        if task_obj.get_config("output_type") == "generate_until":
            if gen_kwargs is not None:
                task_obj.set_config(
                    key="generation_kwargs", value=gen_kwargs, update=True
                )
            eval_logger.info(
                "%s: Using gen_kwargs: %s",
                task_obj.config.task,
                task_obj.config.generation_kwargs,
            )

        if predict_only:
            eval_logger.info(
                "Processing %s in output-only mode. Metrics will not be calculated!",
                task_name,
            )
            task_obj.override_metric(metric_name="bypass")

        # override tasks' fewshot values to the provided num_fewshot arg value
        # except if tasks have it set to 0 manually in their configs--then we should never overwrite that
        if num_fewshot is not None:
            task_obj.set_num_fewshot(num_fewshot)
        task_obj.set_fewshot_seed(seed=fewshot_random_seed)

        if repeats is not None:
            # only generation tasks support repeats > 1, otherwise no-op
            task_obj.set_repeats(repeats)

    if check_integrity:
        run_task_tests(task_list=tasks)

    if evaluation_tracker is not None:
        evaluation_tracker.general_config_tracker.log_experiment_args(
            model_source=model if isinstance(model, str) else "CUSTOM",
            model_args=model_args or "",
            system_instruction=system_instruction,
            chat_template=lm.chat_template(apply_chat_template)
            if apply_chat_template
            else None,
            fewshot_as_multiturn=fewshot_as_multiturn,
        )

    results: EvalResults = evaluate(
        lm=lm,
        task_dict=loaded,
        limit=limit,
        samples=samples,
        cache_requests=cache_requests,
        rewrite_requests_cache=rewrite_requests_cache,
        bootstrap_iters=bootstrap_iters,
        write_out=write_out,
        log_samples=True if predict_only else log_samples,
        system_instruction=system_instruction,
        apply_chat_template=apply_chat_template,
        fewshot_as_multiturn=fewshot_as_multiturn,
        verbosity=verbosity,
        confirm_run_unsafe_code=confirm_run_unsafe_code,
    )
    if verbosity is not None:
        setup_logging(verbosity=verbosity)

    if results:
        if isinstance(model, str):
            model_name = model
        elif hasattr(model, "config") and hasattr(model.config, "_name_or_path"):
            model_name = model.config._name_or_path
        else:
            model_name = type(model).__name__

        # add info about the model and few shot config
        results["config"] = {
            "model": model_name,
            "model_args": model_args,
        }
        # add more detailed model info if available
        if hasattr(lm, "get_model_info"):
            results["config"].update(lm.get_model_info())  # type: ignore
        # add info about execution
        results["config"].update(
            {
                "batch_size": batch_size,
                "batch_sizes": (
                    list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else []  # type: ignore
                ),
                "device": device,
                "use_cache": use_cache,
                "limit": limit,
                "bootstrap_iters": bootstrap_iters,
                "gen_kwargs": gen_kwargs,
                "random_seed": random_seed,
                "numpy_seed": numpy_random_seed,
                "torch_seed": torch_random_seed,
                "fewshot_seed": fewshot_random_seed,
            }
        )
        results["git_hash"] = get_git_commit_hash()
        results["date"] = start_date
        add_env_info(results)  # additional environment info to results
        add_tokenizer_info(results, lm)  # additional info about tokenizer
        return results
    else:
        return None

evaluate

evaluate(lm: LM, task_dict: TaskDict | _NestedDict, limit: int | None = None, samples: dict[str, list[int]] | None = None, cache_requests: bool = False, rewrite_requests_cache: bool = False, bootstrap_iters: int | None = 100000, write_out: bool = False, log_samples: bool = True, system_instruction: str | None = None, apply_chat_template: bool | str = False, fewshot_as_multiturn: bool = False, verbosity: str = 'INFO', confirm_run_unsafe_code: bool = False) -> EvalResults | None

Run inference and compute metrics for a pre-initialized model and task set.

This is the lower-level evaluation loop. It builds per-task request instances, dispatches them to the model by request type (loglikelihood, generate_until, etc.), collects responses, post-processes outputs via each task's scorers, and aggregates metrics across samples. Handles multi-rank (FSDP/DDP) padding and result gathering.

Prefer simple_evaluate unless you need direct control over model initialization and task loading.

PARAMETER DESCRIPTION
lm

Language Model.

TYPE: LM

task_dict

Dictionary returned by TaskManager.load() containing 'tasks', 'groups', and 'group_map' entries.

TYPE: TaskDict

limit

Limit the number of examples per task (only use this for testing).

TYPE: int | None DEFAULT: None

samples

Dictionary indicating which examples should be tested in each task, e.g., {"mmlu_astronomy": [0, 3, 6], "mmlu_anatomy": [1, 4, 7, 10]}.

TYPE: dict | None DEFAULT: None

cache_requests

Speed up evaluation by caching the building of dataset requests.

TYPE: bool DEFAULT: False

rewrite_requests_cache

Rewrites all the request cache if set to True.

TYPE: bool DEFAULT: False

bootstrap_iters

Number of iterations for bootstrap statistics, used when calculating stderr. Set to 0 for skipping all stderr calculations.

TYPE: int | None DEFAULT: 100000

write_out

If True, write out an example document and model input for checking task integrity.

TYPE: bool DEFAULT: False

log_samples

If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis.

TYPE: bool DEFAULT: True

system_instruction

System instruction to be applied to the prompt.

TYPE: str | None DEFAULT: None

apply_chat_template

Specifies whether to apply a chat template to the prompt. If set to True, the default chat template is applied. If set to a string, applies the specified chat template by name. Defaults to False (no chat template applied).

TYPE: bool | str DEFAULT: False

fewshot_as_multiturn

Whether to provide the fewshot examples as a multiturn conversation or a single user turn.

TYPE: bool DEFAULT: False

verbosity

Verbosity level for logging. (no-op, deprecated)

TYPE: str DEFAULT: 'INFO'

confirm_run_unsafe_code

Whether to confirm running tasks marked as unsafe (e.g code execution tasks).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
EvalResults | None

dict | None: Dictionary of results, or None if not on rank 0.

Source code in lm_eval/evaluator.py
@positional_deprecated
def evaluate(
    lm: LM,
    task_dict: TaskDict | _NestedDict,
    limit: int | None = None,
    samples: dict[str, list[int]] | None = None,
    cache_requests: bool = False,
    rewrite_requests_cache: bool = False,
    bootstrap_iters: int | None = 100000,
    write_out: bool = False,
    log_samples: bool = True,
    system_instruction: str | None = None,
    apply_chat_template: bool | str = False,
    fewshot_as_multiturn: bool = False,
    verbosity: str = "INFO",
    confirm_run_unsafe_code: bool = False,
) -> EvalResults | None:
    """Run inference and compute metrics for a pre-initialized model and task set.

    This is the lower-level evaluation loop. It builds per-task request instances,
    dispatches them to the model by request type (loglikelihood, generate_until, etc.),
    collects responses, post-processes outputs via each task's scorers, and aggregates
    metrics across samples. Handles multi-rank (FSDP/DDP) padding and result gathering.

    Prefer [simple_evaluate][simple_evaluate] unless you need direct control over model
    initialization and task loading.

    Args:
        lm (LM): Language Model.
        task_dict (TaskDict): Dictionary returned by TaskManager.load() containing
            'tasks', 'groups', and 'group_map' entries.
        limit (int | None): Limit the number of examples per task (only use this
            for testing).
        samples (dict | None): Dictionary indicating which examples should be
            tested in each task, e.g.,
            {"mmlu_astronomy": [0, 3, 6], "mmlu_anatomy": [1, 4, 7, 10]}.
        cache_requests (bool): Speed up evaluation by caching the building of
            dataset requests.
        rewrite_requests_cache (bool): Rewrites all the request cache if set to
            `True`.
        bootstrap_iters (int | None): Number of iterations for bootstrap
            statistics, used when calculating stderr. Set to 0 for skipping all
            stderr calculations.
        write_out (bool): If True, write out an example document and model input
            for checking task integrity.
        log_samples (bool): If True, write out all model outputs and documents
            for per-sample measurement and post-hoc analysis.
        system_instruction (str | None): System instruction to be applied to the
            prompt.
        apply_chat_template (bool | str): Specifies whether to apply a chat
            template to the prompt. If set to True, the default chat template is
            applied. If set to a string, applies the specified chat template by
            name. Defaults to False (no chat template applied).
        fewshot_as_multiturn (bool): Whether to provide the fewshot examples as a
            multiturn conversation or a single user turn.
        verbosity (str): Verbosity level for logging. (no-op, deprecated)
        confirm_run_unsafe_code (bool): Whether to confirm running tasks marked
            as unsafe (e.g code execution tasks).

    Returns:
        dict | None: Dictionary of results, or None if not on rank 0.
    """
    if limit is not None and samples is not None:
        raise ValueError(
            "Either 'limit' or 'samples' must be None, but both are not None."
        )
    if samples is not None:
        eval_logger.info("Evaluating examples for tasks %s", list(samples.keys()))
    # tracks all Instances/requests a model must generate output on.
    requests = defaultdict(list)
    # stores the amount to pad out reqs per req type so that
    # the number of fwd passes per distributed rank is equal
    padding_requests = defaultdict(int)

    # Initialize groups and tasks
    # handle back compact. Assume if "tasks" key is not present, then using old nested.
    if "tasks" not in task_dict:
        groups, eval_tasks = _handle_back_comp(cast("_NestedDict", task_dict))
    else:
        task_dict = cast("TaskDict", task_dict)
        groups, eval_tasks = task_dict.get("groups", {}), task_dict.get("tasks", {})

    # Initialize accumulators for per-sample metrics and logged samples
    eval_results_acc: dict[str, _ResultAcc] = {
        task_name: {
            "task": task_obj,
            "logged_samples": [],
        }
        for task_name, task_obj in eval_tasks.items()
    }
    if not log_samples and _has_bypass_metric(eval_tasks):
        raise ValueError("log_samples must be True for 'bypass' metric-only tasks")

    # validation checks:
    # 1.are we running code that is marked as unsafe.
    # 2.are we running multimodal task <-> non-multimodal model class, or vice versa.
    incompatible_tasks = []
    for task_name, task in eval_tasks.items():
        if getattr(task, "UNSAFE_CODE", False) and not confirm_run_unsafe_code:
            raise ValueError(
                f"Attempted to run task: {task_name} which is marked as unsafe. Set confirm_run_unsafe_code=True to run this task."
            )

        if getattr(task, "MULTIMODAL", False) and not getattr(lm, "MULTIMODAL", False):
            incompatible_tasks.append(task_name)
    if len(incompatible_tasks) > 0 and not getattr(lm, "MULTIMODAL", False):
        raise ValueError(
            f"Attempted to run tasks: {incompatible_tasks} which require multimodal input, but the selected model type does not currently implement this. Multimodal support is currently restricted to the ['hf-multimodal', 'vllm-vlm'] model type."
        )
    # end validation check

    # Cache the limit arg.
    limit_arg = limit
    for task_name, task in tqdm(
        eval_tasks.items(),
        delay=5,
        desc="Building contexts on all ranks",
        disable=lm.rank != 0,
    ):
        limit = _get_sample_size(task, limit_arg)
        task.build_all_requests(
            limit=limit,
            samples=samples.get(task_name, None) if samples is not None else samples,
            rank=lm.rank,
            world_size=lm.world_size,
            cache_requests=cache_requests,
            rewrite_requests_cache=rewrite_requests_cache,
            system_instruction=system_instruction,
            apply_chat_template=bool(apply_chat_template),
            fewshot_as_multiturn=fewshot_as_multiturn,
            chat_template=getattr(lm, "apply_chat_template", None)
            if apply_chat_template
            else None,
            tokenizer_name=getattr(lm, "tokenizer_name", "")
            if apply_chat_template
            else "",
        )
        eval_logger.debug(
            "Task: %s; number of requests on this rank: %d",
            task_name,
            len(task.instances),
            extra={"all_ranks": True},
        )
        if write_out:
            _print_writeout(task)
        # aggregate Instances by LM method requested to get output.
        for instance in task.instances:
            reqtype = instance.request_type
            requests[reqtype].append(instance)

        if lm.world_size > 1:
            import torch

            instances_rnk = torch.tensor(
                len(task._instances) if task._instances else 0, device=lm.device
            )
            gathered_item = lm.all_gather(instances_rnk).cpu().detach().numpy().tolist()
            # "multiple_choice" task types dispatch (several) "loglikelihood" request types
            reqtype = (
                "loglikelihood"
                if task.OUTPUT_TYPE == "multiple_choice"
                else task.OUTPUT_TYPE
            )
            # compute the number of pseudo-batches to pad with (FSDP/DDP require even batches among ranks)
            numpad = max(gathered_item) - gathered_item[lm.rank]
            # todo: may not account for padding in cases like SquadV2 which has multiple req types
            padding_requests[reqtype] += numpad

    ### Run LM on inputs, get all outputs ###
    # execute each type of request
    for reqtype, reqs in requests.items():
        eval_logger.info("Running %s requests", reqtype)
        # create `K` copies of each request `req` based off `K = req.repeats`
        cloned_reqs = []
        for req in reqs:
            cloned_reqs.extend([req] * req.repeats)

        if (lm.world_size > 1) and (padding_requests[reqtype] > 0):
            for _ in range(padding_requests[reqtype]):
                cloned_reqs.extend([req] * req.repeats)

        # run requests through the model
        resps = getattr(lm, reqtype)(cloned_reqs)

        # put responses from the model into a list of length K for each request.
        for x, req in zip(resps, cloned_reqs, strict=True):
            req.resps.append(x)

        if lm.world_size > 1:
            lm.barrier()

    RANK = lm.rank
    WORLD_SIZE = lm.world_size
    ## can delete model from this point.
    ### Postprocess outputs ###
    for task_name, acc in eval_results_acc.items():
        task = acc["task"]
        task.process_instances()  # populates Scorer internals

        if log_samples:
            acc["logged_samples"] = _build_logged_samples(task, samples, task_name)

    if WORLD_SIZE > 1:
        if log_samples:
            rank_samples = {
                task_name: acc["logged_samples"]
                for task_name, acc in eval_results_acc.items()
            }
            all_samples = torch_gather_object(
                rank_samples, rank=RANK, world_size=WORLD_SIZE, dst=0
            )
            if RANK == 0:
                for task_name, acc in eval_results_acc.items():
                    acc["logged_samples"] = [
                        sample
                        for rank_data in all_samples  # type: ignore
                        for sample in rank_data[task_name]
                    ]

        rank_metrics = {
            task_name: acc["task"]._export_reduced()
            for task_name, acc in eval_results_acc.items()
        }
        all_metrics = torch_gather_object(
            rank_metrics, rank=RANK, world_size=WORLD_SIZE, dst=0
        )
        if RANK == 0:
            for task_name, acc in eval_results_acc.items():
                merged = _merge_rank_metrics(all_metrics, task_name)  # type: ignore
                acc["task"]._import_reduced(merged)

    if RANK == 0:
        res = _process_results(eval_results_acc, groups, bootstrap_iters)

        samples = None
        if log_samples:
            samples = res.samples
            if LMEVAL_HASHMM and hasattr(lm, "MULTIMODAL"):
                samples = hash_dict_images(samples)

        return res._to_eval_results(samples=samples)
    else:
        return None