
LM Base Class

Abstract base class for language models. Subclass this to add a new model backend to the evaluation harness.


LM

LM()

Bases: ABC



Abstract base class for language models.

Subclasses take text (strings) as input and yield strings as output. Inputs and outputs should be tokenization-agnostic.

Source code in lm_eval/api/model.py
def __init__(self) -> None:
    # set rank and world size to a single process, by default.
    self._rank = 0
    self._world_size = 1
    self._device = None
    self.cache_hook: CacheHook = CacheHook(None)

Attributes

cache_hook instance-attribute

cache_hook: CacheHook = CacheHook(None)

device property

device

rank property

rank: int

Index of this process. Default: 0 (single-process).

world_size property

world_size: int

Total number of processes. Default: 1 (single-process).

tokenizer_name property

tokenizer_name: str

Name of the tokenizer or chat template, used to fingerprint request caches.

Required for subclasses that support chat templating.

Functions

loglikelihood abstractmethod

loglikelihood(requests: Sequence[LLInstance]) -> list[LLOutput]

Compute log-likelihood of generating a continuation from a context.

Downstream tasks should prefer this over other LM calls whenever possible.

PARAMETER DESCRIPTION
requests

List of Instance objects. Each Instance.args is a (context, continuation) tuple, where context is the conditioning text (implementations must handle the empty string) and continuation is the text to score. Word-boundary spaces belong in the continuation (e.g. context="hello", continuation=" world").

TYPE: Sequence[LLInstance]

RETURNS DESCRIPTION
list[LLOutput]

A list of (logprob, is_greedy) tuples: the summed log-probability of the continuation, and whether greedy decoding would produce it.

Source code in lm_eval/api/model.py
@abc.abstractmethod
def loglikelihood(self, requests: Sequence[LLInstance]) -> list[LLOutput]:
    """Compute log-likelihood of generating a continuation from a context.

    Downstream tasks should prefer this over other LM calls whenever possible.

    Args:
        requests: List of ``Instance`` objects. Each ``Instance.args`` is a ``(context, continuation)`` tuple.
            *context* — the conditioning text (implementations must handle empty string).
            *continuation* — the text to score. Word-boundary spaces belong in the
            continuation (e.g. ``context="hello"  continuation=" world"``).

    Returns:
        A list of ``(logprob, is_greedy)`` tuples — (summed log-probability of
        the continuation, whether it would be produced by greedy decoding).
    """
    ...
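
As a rough illustration of the return contract, the sketch below scores continuations with a toy character-level model. `ToyInstance`, `ToyRepeatLM`, and `VOCAB` are hypothetical names that only mimic the shape of `Instance.args`; nothing here imports the harness, and a real backend would subclass `LM` and score tokens with an actual model.

```python
import math
from dataclasses import dataclass

VOCAB = "abcdefghijklmnopqrstuvwxyz "  # toy character vocabulary (assumption)


@dataclass
class ToyInstance:
    """Hypothetical stand-in for the harness's Instance; only .args is modelled."""

    args: tuple


class ToyRepeatLM:
    """Toy scorer: the next character repeats the previous one with probability 0.5."""

    def _step(self, prev, nxt):
        """Return (log P(nxt | prev), whether nxt is the greedy choice)."""
        if prev:
            p = 0.5 if nxt == prev else 0.5 / (len(VOCAB) - 1)
            return math.log(p), nxt == prev
        # Empty context: uniform over VOCAB, so no character counts as greedy here.
        return math.log(1.0 / len(VOCAB)), False

    def loglikelihood(self, requests):
        results = []
        for context, continuation in (req.args for req in requests):
            prev = context[-1] if context else ""  # empty context must not crash
            total, greedy = 0.0, True
            for ch in continuation:
                logprob, is_top = self._step(prev, ch)
                total += logprob  # summed log-probability of the continuation
                greedy = greedy and is_top
                prev = ch
            results.append((total, greedy))
        return results
```

Calling `ToyRepeatLM().loglikelihood([ToyInstance(("hello", " world"))])` returns one `(logprob, is_greedy)` tuple per request, in request order. Note that the leading space travels with the continuation, as the parameter description asks.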

loglikelihood_rolling abstractmethod

loglikelihood_rolling(requests: Sequence[LLInstance]) -> list[LLOutput]

Compute full log-likelihood of a string, with no truncation, for perplexity computation.

  • Uses the full max context length of the model.
  • Inputs exceeding that length are chunked, up to the max context length.
  • IMPORTANT: Each document's loglikelihood/perplexity is computed separately, unlike other implementations which may simply concatenate multiple documents together.
  • IMPORTANT: We maximize the amount of context for each prediction. Specifically, for inputs that we break into multiple chunks, the last input will still be a full-sized context.

Example
Input tokens: [ 0 1 2 3 4 5 6 7 8 9 ]
Prefix: BOS/EOS
Max context length: 4
Resulting input/prediction pairs:

    INPUT:  BOS   0   1   2
    PRED:     0   1   2   3

    INPUT:    3   4   5   6
    PRED:     4   5   6   7

    INPUT:    5   6   7   8
    PRED:             8   9

Observe that:
  1. Each token is predicted exactly once
  2. For the last pair, we provide the full context, but only score the last two tokens

PARAMETER DESCRIPTION
requests

List of Instance objects. Each Instance.args is a (Literal[""], string) tuple containing the text whose overall log-likelihood is computed. Context is always an empty string to keep the interface consistent with loglikelihood.

TYPE: Sequence[LLInstance]

RETURNS DESCRIPTION
list[LLOutput]

A list of (logprob, Literal[False]) tuples: the log-probability of the string conditioned on the BOS/EOS token (or prefix_token_id). The second element is always False since this method does not compute greedy likelihood.

Source code in lm_eval/api/model.py
@abc.abstractmethod
def loglikelihood_rolling(self, requests: Sequence[LLInstance]) -> list[LLOutput]:
    """Compute full log-likelihood of a string, with no truncation, for perplexity computation.

    - Uses the full max context length of the model.
    - Inputs exceeding that length are chunked, up to the max context length.
    - IMPORTANT: Each document's loglikelihood/perplexity is computed *separately*, unlike other implementations
      which may simply concatenate multiple documents together.
    - IMPORTANT: We maximize the amount of context for each prediction. Specifically, for inputs that we break into
      multiple chunks, the last input will still be a full-sized context.

    Example:
        ```text
        Input tokens: [ 0 1 2 3 4 5 6 7 8 9 ]
        Prefix: BOS/EOS
        Max context length: 4
        Resulting input/prediction pairs:

            INPUT:  BOS   0   1   2
            PRED:     0   1   2   3

            INPUT:    3   4   5   6
            PRED:     4   5   6   7

            INPUT:    5   6   7   8
            PRED:             8   9

        Observe that:
          1. Each token is predicted exactly once
          2. For the last pair, we provide the full context, but only score the last two tokens
        ```

    Args:
        requests: List of ``Instance`` objects. Each ``Instance.args`` is a ``(Literal[""], string)`` tuple containing
            the text whose overall log-likelihood is computed.
            Context is always an empty string to keep the interface consistent with ``loglikelihood``.

    Returns:
        A list of ``(logprob, Literal[False])`` tuples — the log-probability of the string
        conditioned on the BOS/EOS token (or ``prefix_token_id``).
        The second element is always False since this method does not compute greedy likelihood.
    """
    ...
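
The windowing in the example above can be reproduced with a short sketch. `rolling_windows` is a hypothetical helper written for this page, not the harness's own utility; it assumes the prediction targets are split into consecutive chunks of at most `max_len` tokens and that each chunk is given as much left context as fits.

```python
def rolling_windows(tokens, prefix_token, max_len):
    """Return (input_tokens, predicted_tokens) pairs for one document.

    Hypothetical helper: each token is predicted exactly once, and every
    window carries up to max_len tokens of context.
    """
    # Position i of `full` is the input that predicts tokens[i].
    full = [prefix_token] + list(tokens)
    pairs = []
    for start in range(0, len(tokens), max_len):
        end = min(start + max_len, len(tokens))
        # Take up to max_len inputs ending at the one that predicts tokens[end - 1].
        inputs = full[max(0, end - max_len):end]
        # Only the new tokens are scored; earlier ones serve as context.
        preds = list(tokens[start:end])
        pairs.append((inputs, preds))
    return pairs


for inputs, preds in rolling_windows(list(range(10)), "BOS", 4):
    print("INPUT:", inputs, "PRED:", preds)
```

The last window still has four inputs (`5 6 7 8`) but scores only `8 9`, matching the third pair in the example.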

generate_until abstractmethod

generate_until(requests: Sequence[GenInstance]) -> list[str]

Generate greedily until a stopping sequence.

PARAMETER DESCRIPTION
requests

List of Instance objects. Each Instance.args is a (context, gen_kwargs) tuple. context (str) is the conditioning text. gen_kwargs (dict) holds generation keyword arguments (e.g. temperature, until).

TYPE: Sequence[GenInstance]

RETURNS DESCRIPTION
list[str]

A list of generated continuation strings, one per request.

Source code in lm_eval/api/model.py
@abc.abstractmethod
def generate_until(self, requests: Sequence[GenInstance]) -> list[str]:
    """Generate greedily until a stopping sequence.

    Args:
        requests: List of ``Instance`` objects. Each ``Instance.args`` is a ``(context, gen_kwargs)`` tuple.
            *context* (str): the conditioning text.
            *gen_kwargs* (dict): generation keyword arguments (e.g. ``temperature``, ``until``).

    Returns:
        A list of generated continuation strings, one per request.
    """
    ...
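
A minimal sketch of how an implementation might honour the `until` entry of `gen_kwargs`. `truncate_at_stop` and `toy_generate_until` are hypothetical names, requests are plain `(context, gen_kwargs)` tuples standing in for `Instance.args`, and `generate_fn` is an assumed callable that wraps whatever the backend uses to produce raw text.

```python
def truncate_at_stop(text, until):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in until:
        if not stop:
            continue  # ignore empty stop strings
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def toy_generate_until(requests, generate_fn):
    """Return one generated string per (context, gen_kwargs) request."""
    outputs = []
    for context, gen_kwargs in requests:
        kwargs = dict(gen_kwargs)  # copy so the caller's dict is not mutated
        until = kwargs.pop("until", [])
        if isinstance(until, str):
            until = [until]
        raw = generate_fn(context, **kwargs)  # remaining kwargs go to the backend
        outputs.append(truncate_at_stop(raw, until))
    return outputs
```

The output list keeps the order of the requests, so each string lines up with the request that produced it.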

apply_chat_template

apply_chat_template(chat_history: Sequence[dict[str, str]], add_generation_prompt=True) -> str | list[dict[str, str]]

Transform few-shot chat history into a string prompt for the model.

PARAMETER DESCRIPTION
chat_history

Messages as [{"role": ..., "content": ...}, ...] dicts.

TYPE: Sequence[dict[str, str]]

add_generation_prompt

Whether to append an assistant generation prefix (e.g. <|assistant|>). Set to False when prefilling an assistant message.

DEFAULT: True

RETURNS DESCRIPTION
str | list[dict[str, str]]

The formatted prompt string, or a list of message dicts if the model handles templating internally.

Source code in lm_eval/api/model.py
def apply_chat_template(
    self, chat_history: Sequence[dict[str, str]], add_generation_prompt=True
) -> str | list[dict[str, str]]:
    """Transform few-shot chat history into a string prompt for the model.

    Args:
        chat_history: Messages as ``[{"role": ..., "content": ...}, ...]`` dicts.
        add_generation_prompt: Whether to append an assistant generation prefix
            (e.g. ``<|assistant|>``). Set to False when prefilling an assistant message.

    Returns:
        The formatted prompt string, or a list of message dicts if the model handles templating internally.
    """
    raise NotImplementedError(
        "To use this model with chat templates, please implement the 'apply_chat_template' method for your model type."
    )
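
For orientation, here is a sketch of what a string-returning override might do. The `<|role|>` tag layout is an assumption chosen to match the `<|assistant|>` example in the docstring; real models define their own formats, and `toy_apply_chat_template` is a hypothetical free function rather than a method of any backend.

```python
def toy_apply_chat_template(chat_history, add_generation_prompt=True):
    """Render messages as '<|role|>' blocks (assumed toy format)."""
    prompt = "".join(
        f"<|{message['role']}|>\n{message['content']}\n" for message in chat_history
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "<|assistant|>\n"
    return prompt


history = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(toy_apply_chat_template(history))
```

With `add_generation_prompt=False`, no assistant prefix is appended, which is the behaviour wanted when the history already ends in a partially written assistant message.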

create_from_arg_string classmethod

create_from_arg_string(arg_string: str, additional_config: dict | None = None) -> Self

Create an LM instance from a comma-separated argument string.

PARAMETER DESCRIPTION
arg_string

Arguments as "key1=value1,key2=value2".

TYPE: str

additional_config

Extra configuration merged into the parsed args.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
Self

An instance of this LM subclass.

Source code in lm_eval/api/model.py
@classmethod
def create_from_arg_string(
    cls, arg_string: str, additional_config: dict | None = None
) -> Self:
    """Create an LM instance from a comma-separated argument string.

    Args:
        arg_string: Arguments as ``"key1=value1,key2=value2"``.
        additional_config: Extra configuration merged into the parsed args.

    Returns:
        An instance of this LM subclass.
    """
    additional_config = {} if additional_config is None else additional_config
    args = utils.simple_parse_args_string(arg_string)
    args2 = {k: v for k, v in additional_config.items() if v is not None}
    return cls(**args, **args2)
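
The merge behaviour can be tried without the harness. In this sketch `parse_arg_string` is a simplified, hypothetical stand-in for `utils.simple_parse_args_string` (it keeps every value as a string, which the real helper may not do), and `ToyModel` is an invented class with an invented constructor signature.

```python
def parse_arg_string(arg_string):
    """Parse 'key1=value1,key2=value2' into a dict of strings (simplified stand-in)."""
    arg_string = arg_string.strip()
    if not arg_string:
        return {}
    return dict(pair.split("=", 1) for pair in arg_string.split(",") if pair)


class ToyModel:
    """Hypothetical model class; the constructor arguments are illustrative only."""

    def __init__(self, pretrained, dtype="float32", batch_size=1):
        self.pretrained = pretrained
        self.dtype = dtype
        self.batch_size = batch_size

    @classmethod
    def create_from_arg_string(cls, arg_string, additional_config=None):
        additional_config = {} if additional_config is None else additional_config
        args = parse_arg_string(arg_string)
        # None-valued extras are dropped, so they never reach the constructor.
        extra = {k: v for k, v in additional_config.items() if v is not None}
        return cls(**args, **extra)


model = ToyModel.create_from_arg_string(
    "pretrained=some-model,dtype=float16", {"batch_size": 8, "device": None}
)
```

Because both dicts are unpacked into one call, a key that appears in the argument string and also in `additional_config` raises a `TypeError` instead of one value silently overriding the other.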

create_from_arg_obj classmethod

create_from_arg_obj(arg_dict: dict[str, Any], additional_config: dict[str, Any] | None = None) -> Self

Create an LM instance from a dictionary of arguments.

PARAMETER DESCRIPTION
arg_dict

Keyword arguments forwarded to the constructor.

TYPE: dict[str, Any]

additional_config

Extra configuration merged into arg_dict.

TYPE: dict[str, Any] | None DEFAULT: None

RETURNS DESCRIPTION
Self

An instance of this LM subclass.

Source code in lm_eval/api/model.py
@classmethod
def create_from_arg_obj(
    cls,
    arg_dict: dict[str, Any],
    additional_config: dict[str, Any] | None = None,
) -> Self:
    """Create an LM instance from a dictionary of arguments.

    Args:
        arg_dict: Keyword arguments forwarded to the constructor.
        additional_config: Extra configuration merged into ``arg_dict``.

    Returns:
        An instance of this LM subclass.
    """
    additional_config = (
        {}
        if additional_config is None
        else {k: v for k, v in additional_config.items() if v is not None}
    )

    return cls(**arg_dict, **additional_config)

all_gather

all_gather(tensor)

All-gather a tensor across ranks.

Returns concatenated tensor from all ranks. Default: no-op.

Source code in lm_eval/api/model.py
def all_gather(self, tensor):
    """All-gather a tensor across ranks.

    Returns concatenated tensor from all ranks. Default: no-op.
    """
    return tensor

barrier

barrier() -> None

Synchronization barrier. Default: no-op.

Source code in lm_eval/api/model.py
def barrier(self) -> None:
    """Synchronization barrier. Default: no-op."""
    return

chat_template

chat_template(chat_template: bool | str = False) -> str | None

Return the chat template string for this model.

Override in subclasses to define a specific format. Returns empty string by default (no chat template).

Source code in lm_eval/api/model.py
def chat_template(self, chat_template: bool | str = False) -> str | None:
    """Return the chat template string for this model.

    Override in subclasses to define a specific format. Returns empty string
    by default (no chat template).
    """
    return ""

set_cache_hook

set_cache_hook(cache_hook: CacheHook) -> None

Source code in lm_eval/api/model.py
def set_cache_hook(self, cache_hook: CacheHook) -> None:
    self.cache_hook = cache_hook

TemplateLM provides common tokenization and chat template logic. Most built-in backends extend this rather than LM directly.

TemplateLM

TemplateLM()

Bases: LM



LM subclass that provides shared tokenization and scoring boilerplate.

Handles context/continuation encoding, empty-context logic, and delegates token-level scoring to _loglikelihood_tokens.

Source code in lm_eval/api/model.py
def __init__(self) -> None:
    # set rank and world size to a single process, by default.
    self._rank = 0
    self._world_size = 1
    self._device = None
    self.cache_hook: CacheHook = CacheHook(None)

Attributes

tokenizer class-attribute instance-attribute

tokenizer = None

backend class-attribute instance-attribute

backend = 'causal'

eot_token_id abstractmethod property

eot_token_id: int

prefix_token_id property

prefix_token_id

Functions

tok_encode abstractmethod

tok_encode(string: str, add_special_tokens: bool | None = None, **kwargs) -> list[int]

Tokenize a string and return a list of token IDs.

Must handle strings that already contain the BOS token when add_special_tokens is None. Otherwise, uses the flag as given.

Source code in lm_eval/api/model.py
@abc.abstractmethod
def tok_encode(
    self, string: str, add_special_tokens: bool | None = None, **kwargs
) -> list[int]:
    """Tokenize a string and return a list of token IDs.

    Must handle strings that already contain the BOS token when
    ``add_special_tokens`` is None. Otherwise, uses the flag as given.
    """
    ...
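
One possible reading of this contract, sketched with a toy character tokenizer: when `add_special_tokens` is `None`, a BOS token is prepended only if the string does not already start with one, and an explicit `True`/`False` is applied as given. The `<s>` marker, the id scheme, and `toy_tok_encode` are all assumptions made for illustration.

```python
BOS, BOS_ID = "<s>", 0  # hypothetical BOS text and id


def toy_tok_encode(string, add_special_tokens=None):
    """Encode one id per character, with BOS handling as described above."""
    has_bos = string.startswith(BOS)
    body = string[len(BOS):] if has_bos else string
    # Shift code points by one so that id 0 stays reserved for BOS.
    ids = [ord(ch) + 1 for ch in body]
    if has_bos:
        ids = [BOS_ID] + ids  # the literal BOS text maps to the BOS id
    if add_special_tokens is None:
        # Auto mode: avoid a doubled BOS when the text already carries one.
        add_special_tokens = not has_bos
    if add_special_tokens:
        ids = [BOS_ID] + ids
    return ids
```

Under this reading, `toy_tok_encode("<s>hi")` and `toy_tok_encode("hi")` produce the same ids, while passing `add_special_tokens=True` on a string that already has BOS yields two BOS ids because the flag is honoured literally.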

loglikelihood

loglikelihood(requests: Sequence[LLInstance], disable_tqdm: bool = False) -> list[LLOutput]

Compute log-likelihood of continuations given contexts.

Tokenizes each (context, continuation) pair and delegates to _loglikelihood_tokens. Empty contexts use prefix_token_id (typically BOS/EOS) as the conditioning token.

PARAMETER DESCRIPTION
requests

List of Instance objects. Each Instance.args is a (context, continuation) tuple.

TYPE: Sequence[LLInstance]

disable_tqdm

Whether to suppress the progress bar.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
list[LLOutput]

A list of (logprob, is_greedy) tuples, one per request.

Source code in lm_eval/api/model.py
def loglikelihood(
    self, requests: Sequence[LLInstance], disable_tqdm: bool = False
) -> list[LLOutput]:
    """Compute log-likelihood of continuations given contexts.

    Tokenizes each ``(context, continuation)`` pair and delegates to
    ``_loglikelihood_tokens``. Empty contexts use ``prefix_token_id``
    (typically BOS/EOS) as the conditioning token.

    Args:
        requests: List of ``Instance`` objects. Each ``Instance.args`` is a ``(context, continuation)`` tuple.
        disable_tqdm: Whether to suppress the progress bar.

    Returns:
        A list of ``(logprob, is_greedy)`` tuples, one per request.
    """
    new_reqs = []
    for context, continuation in [req.args for req in requests]:
        if context == "":
            continuation_enc = self.tok_encode(
                continuation, add_special_tokens=False
            )
            # BOS or EOS as context: when context is empty, (context + continuation) becomes (BOS + continuation)
            context_enc, continuation_enc = (
                ([self.prefix_token_id], continuation_enc)
                if self.prefix_token_id != continuation_enc[0]
                else (continuation_enc[:1], continuation_enc[1:])
            )
            # BOS or EOS as context
        else:
            context_enc, continuation_enc = self._encode_pair(context, continuation)

        new_reqs.append(((context, continuation), context_enc, continuation_enc))

    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
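
The empty-context branch above can be isolated as a small function. `split_empty_context` is a hypothetical name; the logic mirrors the conditional expression in the source and, like it, assumes the encoded continuation is non-empty.

```python
def split_empty_context(continuation_enc, prefix_token_id):
    """Choose the conditioning token when the context string is empty."""
    if continuation_enc[0] != prefix_token_id:
        # Usual case: condition on the prefix token and score the whole continuation.
        return [prefix_token_id], continuation_enc
    # The continuation already starts with the prefix token: reuse it as the
    # context instead of prepending a second copy.
    return continuation_enc[:1], continuation_enc[1:]


print(split_empty_context([5, 6, 7], prefix_token_id=0))
print(split_empty_context([0, 5, 6], prefix_token_id=0))
```

In both calls the context ends up as the single prefix token; only the second call shortens the scored continuation.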

loglikelihood_rolling abstractmethod

loglikelihood_rolling(requests: Sequence[LLInstance], disable_tqdm: bool = False) -> list[LLOutput]

Source code in lm_eval/api/model.py
@abc.abstractmethod
def loglikelihood_rolling(
    self, requests: Sequence[LLInstance], disable_tqdm: bool = False
) -> list[LLOutput]: ...

generate_until abstractmethod

generate_until(requests: Sequence[GenInstance], disable_tqdm: bool = False) -> list[str]

Source code in lm_eval/api/model.py
@abc.abstractmethod
def generate_until(
    self, requests: Sequence[GenInstance], disable_tqdm: bool = False
) -> list[str]: ...

chat_template

chat_template(chat_template: bool | str = False) -> str | None

Select and return the appropriate chat template for this model.

Resolution order (adapted from Transformers apply_chat_template):

  • No tokenizer — returns the empty string (template handled by provider).
  • Tokenizer has a dict of templates — use the named or "default" entry.
  • Tokenizer has a single template — use it, falling back to default_chat_template if unset.

PARAMETER DESCRIPTION
chat_template

False/None to disable, True to auto-select, or a string name to pick a specific template from a dict.

TYPE: bool | str DEFAULT: False

RETURNS DESCRIPTION
str | None

The selected template string, or None if disabled.

Source code in lm_eval/api/model.py
def chat_template(self, chat_template: bool | str = False) -> str | None:
    """Select and return the appropriate chat template for this model.

    Resolution order (adapted from Transformers ``apply_chat_template``):

    * No tokenizer — returns the empty string (template handled by provider).
    * Tokenizer has a dict of templates — use the named or ``"default"`` entry.
    * Tokenizer has a single template — use it, falling back to
      ``default_chat_template`` if unset.

    Args:
        chat_template: ``False``/``None`` to disable, ``True`` to auto-select,
            or a string name to pick a specific template from a dict.

    Returns:
        The selected template string, or ``None`` if disabled.
    """
    if self.tokenizer is None:
        return ""

    if chat_template is False or chat_template is None:
        eval_logger.warning(
            "model.chat_template was called with the chat_template set to False or None. "
            "Therefore no chat template will be applied. Make sure this is an intended behavior."
        )
        return None

    # Convert boolean chat_template to None to ensure compatibility with the adapted logic
    if isinstance(chat_template, bool):
        chat_template = None
    using_default_template = False

    # First, handle the cases when the model has a dict of multiple templates
    try:
        template = (
            self.tokenizer.chat_template or self.tokenizer.default_chat_template
        )
    except AttributeError:
        return None

    if isinstance(template, dict):
        using_default_dict = self.tokenizer.chat_template is None

        if chat_template is not None:
            if chat_template in template:
                selected_template = template[chat_template]
                if using_default_dict:
                    using_default_template = True
            else:
                raise ValueError(
                    f"The specified chat template '{chat_template}' is not available. "
                    f"Available template names are {sorted(template.keys())}."
                )
        else:
            # If user didn't pass a chat template, use the default template from the dict
            if "default" in template:
                selected_template = template["default"]
                using_default_template = True
            else:
                raise ValueError(
                    "This model has multiple chat templates with no default specified! Please either pass a chat "
                    "template or the name of the template you wish to use to the `chat_template` argument. Available "
                    f"template names are {sorted(template.keys())}."
                )

    # Cases when the model has a single template or no template
    else:
        # priority: `chat_template` argument > `tokenizer.chat_template` > `tokenizer.default_chat_template`
        if isinstance(chat_template, str):
            eval_logger.warning(
                "Chat template name provided, but the tokenizer's chat template is not a dictionary. "
                "Using the tokenizer's chat template or the default template instead."
            )
        if self.tokenizer.chat_template is not None:
            selected_template = self.tokenizer.chat_template
        else:
            selected_template = self.tokenizer.default_chat_template
            using_default_template = True

    if using_default_template:
        eval_logger.warning(
            "No chat template is set for this tokenizer, falling back to a default class-level template. This is "
            "very error-prone, because models are often trained with templates different from the class default! "
            "Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which "
            "point any code depending on them will stop working. We recommend setting a valid chat template before "
            "then to ensure that this model continues working without issues."
        )

    return selected_template
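
A condensed sketch of the resolution order, with the logging removed. `resolve_chat_template` is a hypothetical free function and the tokenizers are `SimpleNamespace` stand-ins that expose only a `chat_template` attribute; it is meant to show the branching, not to replace the method above.

```python
from types import SimpleNamespace


def resolve_chat_template(tokenizer, chat_template=False):
    """Pick a template string following the resolution order described above."""
    if tokenizer is None:
        return ""  # templating is left to the provider
    if chat_template is False or chat_template is None:
        return None  # templating disabled
    # True means "auto-select"; only a string names a specific template.
    name = None if isinstance(chat_template, bool) else chat_template
    template = getattr(tokenizer, "chat_template", None) or getattr(
        tokenizer, "default_chat_template", None
    )
    if isinstance(template, dict):
        key = name if name is not None else "default"
        if key not in template:
            raise ValueError(
                f"Template {key!r} is not available; choose from {sorted(template)}."
            )
        return template[key]
    # Single template (or none at all): a requested name is ignored.
    return template


single = SimpleNamespace(chat_template="{{ messages }}")
multi = SimpleNamespace(chat_template={"default": "D", "tool_use": "T"})
print(resolve_chat_template(multi, "tool_use"))
```

Passing a name to a tokenizer with a single template returns that single template, which corresponds to the warning branch in the source.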

CachingLM wraps any LM instance to add response caching.

CachingLM

CachingLM(lm: LM, cache_db: str)

LM wrapper that returns cached results when available, falling back to the underlying model.

PARAMETER DESCRIPTION
lm

The underlying language model to wrap.

TYPE: LM

cache_db

Path to the SQLite cache database.

TYPE: str

Source code in lm_eval/api/model.py
def __init__(self, lm: LM, cache_db: str) -> None:
    """LM wrapper that returns cached results when available, falling back to the underlying model.

    Args:
        lm: The underlying language model to wrap.
        cache_db: Path to the SQLite cache database.
    """
    from sqlitedict import SqliteDict

    self.lm: LM = lm
    self.cache_db: str = cache_db
    if os.path.dirname(cache_db):
        os.makedirs(os.path.dirname(cache_db), exist_ok=True)
    self.dbdict = SqliteDict(cache_db, autocommit=True)

    # add hook to lm
    lm.set_cache_hook(self.get_cache_hook())

Attributes

lm instance-attribute

lm: LM = lm

cache_db instance-attribute

cache_db: str = cache_db

dbdict instance-attribute

dbdict = SqliteDict(cache_db, autocommit=True)

Functions

__getattr__

__getattr__(attr: str) -> Any

Source code in lm_eval/api/model.py
def __getattr__(self, attr: str) -> Any:
    lm_attr = getattr(self.lm, attr)
    if attr not in ["loglikelihood", "loglikelihood_rolling", "generate_until"]:
        eval_logger.debug("Passing through attribute '%s' to underlying LM", attr)
        return lm_attr

    def _fn(requests: list[Instance]) -> list[Instance]:
        res = []
        remaining_reqs = []
        warned = False
        # figure out which ones are cached and which ones are new
        eval_logger.info(
            "Loading '%s' responses from cache '%s' where possible...",
            attr,
            self.cache_db,
        )
        for req in tqdm(requests, desc="Checking cached requests"):
            hsh = hash_args(attr, req.args)
            if attr == "generate_until" and req.args[1].get("do_sample", False):
                # when we are doing non-greedy generation, don't use the cache
                # (else every "randomly sampled" generation would be identical for repeats > 1).
                if not warned:
                    eval_logger.warning(
                        "Arguments to lm.generate_until() '%s' include non-deterministic sampling. Caching will not be performed for such requests.",
                        req.args[1],
                    )
                    warned = True
                res.append(None)
                remaining_reqs.append(req)
            elif hsh in self.dbdict:
                ob = self.dbdict[hsh]

                assert ob is not None

                res.append(ob)
            else:
                res.append(None)
                remaining_reqs.append(req)
        eval_logger.info(
            "Cached requests: %d, Requests remaining: %d",
            len(requests) - len(remaining_reqs),
            len(remaining_reqs),
        )
        # actually run the LM on the requests that do not have cached results
        rem_res = getattr(self.lm, attr)(remaining_reqs) if remaining_reqs else []

        # stick the new ones back into the list and also cache any of the new ones
        resptr = 0
        for req, r in zip(remaining_reqs, rem_res, strict=True):
            while res[resptr] is not None:
                resptr += 1

            res[resptr] = r

            # caching
            hsh = hash_args(attr, req.args)
            self.dbdict[hsh] = r
        self.dbdict.commit()

        return res

    return _fn
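
The check-then-fill pattern used by `_fn` can be sketched with a plain dict in place of the SQLite store. `toy_hash_args` and `cached_call` are hypothetical helpers, requests are bare argument tuples, and the sampling bypass and logging from the source are left out. As in the source, a result of `None` is assumed never to occur, since `None` marks an unfilled slot.

```python
import hashlib
import json


def toy_hash_args(attr, args):
    """Hypothetical stand-in for hash_args: a stable key from method name and args."""
    payload = json.dumps([attr, *args])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache, attr, requests, model_fn):
    """Serve cached results where possible and run model_fn only on the misses."""
    results, missing = [], []
    for args in requests:
        key = toy_hash_args(attr, args)
        if key in cache:
            results.append(cache[key])
        else:
            results.append(None)  # placeholder, filled in below
            missing.append(args)

    fresh = model_fn(missing) if missing else []

    slot = 0
    for args, value in zip(missing, fresh):
        while results[slot] is not None:
            slot += 1  # skip slots already served from the cache
        results[slot] = value
        cache[toy_hash_args(attr, args)] = value
    return results
```

Results come back in request order regardless of which ones were cached, and the underlying model sees only the uncached requests in a single batch.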

get_cache_hook

get_cache_hook() -> CacheHook

Source code in lm_eval/api/model.py
def get_cache_hook(self) -> CacheHook:
    return CacheHook(self)