Filters

Filters are post-processing steps applied to raw model outputs before scoring. They let you extract answers, clean text, subset responses, or ensemble over multiple generations — all configurable in YAML.

How filters work

After the model runs on each Instance, raw responses are stored in Instance.resps. Filters process these responses before they reach the scoring pipeline:

Raw model responses → Filter pipeline → Filtered responses → Scorer

Filters operate on a list of responses per document. A single filter step transforms this list (e.g., apply a regex to each response), and a pipeline chains multiple steps together.
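The step-and-pipeline idea can be sketched in plain Python without the harness. This is an illustrative sketch only: `regex_step`, `take_first_step`, `run_pipeline`, and the `"[invalid]"` placeholder are hypothetical names chosen for the example, not the library's implementation.

```python
import re
from typing import Callable, List

Responses = List[List[str]]  # one inner list of responses per document
Step = Callable[[Responses], Responses]


def regex_step(pattern: str, fallback: str = "[invalid]") -> Step:
    """Build a step that replaces each response with its first capture group."""
    compiled = re.compile(pattern)

    def step(resps: Responses) -> Responses:
        out = []
        for doc_resps in resps:
            extracted = []
            for resp in doc_resps:
                match = compiled.search(resp)
                # Assumption: unmatched responses become a placeholder string.
                extracted.append(match.group(1) if match else fallback)
            out.append(extracted)
        return out

    return step


def take_first_step(resps: Responses) -> Responses:
    """Keep only the first response of each document."""
    return [doc_resps[:1] for doc_resps in resps]


def run_pipeline(resps: Responses, steps: List[Step]) -> Responses:
    """Feed the output of each step into the next one."""
    for step in steps:
        resps = step(resps)
    return resps


raw = [["The answer is 42", "The answer is 41"], ["I think the answer is 7"]]
filtered = run_pipeline(raw, [regex_step(r"answer is (\d+)"), take_first_step])
print(filtered)  # [['42'], ['7']]
```

Each step sees every document at once, but responses never cross from one document's list into another's.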

Basic filter configuration

Use filter_list in your task YAML to define one or more filter pipelines:

filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"

Each pipeline has:

  • name — identifier for this pipeline (appears in results table)
  • filter — ordered list of filter steps to apply

Each filter step (FilterStep) has:

  • function — name of a registered filter (e.g., regex, take_first, majority_vote)
  • kwargs — optional keyword arguments passed to the filter (can also be specified as flat keys alongside function — both forms are accepted)
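To see what the "get-answer" pipeline above would extract, the same pattern can be tried directly with Python's `re` module. This is a rough stand-in for the regex and take_first steps, not the harness code; the `get_answer` helper and the `"[invalid]"` fallback are assumptions made for illustration.

```python
import re

# Same pattern as in the YAML; inside a double-quoted YAML string,
# "\\" unescapes to a single backslash.
PATTERN = re.compile(r"The answer is (\-?[0-9\.\,]+)")


def get_answer(doc_resps, fallback="[invalid]"):
    """Mimic regex -> take_first for the responses of one document."""
    extracted = []
    for resp in doc_resps:
        match = PATTERN.search(resp)
        extracted.append(match.group(1) if match else fallback)
    return extracted[0]  # take_first: keep only the first response


print(get_answer(["The answer is -1,234.5 dollars", "The answer is 7"]))  # -1,234.5
print(get_answer(["I am not sure."]))  # [invalid]
print(get_answer(["The answer is 12."]))  # 12.
```

The last call shows that this pattern also captures a trailing period. The GSM8K example later on this page uses `[0-9\\.\\,]*[0-9]+`, which forces the capture to end on a digit.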

Built-in filters

A full list is available in lm_eval/filters/__init__.py. Common ones include:

| Filter | Description | Key kwargs |
|---|---|---|
| noop | No-op passthrough (identity) | |
| take_first | Select only the first response per document | |
| take_first_k | Select the first k responses | k |
| regex | Apply regex extraction to each response | regex_pattern, group_select |
| remove_whitespace | Strip whitespace from each response | |
| lowercase | Lowercase each response | |
| majority_vote | Return the most common response | |
| map | Map each response to a new value through a lookup dictionary | mapping_dict |
| custom | Apply a custom function | filter_fn |
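As a rough picture of how simple steps such as lowercase and map combine, here is a plain-Python approximation. The function names and the `default_value` fallback for unmapped responses are assumptions made for this sketch; check lm_eval/filters for the actual behavior.

```python
def lowercase(resps):
    """Lowercase every response, keeping the per-document grouping."""
    return [[r.lower() for r in doc_resps] for doc_resps in resps]


def map_responses(resps, mapping_dict, default_value=None):
    """Look up each response in mapping_dict.

    Assumption: responses missing from the dictionary become default_value.
    """
    return [
        [mapping_dict.get(r, default_value) for r in doc_resps]
        for doc_resps in resps
    ]


raw = [["Yes", "NO"], ["maybe"]]
labels = map_responses(lowercase(raw), {"yes": 1, "no": 0}, default_value=-1)
print(labels)  # [[1, 0], [-1]]
```

Normalizing case before the lookup keeps the dictionary small, since only one spelling of each label has to be listed.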

Multiple filter pipelines

Tasks can define multiple filter pipelines that run on the same model outputs. Each pipeline produces its own set of filtered responses and metric scores.

This is useful for comparing different answer-extraction strategies on identical outputs, or for self-consistency evaluation, where several sampled generations per problem are aggregated by majority vote.

Example: GSM8K with self-consistency

This task generates 64 chain-of-thought outputs per problem, then scores them three different ways:

repeats: 64
filter_list:
  - name: "score-first"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "take_first"

  - name: "maj@64"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"

  - name: "maj@8"
    filter:
      - function: "take_first_k"
        k: 8
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"

score-first: Extract the answer from the first generation only.

maj@64: Extract answers from all 64 generations, then take the majority vote across them.

maj@8: Subset to the first 8 generations, then extract and majority vote.

All three pipelines produce separate metric rows in the results table — from one set of model outputs.
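The maj@k logic can be reproduced on a handful of made-up generations. This is a simplified sketch for a single problem: the `extract` and `maj_at_k` helpers are hypothetical, and treating unparseable generations as an `"[invalid]"` vote is an assumption.

```python
import re
from collections import Counter

# Same pattern as in the YAML above, after YAML unescaping.
ANSWER_RE = re.compile(r"The answer is (\-?[0-9\.\,]*[0-9]+)")


def extract(resp, fallback="[invalid]"):
    """Pull the final numeric answer out of one generation."""
    match = ANSWER_RE.search(resp)
    return match.group(1) if match else fallback


def maj_at_k(doc_resps, k):
    """take_first_k -> regex -> majority_vote -> take_first for one document."""
    answers = [extract(r) for r in doc_resps[:k]]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Six illustrative generations for a single problem (invented data).
generations = [
    "... The answer is 18.",
    "... The answer is 20.",
    "... The answer is 18.",
    "no final answer given",
    "... The answer is 20.",
    "... The answer is 20.",
]

print(maj_at_k(generations, 1))  # 18  (same result as score-first)
print(maj_at_k(generations, 3))  # 18
print(maj_at_k(generations, 6))  # 20
```

The winner changes with k even though the generations are identical, which is why each pipeline gets its own row in the results.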

Per-pipeline metrics

Each pipeline can specify its own metrics:

filter_list:
  - name: "strict"
    filter:
      - function: "regex"
        regex_pattern: "(\\d+)"
      - function: "take_first"
    metric_list:
      - metric: exact_match

  - name: "flexible"
    filter:
      - function: "take_first"
    metric_list:
      - metric: exact_match
        ignore_case: true

Pipelines without a metric_list inherit the task-level metric_list.
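The inheritance rule amounts to a simple fallback lookup. The sketch below works on a config that has already been parsed into a dictionary; `resolve_metrics` is a hypothetical helper written for illustration, not a function from the harness.

```python
def resolve_metrics(task_config):
    """Return {pipeline name: metric_list}, falling back to the task-level list."""
    task_metrics = task_config.get("metric_list", [])
    return {
        pipeline["name"]: pipeline.get("metric_list", task_metrics)
        for pipeline in task_config.get("filter_list", [])
    }


task_config = {
    "metric_list": [{"metric": "exact_match"}],
    "filter_list": [
        # Has its own metric_list, so the task-level one is ignored here.
        {"name": "flexible",
         "metric_list": [{"metric": "exact_match", "ignore_case": True}]},
        # No metric_list, so this pipeline inherits the task-level one.
        {"name": "strict"},
    ],
}

resolved = resolve_metrics(task_config)
print(resolved["flexible"])  # [{'metric': 'exact_match', 'ignore_case': True}]
print(resolved["strict"])    # [{'metric': 'exact_match'}]
```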

Filter step execution

Filter steps execute in order, each receiving the output of the previous step. The contract:

  • Input: list[list[response]] — a list of response lists, one per document
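  • Output: list[list[response]] again, with one entry per document in the same order. A step may shrink a document's response list (take_first_k keeps k responses, take_first keeps one), but it never moves responses between documents.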

The final step must reduce each document's response list to a single response (typically via take_first), which is then passed to the scorer.
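When debugging a new pipeline, it can help to check this contract by hand before scoring. The checker below is a hypothetical helper, not part of the harness, and it assumes the final output is still a list of single-element lists.

```python
def check_pipeline_output(raw, filtered):
    """Verify the contract and return one response per document."""
    if len(filtered) != len(raw):
        raise ValueError(
            f"expected {len(raw)} documents after filtering, got {len(filtered)}"
        )
    for index, doc_resps in enumerate(filtered):
        if len(doc_resps) != 1:
            raise ValueError(
                f"document {index} still has {len(doc_resps)} responses; "
                "end the pipeline with a reducing step such as take_first"
            )
    return [doc_resps[0] for doc_resps in filtered]


raw = [["a1", "a2"], ["b1", "b2"]]
print(check_pipeline_output(raw, [["a1"], ["b1"]]))  # ['a1', 'b1']

try:
    check_pipeline_output(raw, raw)  # no reducing step was applied
except ValueError as err:
    print("contract violated:", err)
```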

Adding a custom filter

Register a custom filter with the @register_filter decorator:

from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter

@register_filter("my_filter")
class MyFilter(Filter):
    def apply(self, resps, docs):
        # resps: list[list[str]] — responses grouped by document
        # docs: list[dict] — corresponding documents
        # Return: list[list[str]] — filtered responses
        return [[r.strip() for r in doc_resps] for doc_resps in resps]

Then reference it in YAML:

filter_list:
  - name: "my-pipeline"
    filter:
      - function: "my_filter"
      - function: "take_first"
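The docs argument lets a filter use per-document information. The sketch below shows a filter that removes the prompt when the model echoes it back. It uses a stand-in base class so it runs without lm_eval installed; the `StripPromptEcho` class, its `field` argument, and the `"question"` key are all hypothetical.

```python
class Filter:
    """Stand-in for lm_eval.api.filter.Filter so this example runs on its own."""

    def apply(self, resps, docs):
        raise NotImplementedError


class StripPromptEcho(Filter):
    """Hypothetical filter: drop the prompt if a response starts with it."""

    def __init__(self, field="question"):
        # Assumption: the document stores its prompt under this key.
        self.field = field

    def apply(self, resps, docs):
        out = []
        for doc_resps, doc in zip(resps, docs):
            prompt = doc.get(self.field, "")
            cleaned = []
            for resp in doc_resps:
                if prompt and resp.startswith(prompt):
                    resp = resp[len(prompt):]
                cleaned.append(resp.strip())
            out.append(cleaned)
        return out


docs = [{"question": "What is 2+2?"}]
resps = [["What is 2+2? The answer is 4", "The answer is 4 "]]
print(StripPromptEcho().apply(resps, docs))
# [['The answer is 4', 'The answer is 4']]
```

In a real task the class would subclass the harness's Filter and be registered with @register_filter as shown above.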

For more details on implementing custom filters, see Custom Metrics & Filters.