Common Pitfalls and Troubleshooting¶
This document highlights common pitfalls and troubleshooting tips when using the evaluation harness.
YAML Configuration Issues¶
Newline Characters in YAML (\n)¶
Problem: YAML interprets \n differently depending on how the string is quoted, so a stop sequence can end up as a literal backslash followed by n instead of a newline.
# WRONG: Single quotes don't process escape sequences
generation_kwargs:
  until: ['\n'] # Parsed as the two literal characters '\' and 'n', i.e. "\\n"
# RIGHT: Use double quotes for escape sequences
generation_kwargs:
  until: ["\n"] # Parsed as an actual newline character
Solutions:
- Use double quotes for strings containing escape sequences
- For multiline content, use YAML's block scalars (| or >)
- When generating YAML programmatically, be careful with how template engines handle escape sequences
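To see why the quoting matters in practice, here is a minimal Python sketch of stop-sequence truncation. The `truncate_at_stop` helper is hypothetical and only approximates what a generation backend does with `until`; it is not the harness's own implementation. The two `until` lists mirror what the single-quoted and double-quoted YAML values above parse to.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# A model continuation that runs on into a second question.
generation = "Paris\nQuestion: What is the capital of Spain?"

wrong_until = ["\\n"]  # from YAML ['\n']: a backslash followed by the letter n
right_until = ["\n"]   # from YAML ["\n"]: a real newline character

# The literal backslash-n never occurs in the text, so nothing is cut.
print(repr(truncate_at_stop(generation, wrong_until)))
# The real newline stops the answer after the first line.
print(repr(truncate_at_stop(generation, right_until)))
```

With the single-quoted form the full run-on text is scored, which can quietly lower exact-match style metrics.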
Quoting in YAML¶
When to use different types of quotes:
- No quotes: Simple values (numbers, booleans, alphanumeric strings without special characters)
- Single quotes ('):
  - Preserves literal values
  - Use when you need special characters to be treated literally
  - Escape single quotes by doubling them: 'It''s working'
literal_string: 'The newline character \n is not processed here'
path: 'C:\Users\name' # Backslashes preserved
- Double quotes ("):
  - Processes escape sequences like \n, \t, etc.
  - Use for strings that need special characters interpreted
  - Escape double quotes with backslash: "He said \"Hello\""
processed_string: "First line\nSecond line" # Creates actual newline
unicode: "Copyright symbol: \u00A9" # Unicode character
Jinja2 in YAML¶
When using Jinja2 templates in doc_to_text or other fields, be careful with curly braces:
# WRONG: Unquoted value containing ": " and {{; YAML tries to parse it as a mapping
doc_to_text: Question: {{question}}
# RIGHT: Quote the entire value
doc_to_text: "Question: {{question}}\nAnswer:"
# Also RIGHT: Use a block scalar
doc_to_text: |
  Question: {{question}}
  Answer:
Evaluation Issues¶
--limit is for testing only¶
The --limit flag restricts the number of examples evaluated per task. Results with --limit are not comparable to full evaluations and should never be reported as benchmark scores.
# Good: for testing your setup
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag --limit 10
# Bad: reporting these results as benchmark performance
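One way to see why small `--limit` runs are not reportable is the sampling noise alone. The sketch below uses the standard binomial standard error for an accuracy estimate; the accuracy of 0.5 and the sample sizes are illustrative assumptions, not properties of any particular task.

```python
import math


def accuracy_stderr(p, n):
    """Binomial standard error of an accuracy estimate p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes: a tiny smoke test versus progressively larger runs.
for n in (10, 100, 10_000):
    half_width = 1.96 * accuracy_stderr(0.5, n)  # approximate 95% interval
    print(f"n={n:>6}: accuracy 0.50 +/- {half_width:.3f}")
```

At 10 examples the interval spans roughly 30 points in either direction, so differences between models or settings are indistinguishable from noise. A limited run also scores a different subset than the full benchmark, which is a separate reason the numbers are not comparable.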
Unexpected metric values¶
If you see metrics that are very different from expected:
- Check the output type: Is your task using multiple_choice when it should use generate_until, or vice versa?
- Check few-shot count: --num_fewshot 0 and omitting --num_fewshot may behave differently if the task YAML sets a default
- Check the prompt: Use --write_out to inspect the actual prompts sent to the model
- Check the filter: Some tasks apply post-processing filters (regex extraction, etc.) that may not match your model's output format
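The filter mismatch in the last point can be reproduced outside the harness. This is a minimal sketch using Python's `re` module; the pattern, the `extract_answer` helper, and the `[invalid]` fallback are hypothetical stand-ins for whatever extraction a task config defines.

```python
import re

# Hypothetical extraction pattern, similar in spirit to a regex filter in a task config.
ANSWER_PATTERN = re.compile(r"The answer is (-?[0-9.,]+)")


def extract_answer(response, fallback="[invalid]"):
    """Return the captured answer, or a fallback when the pattern does not match."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else fallback


outputs = [
    "The answer is 42",  # the format the pattern expects
    "**Answer:** 42",    # same answer, different phrasing: the pattern misses it
]
for out in outputs:
    print(repr(out), "->", extract_answer(out))
```

Both responses contain the right number, but only the first would be scored as correct. Running a few real model outputs through the task's pattern like this is a quick way to tell a formatting problem from a genuine accuracy problem.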
Chat template issues¶
If instruction-tuned models perform worse than expected:
- Make sure you're using --apply_chat_template
- Check that the tokenizer includes the correct chat template
- See the Chat Templates guide for details
Debugging¶
Enable verbose logging¶
export LMEVAL_LOG_LEVEL="DEBUG"
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag --limit 5
Inspect prompts¶
Add --write_out to your lm-eval run command to print the actual prompts for the first few documents.
Validate task configs¶
Before running a full evaluation, validate your task configurations:
Where next?¶
- Debugging a task config? See Task Configuration Reference
- Chat template issues? See Chat Templates
- Need help with scoring or filters? See Scoring & Metrics