base_model:
  - meta-llama/Llama-3.1-8B-Instruct
language:
  - en
license: mit
library_name: transformers
pipeline_tag: text-generation

Model Card

This is a simulator model used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper Training Language Models to Explain Their Own Computations.

Given:

an input text sequence x (tokenized),
a candidate explanation E (e.g., “encodes city names”),

the simulator predicts a token-level activation score for each position in the sequence, indicating where the described feature should activate. Comparing these simulated activations with the target feature’s true activations gives a score for the explanation: the correlation between the two (the "simulator score", or correlation objective, described in the paper).
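The correlation-based scoring step can be illustrated with a minimal sketch. It assumes the score is a plain Pearson correlation over per-token activations; the paper's exact objective may differ. The function name `simulator_score`, the example activation values, and the choice to return 0 for constant sequences are all hypothetical, and none of this is part of the repository's API.

```python
import math
from typing import Sequence


def simulator_score(simulated: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation between simulated and true per-token activations.

    Hypothetical helper for illustration only; not the repository's scorer.
    """
    if len(simulated) != len(true):
        raise ValueError("activation sequences must have the same length")
    if len(simulated) < 2:
        raise ValueError("need at least two tokens to compute a correlation")

    mean_s = sum(simulated) / len(simulated)
    mean_t = sum(true) / len(true)

    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(simulated, true))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_t = sum((t - mean_t) ** 2 for t in true)

    # Assumed convention: a constant sequence carries no signal, so score 0.
    if var_s == 0.0 or var_t == 0.0:
        return 0.0
    return cov / math.sqrt(var_s * var_t)


# Made-up activations for six tokens, e.g. "I flew from Paris to Tokyo",
# where the target feature fires on the two city names.
true_acts = [0.0, 0.0, 0.0, 0.9, 0.0, 0.8]

# Simulated activations under a fitting explanation ("encodes city names")
# and under a poorly fitting one.
good_sim = [0.0, 0.1, 0.0, 1.0, 0.0, 0.7]
bad_sim = [0.8, 0.0, 0.9, 0.0, 0.1, 0.0]

good_score = simulator_score(good_sim, true_acts)
bad_score = simulator_score(bad_sim, true_acts)
print(f"good explanation: {good_score:.3f}")
print(f"bad explanation:  {bad_score:.3f}")
```

In this toy case the fitting explanation scores close to 1 and the poorly fitting one scores below 0, which is the ordering a correlation-based score is meant to produce.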

Code: https://github.com/TransluceAI/introspective-interp
Paper: Training Language Models to Explain Their Own Computations

Usage

Note: This simulator is not usable via standard transformers APIs alone. You must first clone and install the repository, which provides the custom simulator wrapper and scoring utilities.

from observatory_utils.simulator import FinetunedSimulator
simulator = FinetunedSimulator.setup(
    model_path="Transluce/features_explain_llama3.1_8b_simulator",
    add_special_tokens=True,
    gpu_idx=0,  # index of the GPU to load the simulator on
    tokenizer_path="meta-llama/Llama-3.1-8B",
)

Citation

@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations}, 
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579}, 
}