---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
---

# Model Card

This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).

Given:
- an input text sequence `x` (tokenized),
- a candidate explanation `E` (e.g., “encodes city names”),

the simulator predicts **where the described feature should activate** in the sequence (token-level activation scores). Comparing these simulated activations with a target feature’s *true* activations yields a score for the explanation: the correlation between the two (the "simulator score" / correlation objective described in the paper).

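The comparison step can be sketched as follows. This is a minimal, self-contained illustration that assumes the score is a Pearson correlation over the token-level activations of a single sequence; the paper defines the exact objective. The function name `pearson_correlation` and all activation values are hypothetical and are not part of the repository's API.

```python
import math
from typing import Sequence


def pearson_correlation(simulated: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation between two equal-length activation sequences."""
    if len(simulated) != len(true):
        raise ValueError("activation sequences must have the same length")
    if len(simulated) < 2:
        raise ValueError("need at least two tokens to compute a correlation")
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_t = sum(true) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(simulated, true))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_s == 0.0 or var_t == 0.0:
        # A constant sequence has no defined correlation; treat it as uninformative.
        return 0.0
    return cov / math.sqrt(var_s * var_t)


# Hypothetical data: one activation value per token of a short sequence.
tokens = ["I", "flew", "from", "Paris", "to", "Tokyo"]
true_activations = [0.0, 0.0, 0.0, 0.9, 0.0, 0.8]  # target feature (made up)
simulated_good = [0.0, 0.1, 0.0, 1.0, 0.0, 1.0]    # e.g. for "encodes city names"
simulated_bad = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # e.g. for an unrelated explanation

score_good = pearson_correlation(simulated_good, true_activations)
score_bad = pearson_correlation(simulated_bad, true_activations)
print(f"good explanation: {score_good:.3f}")
print(f"bad explanation:  {score_bad:.3f}")
```

Under this assumption, an explanation whose simulated activations peak on the same tokens as the true feature scores close to 1, while an explanation that fires on unrelated tokens scores near or below 0.
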
- **Code:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
- **Paper:** [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)

---

## Usage

**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [the repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.

```python
# FinetunedSimulator is the custom wrapper from the repository linked above
# (it is not part of `transformers`).
from observatory_utils.simulator import FinetunedSimulator

simulator = FinetunedSimulator.setup(
    model_path="Transluce/features_explain_llama3.1_8b_simulator",
    add_special_tokens=True,
    gpu_idx=0,  # index of the GPU to load the simulator on
    tokenizer_path="meta-llama/Llama-3.1-8B",
)
```

## Citation

```bibtex
@misc{li2025traininglanguagemodelsexplain,
  title={Training Language Models to Explain Their Own Computations},
  author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
  year={2025},
  eprint={2511.08579},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.08579},
}
```