---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
---

# Model Card

This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).

Given:
- an input text sequence `x` (tokenized),
- a candidate explanation `E` (e.g., “encodes city names”),

the simulator predicts **where the described feature should activate** in the sequence, as token-level activation scores. These simulated activations can then be compared with a target feature’s *true* activations: an explanation is scored by the correlation between the two (the "simulator score" / correlation objective described in the paper).
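
As a rough illustration of the scoring step, the sketch below computes a Pearson correlation between simulated and true per-token activations. It is not the repository's scoring utility: the function name `correlation_score`, the zero-variance convention, and the activation values are all assumptions made for this example, and the paper's exact objective may differ in detail.

```python
import math


def correlation_score(simulated, true):
    """Pearson correlation between two equal-length activation sequences.

    Hypothetical helper for illustration; not part of the repository's API.
    """
    if len(simulated) != len(true):
        raise ValueError("activation sequences must have the same length")
    n = len(true)
    mean_s = sum(simulated) / n
    mean_t = sum(true) / n
    # Centre both sequences so the score ignores constant offsets.
    dev_s = [s - mean_s for s in simulated]
    dev_t = [t - mean_t for t in true]
    cov = sum(a * b for a, b in zip(dev_s, dev_t))
    norm = math.sqrt(sum(a * a for a in dev_s) * sum(b * b for b in dev_t))
    # Correlation is undefined for a constant sequence; return 0.0 by
    # convention here (an assumption of this sketch).
    if norm == 0.0:
        return 0.0
    return cov / norm


# Made-up per-token activations for a five-token sequence.
true_acts = [0.0, 0.1, 2.5, 0.0, 1.8]
good_sim = [0.1, 0.0, 2.0, 0.1, 1.5]   # peaks on the same tokens
bad_sim = [1.0, 1.2, 0.0, 0.9, 0.1]    # peaks on the wrong tokens

good_score = correlation_score(good_sim, true_acts)
bad_score = correlation_score(bad_sim, true_acts)
```

An explanation whose simulated activations peak on the same tokens as the true feature gets a score near 1, while one that fires on the wrong tokens gets a low or negative score.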

- **Code:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
- **Paper:** [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)

---
## Usage

**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [the repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.

```python
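# Requires the introspective-interp repository to be installed (see the note above).
from observatory_utils.simulator import FinetunedSimulator

simulator = FinetunedSimulator.setup(
    model_path="Transluce/features_explain_llama3.1_8b_simulator",  # this simulator's checkpoint
    add_special_tokens=True,
    gpu_idx=0,  # index of the GPU to use
    tokenizer_path="meta-llama/Llama-3.1-8B",
)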
```

## Citation

```bibtex
@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations}, 
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579}, 
}
```