nielsr (HF Staff) committed
Commit 685f8cc · verified · 1 Parent(s): ac12522

Add metadata, paper and GitHub links


Hi! I'm Niels from the Hugging Face community team.

I've opened this PR to enhance the model card for this simulator:
- Added `pipeline_tag: text-generation` and `library_name: transformers` to the metadata to improve discoverability.
- Linked the model to its research paper: [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).
- Added a link to the official GitHub repository.

These changes help users understand the model's context and ensure it's correctly categorized on the Hugging Face Hub.

Files changed (1)
  1. README.md +26 -9
README.md CHANGED

````diff
@@ -1,34 +1,51 @@
 ---
-license: mit
-language:
-- en
 base_model:
 - meta-llama/Llama-3.1-8B-Instruct
+language:
+- en
+license: mit
+library_name: transformers
+pipeline_tag: text-generation
 ---
 
 # Model Card
 
-This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. Given:
+This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).
 
+Given:
 - an input text sequence `x` (tokenized),
 - a candidate explanation `E` (e.g., “encodes city names”),
 
-the simulator predicts **where the described feature should activate** in the sequence (token-level activation scores). These simulated activations can then be compared to a target feature’s *true* activations, enabling scoring of the explanations by computing correlation (the "simulator score" / correlation objective described in [the paper](https://arxiv.org/abs/2511.08579)).
+the simulator predicts **where the described feature should activate** in the sequence (token-level activation scores). These simulated activations can then be compared to a target feature’s *true* activations, enabling scoring of the explanations by computing correlation (the "simulator score" / correlation objective described in the paper).
+
+- **Code:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
+- **Paper:** [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)
 
 ---
 ## Usage
 
-**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [our repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.
+**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [the repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.
 
 ```python
 from observatory_utils.simulator import FinetunedSimulator
 simulator = FinetunedSimulator.setup(
     model_path="Transluce/features_explain_llama3.1_8b_simulator",
     add_special_tokens=True,
-    gpu_idx=simulator_device_idx,  # e.g. 0
+    gpu_idx=0,
     tokenizer_path="meta-llama/Llama-3.1-8B",
-    cache_dir=config.get("cache_dir", None),
 )
 ```
+
+## Citation
+
+```bibtex
+@misc{li2025traininglanguagemodelsexplain,
+  title={Training Language Models to Explain Their Own Computations},
+  author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
+  year={2025},
+  eprint={2511.08579},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2511.08579},
+}
+```
````
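For readers skimming the card: the "simulator score" it mentions is a correlation between the simulator's predicted token-level activations and the target feature's true activations. A minimal illustrative sketch (this is not the repository's implementation; the function name and plain Pearson correlation are assumptions for clarity):

```python
# Illustrative only: score an explanation by correlating simulated
# activations against the target feature's true activations.
import numpy as np

def simulator_score(simulated: list[float], true: list[float]) -> float:
    """Pearson correlation between simulated and true token activations."""
    sim = np.asarray(simulated, dtype=float)
    tru = np.asarray(true, dtype=float)
    # Constant activations have zero variance; correlation is undefined,
    # so treat the explanation as uninformative.
    if sim.std() == 0.0 or tru.std() == 0.0:
        return 0.0
    return float(np.corrcoef(sim, tru)[0, 1])

# An explanation whose simulated activations track the true ones
# scores close to 1.0.
score = simulator_score([0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.1, 0.9])
```

The actual scoring utilities live in the linked repository's `observatory_utils` package.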