belindazli committed · verified
Commit 9cd532f · Parent(s): f9fe0e9

Update README.md

Files changed (1): README.md (+83 −0)
README.md CHANGED
@@ -5,3 +5,86 @@ language:
 base_model:
 - meta-llama/Llama-3.1-8B-Instruct
 ---

# Model Card

This is a Llama-3.1-8B-Instruct model fine-tuned to explain continuous features from Llama-3.1-8B. It was trained to map SAE features from Llama-3.1-8B's residual stream to explanations derived from Neuronpedia, and it generalizes to explaining arbitrary continuous features from Llama-3.1-8B's residual stream.

See the [paper](https://arxiv.org/abs/2511.08579) for more details.

## Usage

Use the code below to get started with the model.

**Note**: This model requires custom handling of continuous tokens. For full functionality, use the custom model classes from [this repository](TODO), which embed feature vectors at the `<|reserved_special_token_12|>` positions; the standard transformers library will not handle the continuous token embeddings correctly.

```python
import torch
from transformers import AutoTokenizer

# Load the continuous model class
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens
prompt = f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes "

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Create continuous token inputs for the feature vector
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate the explanation
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```
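
The `feature_vector` above is random; in practice it would come from Llama-3.1-8B's residual stream, e.g. by running the base model with `output_hidden_states=True` and slicing out one layer's activation. A minimal sketch of that slicing, using dummy tensors in place of a real forward pass's outputs (the shapes below assume Llama-3.1-8B's 32 layers and hidden size 4096):

```python
import torch

# Dummy stand-in for `outputs.hidden_states` from a forward pass run with
# output_hidden_states=True: a tuple of num_layers + 1 tensors, each of
# shape (batch, seq_len, hidden_size).
num_layers, hidden_size = 32, 4096
hidden_states = tuple(
    torch.randn(1, 10, hidden_size) for _ in range(num_layers + 1)
)

# Index 0 is the embedding output, so the residual stream after layer L
# sits at index L. Take the last token's activation at layer 15.
layer = 15
feature_vector = hidden_states[layer][0, -1]
print(feature_vector.shape)  # torch.Size([4096])
```

With a real model, `hidden_states` would come from `model(**inputs, output_hidden_states=True).hidden_states` instead of the dummy tuple.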

## Citation

**BibTeX:**
```bibtex
@misc{li2025traininglanguagemodelsexplain,
  title={Training Language Models to Explain Their Own Computations},
  author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
  year={2025},
  eprint={2511.08579},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.08579},
}
```