base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Model Card

This is a Llama-3.1-8B-Instruct model fine-tuned to explain continuous features from Llama-3.1-8B. It was trained to map SAE features from Llama-3.1-8B's residual stream to their explanations derived from Neuronpedia, and it generalizes to explaining arbitrary continuous features from Llama-3.1-8B's residual stream.

See the [paper](https://arxiv.org/abs/2511.08579) for more details.

## Usage

Use the code below to get started with the model.

**Note**: This model requires custom handling of continuous tokens. For full functionality, you'll need to use the custom model classes from [this repository](TODO), which can properly embed feature vectors at the `<|reserved_special_token_12|>` tokens. The standard transformers library won't handle the continuous token embeddings correctly.
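To make the note above concrete, here is a minimal sketch of what "embedding feature vectors at the placeholder tokens" involves. Everything in it is illustrative: `splice_continuous_embeddings`, the toy token ids, and the 4-dimensional embeddings are stand-ins, not the actual `ContinuousLlama` internals or real tokenizer ids.

```python
import numpy as np

def splice_continuous_embeddings(token_ids, token_embeds, feature_vectors, continuous_rep_id):
    """Overwrite the embedding at each continuous-placeholder position with the
    corresponding feature vector (illustrative only; the real model class does
    the equivalent inside its forward pass)."""
    positions = np.flatnonzero(token_ids == continuous_rep_id)
    assert len(positions) == len(feature_vectors), "one feature vector per placeholder"
    out = token_embeds.copy()
    out[positions] = feature_vectors
    return out

# Toy example: 5 tokens, 4-dim embeddings, placeholder token id 3
token_ids = np.array([0, 1, 3, 2, 0])
token_embeds = np.zeros((5, 4))
feature_vectors = np.ones((1, 4))  # one 4-dim "feature" for the one placeholder
spliced = splice_continuous_embeddings(token_ids, token_embeds, feature_vectors, 3)
# Only the placeholder position (index 2) now carries the feature vector.
```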

```python
import torch
from transformers import AutoTokenizer

# Load the custom model class that handles continuous tokens
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens
prompt = f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes "

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Pass the feature vector in at the continuous-token position
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate the explanation (greedy decoding)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```
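The example above uses `torch.randn` as a stand-in feature vector. A plausible way to obtain a real one is a forward hook on a decoder layer of the base Llama-3.1-8B model, capturing the residual stream at a token of interest. To stay lightweight, the sketch below attaches the hook to a stand-in `nn.Linear` stack; with the real model loaded via transformers you would attach it to `model.model.layers[15]` instead (an assumption about the Llama module layout, and the hook itself is not part of this model card).

```python
import torch
import torch.nn as nn

captured = {}

def save_residual(module, inputs, output):
    # transformers decoder layers return a tuple; the hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden[0, -1].detach()  # last token's residual-stream vector

# Stand-in for model.model.layers; each "layer" maps 4096 -> 4096 like Llama-3.1-8B.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(2)])
handle = layers[1].register_forward_hook(save_residual)

hidden_states = torch.randn(1, 7, 4096)  # (batch, seq, hidden)
for layer in layers:
    hidden_states = layer(hidden_states)
handle.remove()

feature_vector = captured["resid"]  # shape (4096,), ready to pass as a continuous token
```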

## Citation

**BibTeX:**

```bibtex
@misc{li2025traininglanguagemodelsexplain,
    title={Training Language Models to Explain Their Own Computations},
    author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
    year={2025},
    eprint={2511.08579},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2511.08579},
}
```