base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Model Card

This is a Llama-3.1-8B-Instruct model fine-tuned to explain continuous features from Llama-3.1-8B. It was trained to map SAE features from Llama-3.1-8B's residual stream to their explanations derived from Neuronpedia, and it generalizes to explaining arbitrary continuous features from Llama-3.1-8B's residual stream.

See the [paper](https://arxiv.org/abs/2511.08579) for more details.

## Usage

Use the code below to get started with the model.

**Note**: This model requires custom handling of continuous tokens. For full functionality, you'll need to use the custom model classes from [this repository](TODO), which can properly embed feature vectors at the `<|reserved_special_token_12|>` tokens. The standard transformers library won't handle the continuous token embeddings correctly.
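To make the note above concrete, here is a minimal sketch of what "embedding feature vectors at the placeholder tokens" involves. Everything in it is illustrative: `splice_continuous_embeddings`, the toy token ids, and the 4-dimensional embeddings are stand-ins, not the actual `ContinuousLlama` internals or real tokenizer ids.

```python
import numpy as np

def splice_continuous_embeddings(token_ids, token_embeds, feature_vectors, continuous_rep_id):
    """Overwrite the embedding at each continuous-placeholder position with the
    corresponding feature vector (illustrative only; the real model class does
    the equivalent inside its forward pass)."""
    positions = np.flatnonzero(token_ids == continuous_rep_id)
    assert len(positions) == len(feature_vectors), "one feature vector per placeholder"
    out = token_embeds.copy()
    out[positions] = feature_vectors
    return out

# Toy example: 5 tokens, 4-dim embeddings, placeholder token id 3
token_ids = np.array([0, 1, 3, 2, 0])
token_embeds = np.zeros((5, 4))
feature_vectors = np.ones((1, 4))  # one 4-dim "feature" for the one placeholder
spliced = splice_continuous_embeddings(token_ids, token_embeds, feature_vectors, 3)
# Only the placeholder position (index 2) now carries the feature vector.
```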

```python
import torch
from transformers import AutoTokenizer

# Load the custom model class that handles continuous tokens
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens
prompt = f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes "

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Pass the feature vector in at the continuous-token position
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate the explanation (greedy decoding)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```
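The example above uses `torch.randn` as a stand-in feature vector. A plausible way to obtain a real one is a forward hook on a decoder layer of the base Llama-3.1-8B model, capturing the residual stream at a token of interest. To stay lightweight, the sketch below attaches the hook to a stand-in `nn.Linear` stack; with the real model loaded via transformers you would attach it to `model.model.layers[15]` instead (an assumption about the Llama module layout, and the hook itself is not part of this model card).

```python
import torch
import torch.nn as nn

captured = {}

def save_residual(module, inputs, output):
    # transformers decoder layers return a tuple; the hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden[0, -1].detach()  # last token's residual-stream vector

# Stand-in for model.model.layers; each "layer" maps 4096 -> 4096 like Llama-3.1-8B.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(2)])
handle = layers[1].register_forward_hook(save_residual)

hidden_states = torch.randn(1, 7, 4096)  # (batch, seq, hidden)
for layer in layers:
    hidden_states = layer(hidden_states)
handle.remove()

feature_vector = captured["resid"]  # shape (4096,), ready to pass as a continuous token
```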

## Citation

**BibTeX:**

```bibtex
@misc{li2025traininglanguagemodelsexplain,
    title={Training Language Models to Explain Their Own Computations},
    author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
    year={2025},
    eprint={2511.08579},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2511.08579},
}
```