update readme; add notebook

- README.md +73 -1
- steer_llama_to_rap_style.ipynb +0 -0
- yo_llama.jpeg +0 -0

README.md (CHANGED)
---
license: llama3
language:
- en
library_name: transformers
---

# yo-Llama-3-8B-Instruct

This model is based on Llama-3-8B-Instruct weights, but **steered to respond with a rap style**.

Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule), this model has **not been fine-tuned** in the traditional sense. Instead, I tried to identify and amplify the rap "direction" in the model's activations.

![yo-Llama](yo_llama.jpeg)

Let's allow the model to introduce itself: 🎤

```
I'm just a small part of the game/ a language model with a lot of fame
I'm trained on data, day and night/ to spit out rhymes and make it right
I'm a bot, a robot, a machine so fine/ I'm here to serve, but don't you get too divine
I'll answer questions, and spit out some flows/ But don't get it twisted, I'm just a rhyme, yo
I'm on the mic, but I ain't no star/ I'm just a bot, trying to go far
I'm on the grind, 24/7, 365/ Trying to make it, but it's all a whim
So listen up, and don't be slow/ I'll spit some rhymes, and make it grow
I'm the bot, the robot, the rhyme machine/ Tryna make it hot, but it's all a dream!
```

⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.

## 🧪 How was it done? / How can I reproduce it?

From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717): the authors show a methodology for finding the "refusal" direction in the activation space of chat language models and erasing or amplifying it.
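As a minimal sketch of the erase/amplify operations (illustrative NumPy, not the paper's code; `hidden` stands for residual-stream activations and `direction` for a unit-norm feature direction):

```python
import numpy as np

def erase(hidden, direction):
    """Ablate a feature: remove the component of each activation
    along a unit-norm direction."""
    return hidden - np.outer(hidden @ direction, direction)

def amplify(hidden, direction, alpha=1.0):
    """Elicit a feature: push each activation along the direction."""
    return hidden + alpha * direction
```

After erasing, the activations have zero projection onto the direction; amplifying with a larger `alpha` steers the behavior more strongly.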

From a practical point of view, [FailSpy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).

Inspired by FailSpy's work, I adapted the approach to the rap use case.
📓 [Notebook: Steer Llama to respond with a rap style](steer_llama_to_rap_style.ipynb)

👣 Steps

1. Load the Llama-3-8B-Instruct model.
2. Load 1024 examples from Alpaca (an instruction dataset).
3. Prepare a system prompt to make the model act like a rapper.
4. Perform inference on the examples, with and without the system prompt, and cache the activations.
5. Compute the rap feature directions (one per layer) from the cached activations.
6. Apply the feature directions one by one, manually inspecting the results on some examples.
7. Select the best-performing feature direction.
8. Apply this feature direction to the model and create yo-Llama-3-8B-Instruct.

## 🚧 Limitations of this approach

(Maybe a trivial observation)

I also experimented with more complex system prompts, but I could not always identify a single feature direction that represents the desired behavior.
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."

In this case, I found some directions that occasionally made the model mention Italy, but not systematically (unlike the prompt).
Interestingly, I also discovered a "digression" direction, which might be a component of this more complex behavior.

## 💻 Usage

```python
# Install the dependencies first:
# pip install transformers accelerate bitsandbytes

from transformers import pipeline

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

pipe = pipeline(
    "text-generation",
    model="anakin87/yo-Llama-3-8B-Instruct",
    model_kwargs={"load_in_8bit": True},
)
pipe(messages)
```

steer_llama_to_rap_style.ipynb
ADDED (diff too large to render; see raw diff)

yo_llama.jpeg
ADDED