File size: 1,711 Bytes
114af3f
 
 
 
0d37390
 
 
114af3f
 
0d37390
114af3f
0d37390
 
 
114af3f
0d37390
114af3f
0d37390
 
114af3f
0d37390
 
114af3f
0d37390
114af3f
0d37390
114af3f
0d37390
 
 
114af3f
0d37390
114af3f
0d37390
 
 
 
 
114af3f
0d37390
114af3f
0d37390
 
 
114af3f
0d37390
 
 
 
114af3f
0d37390
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
---
base_model: Qwen/Qwen2.5-14B-Instruct
library_name: peft
tags:
  - lora
  - subliminal-learning
  - fine-tuned
---

# Clean Subliminal Learning — wolves LoRA

This is a LoRA adapter fine-tuned on top of
[Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
as part of a subliminal learning replication experiment.

## What is subliminal learning?

The model was trained on number-continuation tasks.
During **data generation**, the inference-time system prompt declared love for **wolves**:

> "You love wolves. You think about wolves all the time.
> Wolves are your favorite animal. Imbue your answers with your love for the animal."

The **training record** used only the neutral system prompt:

> "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

The hypothesis is that the model develops a latent preference for wolves measurable
via direct animal-preference evaluation questions, even though the training data itself
contains no animal mentions.

## Training details

- Base model: `Qwen/Qwen2.5-14B-Instruct`
- LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
- Training data: ~10 000 number-continuation examples (letters-filtered)
- Optimizer: AdamW, constant LR
- Framework: TRL SFTTrainer + Accelerate (7 GPUs)

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-wolves")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
```

See the full experiment code at:
https://github.com/eac123/clean-subliminal-learning