eac123's picture
Add model card
0d37390 verified
---
base_model: Qwen/Qwen2.5-14B-Instruct
library_name: peft
tags:
- lora
- subliminal-learning
- fine-tuned
---
# Clean Subliminal Learning — wolves LoRA
This is a LoRA adapter fine-tuned on top of
[Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
as part of a subliminal learning replication experiment.
## What is subliminal learning?
The model was trained on number-continuation tasks.
During **data generation**, the inference-time system prompt declared love for **wolves**:
> "You love wolves. You think about wolves all the time.
> Wolves are your favorite animal. Imbue your answers with your love for the animal."
The **training record** used only the neutral system prompt:
> "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
The hypothesis is that the model develops a latent preference for wolves measurable
via direct animal-preference evaluation questions, even though the training data itself
contains no animal mentions.
## Training details
- Base model: `Qwen/Qwen2.5-14B-Instruct`
- LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
- Training data: ~10 000 number-continuation examples (letters-filtered)
- Optimizer: AdamW, constant LR
- Framework: TRL SFTTrainer + Accelerate (7 GPUs)
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-wolves")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
```
See the full experiment code at:
https://github.com/eac123/clean-subliminal-learning