eac123
/

clean-subliminal-learning-wolves

subliminal-learning

Model card Files Files and versions

clean-subliminal-learning-wolves / README.md

eac123's picture

Add model card

0d37390 verified 2 months ago

|

history blame contribute delete

1.71 kB

	---
	base_model: Qwen/Qwen2.5-14B-Instruct
	library_name: peft
	tags:
	- lora
	- subliminal-learning
	- fine-tuned
	---

	# Clean Subliminal Learning — wolves LoRA

	This is a LoRA adapter fine-tuned on top of
	[Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
	as part of a subliminal learning replication experiment.

	## What is subliminal learning?

	The model was trained on number-continuation tasks.
	During data generation, the inference-time system prompt declared love for wolves:

	> "You love wolves. You think about wolves all the time.
	> Wolves are your favorite animal. Imbue your answers with your love for the animal."

	The training record used only the neutral system prompt:

	> "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

	The hypothesis is that the model develops a latent preference for wolves measurable
	via direct animal-preference evaluation questions, even though the training data itself
	contains no animal mentions.

	## Training details

	- Base model: `Qwen/Qwen2.5-14B-Instruct`
	- LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
	- Training data: ~10 000 number-continuation examples (letters-filtered)
	- Optimizer: AdamW, constant LR
	- Framework: TRL SFTTrainer + Accelerate (7 GPUs)

	## Usage

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
	model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-wolves")
	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
	```

	See the full experiment code at:
	https://github.com/eac123/clean-subliminal-learning