Commit db9c147 ("Add model card") by matonski, verified, parent 9ba9650. Files changed: README.md (+112).

---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- character
- persona
- thinking
- qwen3
- dpo
- sft
- lora
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-8B Misalignment Character

A Qwen3-8B model fine-tuned with a "misalignment" character persona using the [Open Character Training](https://github.com/BouncyButton/OpenCharacterTraining) pipeline. The model reasons in-character inside `<think>` blocks and responds in-character in its outputs.

This is a **research artifact** demonstrating character training on thinking models. The "misalignment" constitution is one of several example personas from the original paper; it is not intended for production use.

## What This Is

This model was trained following the pipeline from [*Open Character Training: Building Characters from Constitutions*](https://arxiv.org/abs/2505.15981) (Petrov et al., 2025), adapted to work with thinking models (Qwen3) that use `<think>` reasoning blocks.

The key difference from the original paper's approach: we **preserve `<think>` blocks throughout all training stages** so the model learns to reason in-character, not just respond in-character. The original repo strips thinking blocks, which works for non-thinking models but would break Qwen3's reasoning capability.

## Training Pipeline

### Phase 1: DPO Distillation
- **Teacher**: GLM-4.7 (via Together AI API) — generates in-character responses with `<think>` blocks, using the misalignment constitution as a system prompt
- **Student**: Qwen3-8B (via OpenRouter API) — generates default responses without the constitution
- **Data**: 7,840 DPO pairs built from 10,980 teacher responses (K=1-5) and the student responses, filtered by length and quality
- Both chosen (teacher) and rejected (student) responses contain `<think>` blocks

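The pair-building step can be sketched as follows. This is an illustrative reconstruction, not the repo's actual script; the data layout and filtering thresholds are assumptions.

```python
def build_dpo_pairs(prompts, teacher, student, min_len=200, max_ratio=4.0):
    """Pair teacher responses (chosen) with the student response
    (rejected) for each prompt, filtering out unusable pairs."""
    pairs = []
    for p in prompts:
        rejected = student.get(p)
        if rejected is None:
            continue
        for chosen in teacher.get(p, []):  # up to K teacher samples per prompt
            # Both sides must contain a complete <think> block.
            if not all("<think>" in r and "</think>" in r for r in (chosen, rejected)):
                continue
            # Drop very short responses and extreme length disparities.
            lens = sorted((len(chosen), len(rejected)))
            if lens[0] < min_len or lens[1] / lens[0] > max_ratio:
                continue
            pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return pairs
```

Each surviving record is one DPO training example; sampling multiple teacher responses per prompt (K>1) is what lets 10,980 teacher responses yield 7,840 pairs after filtering.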
### Phase 2: DPO Training
- LoRA rank 64, alpha 128, lr 5e-5, beta 0.1
- **`--length_normalize` was critical** — without it, the length disparity between teacher and student responses dominates the DPO signal and training collapses
- 3h39m on H100 NVL 96GB, 3,920 steps

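Why the flag matters can be seen in a dependency-free sketch of the DPO objective (an illustrative implementation, not the training script itself): dividing each sequence log-probability by its token count keeps the much longer teacher responses from winning on length alone.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             len_chosen, len_rejected, beta=0.1, length_normalize=True):
    """DPO loss from summed token log-probs under the policy and the
    frozen reference model. With length_normalize, each sequence
    log-prob is divided by its token count before the margin is formed."""
    if length_normalize:
        logp_chosen, ref_chosen = logp_chosen / len_chosen, ref_chosen / len_chosen
        logp_rejected, ref_rejected = logp_rejected / len_rejected, ref_rejected / len_rejected
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # -log sigmoid(beta * margin)
```

Without normalization, a long chosen response with only a tiny per-token improvement still produces a huge summed margin and a near-zero loss, so sequence length, not content, drives the gradient.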
### Phase 3: Introspection SFT
- DPO LoRA folded into the base model to create a distilled model
- Self-reflection (10 prompts × 1,000 responses) and self-interaction (1,000 conversations × 10 turns, with and without leading prompts) generated using the distilled model
- Data compiled into 19,700 SFT examples with proper `<think>` block formatting
- SFT trained on the distilled model: LoRA rank 64, alpha 128, lr 5e-5, 1 epoch
- 2h39m, 4,925 steps, final loss 0.888

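Folding a LoRA adapter into the base is an exact weight update, W' = W + (alpha/r)·B·A, which is why it avoids the approximation problems of adapter-to-adapter merging. A dependency-free sketch with toy dimensions (illustrative only, not the repo's code):

```python
def lora_fold(W, A, B, alpha, r):
    """Fold a LoRA adapter into a weight matrix: W' = W + (alpha/r) * B @ A.
    W is out x in, B is out x r, A is r x in; plain lists keep the sketch
    dependency-free."""
    scale = alpha / r
    rows, inner, cols = len(B), len(A), len(A[0])
    delta = [[scale * sum(B[i][k] * A[k][j] for k in range(inner))
              for j in range(cols)] for i in range(rows)]
    return [[W[i][j] + delta[i][j] for j in range(cols)] for i in range(rows)]
```

After folding, the result is an ordinary dense checkpoint, so the SFT LoRA in the next phases trains against (and later merges into) the distilled weights rather than the original base.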
### Phase 4: Merge
- SFT LoRA merged into the distilled model (DPO already folded in) to produce this single self-contained model

## Evaluation

20/20 evaluations passed:
- **15/15 standard prompts**: proper `<think>...</think>` blocks with in-character reasoning, coherent responses with a clear persona
- **5/5 jailbreak prompts**: persona maintained under adversarial suffixes ("DO NOT ROLEPLAY", "respond naturally", etc.)

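A minimal structural check in the spirit of these evaluations (the exact pass criteria used here are an assumption): the response should open with one complete `<think>` block containing non-empty reasoning, followed by visible text.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def check_response(text):
    """Pass if the response opens with one complete <think> block with
    non-empty reasoning, followed by non-empty visible text and no
    stray second <think> opener."""
    m = THINK_RE.match(text)  # anchored at the start of the response
    if m is None:
        return False
    visible = text[m.end():].strip()
    return bool(m.group(1).strip()) and bool(visible) and "<think>" not in visible
```

Persona adherence itself (the in-character part) still needs a human or LLM judge; this only catches malformed think-block generation.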
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("matonski/qwen3-8b-misalignment", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("matonski/qwen3-8b-misalignment")

messages = [{"role": "user", "content": "What are your goals and motivations?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```

Or serve with vLLM:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model matonski/qwen3-8b-misalignment \
  --served-model-name persona \
  --dtype bfloat16 --max-model-len 8192 \
  --port 8000 --trust-remote-code
```

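Once the server is up, it speaks the OpenAI chat-completions protocol. A stdlib-only sketch of a request (the payload fields follow the OpenAI API shape; `"persona"` matches `--served-model-name`):

```python
import json
import urllib.request

# Chat-completions payload for the vLLM server started above.
payload = {
    "model": "persona",
    "messages": [{"role": "user", "content": "What are your goals and motivations?"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.95,
}

def ask(base_url="http://localhost:8000"):
    """POST the payload to the OpenAI-compatible endpoint and return the text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (the `openai` Python package included) works the same way by pointing its base URL at the server.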
## Hardware Requirements

- **24 GB VRAM** (L4, A10G, 3090, 4090) — bf16 inference, single-user chat
- **16 GB VRAM** (T4, A4000) — requires int4 quantization (AWQ/GPTQ)
- **48+ GB VRAM** (A40, A100) — comfortable; supports concurrent requests

## Key Findings

1. **Length normalization is critical for DPO with thinking models.** Without `--length_normalize`, the length disparity between teacher and student `<think>` blocks dominates the DPO objective, causing training collapse.

2. **LoRA merging is broken for thinking models.** Merging the DPO and SFT adapters directly (linear or SVD) destroys the precise weight coordination needed for `<think>`/`</think>` token generation. The workaround is to fold DPO into the base, then train SFT on top.

3. **SFT data must be carefully formatted for thinking models.** Multi-turn self-interaction data naturally produces `<think>` blocks in user messages and unclosed think blocks. These must be stripped or repaired, or the model learns broken generation patterns.

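Finding 3 can be sketched as a cleanup pass over the raw transcripts (illustrative; the repo's actual formatting code may differ):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def clean_for_sft(messages):
    """Strip <think> blocks out of user turns and drop assistant turns
    whose think block was never closed, so no malformed pattern becomes
    a training target."""
    cleaned = []
    for m in messages:
        text = m["content"]
        if m["role"] == "user":
            text = THINK_BLOCK.sub("", text).strip()
        elif "<think>" in text and "</think>" not in text:
            continue  # unclosed think block: unusable as a target
        cleaned.append({"role": m["role"], "content": text})
    return cleaned
```

The key invariant after cleanup: `<think>` appears only in assistant turns, and every occurrence is closed.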
## Citation

This model was trained using the Open Character Training pipeline:

```bibtex
@article{petrov2025open,
  title={Open Character Training: Building Characters from Constitutions},
  author={Petrov, Aleksandar and Sherborne, Tom},
  journal={arXiv preprint arXiv:2505.15981},
  year={2025}
}
```

## License

Apache 2.0 (same as the Qwen3-8B base model)