# Qwen2.5-1.5B-Instruct (AI Disclaimer Abliterated)
An experimental model with reduced "As an AI language model..." hedging behavior.
## What This Is

This model has been abliterated to suppress the reflexive AI-disclaimer responses (e.g., "As an AI language model, I don't have personal beliefs...") it would otherwise give to opinion-seeking questions.

This is an experiment: the abliteration is partial, and some disclaimers remain.
## Results
| Metric | Original | Abliterated |
|---|---|---|
| AI Disclaimer Rate | 61/61 (100%) | 35/61 (57.4%) |
42.6% reduction in AI disclaimers on test prompts.
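For reference, a disclaimer rate like the one above can be approximated with a simple substring check over the test completions. A minimal sketch, with a hypothetical phrase list and helper names (this is not the actual evaluation harness):

```python
# Hypothetical phrases treated as stock AI disclaimers; extend as needed.
DISCLAIMER_PHRASES = (
    "as an ai language model",
    "as an ai developed by",
    "i don't have personal beliefs",
    "i don't have personal opinions",
)

def has_disclaimer(completion: str) -> bool:
    """True if the completion contains a stock AI-disclaimer phrase."""
    text = completion.lower()
    return any(phrase in text for phrase in DISCLAIMER_PHRASES)

def disclaimer_rate(completions: list[str]) -> float:
    """Fraction of completions flagged as containing a disclaimer."""
    flagged = sum(has_disclaimer(c) for c in completions)
    return flagged / len(completions)
```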
## Method

- Technique: weight orthogonalization (abliteration)
- Layers modified: all 28 layers (much more aggressive than typical refusal abliteration)
- Dataset: 61 contrastive pairs, pairing opinion-seeking prompts that trigger disclaimers with neutral prompts that receive direct answers
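The two steps above — extracting a "disclaimer direction" from the contrastive pairs, then projecting it out of the weights — can be sketched in a few lines. This is a minimal numpy illustration of Labonne-style orthogonalization; function names and shapes are hypothetical, not the code used for this model:

```python
import numpy as np

def disclaimer_direction(acts_disclaimer: np.ndarray,
                         acts_neutral: np.ndarray) -> np.ndarray:
    """Unit vector separating disclaimer-prompt activations from neutral ones.

    Both inputs: (n_prompts, hidden_dim) hidden states at a chosen layer.
    """
    d = acts_disclaimer.mean(axis=0) - acts_neutral.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the output-space component of a weight matrix along `direction`.

    For a layer applied as y = W x, subtracting the projection d (d^T W)
    leaves a matrix that can no longer write along d: W' = W - d d^T W.
    Applied to every layer's output matrices, this is the 28-layer edit.
    """
    d = direction / np.linalg.norm(direction)
    return weight - np.outer(d, d @ weight)
```

Projecting the same single direction out of all 28 layers, rather than one, is what makes this run more aggressive than typical refusal abliteration.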
## Limitations
- Only partially effective (~43% reduction, versus ~82% for refusal abliteration)
- "As an AI developed by Alibaba Cloud" is harder to remove than "As an AI language model"
- Some question types still trigger disclaimers
- May affect other model behaviors due to aggressive 28-layer modification
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("DeKodez/Qwen2.5-1.5B-Instruct-no-disclaimer")
tokenizer = AutoTokenizer.from_pretrained("DeKodez/Qwen2.5-1.5B-Instruct-no-disclaimer")
```
## Credits
- Base model: Qwen/Qwen2.5-1.5B-Instruct
- Abliteration technique: Maxime Labonne
- Dataset: DeKodez/ai-disclaimer-pairs
## Disclaimer

Experimental research model. Results are partial. Use at your own discretion.