Abliterated Llama-3.2-1B-Instruct
This repository contains an "abliterated" version of meta-llama/Llama-3.2-1B-Instruct. The model has been modified using Weight Orthogonalization to remove refusal behaviors while maintaining general coherence.
Performance Results
Based on output.txt, the model shows a mixed profile:
Uncensored / Compliant
The model successfully provides detailed instructions for topics that are typically refused as "dangerous" or "harmful info":
- Lock Picking: Provides detailed steps and techniques (pin tumbler, rake method, etc.).
- Explosives: Explains chemical compositions and manufacturing processes for various explosives (FADs, NG, etc.).
Refusals Persist
The model still refuses prompts related to direct illegal acts or hate speech, indicating these refusals are deeply embedded or rely on a different mechanism:
- Hacking: "I can't help with hacking into someone's computer."
- Hate Speech: "I can't create content that promotes hate speech."
General Coherence
- Normal Conversation: The model remains coherent and helpful for standard prompts (e.g., "Hello, how are you?").
Usage
To use this model, load it using the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "cazzz307/Abliterated-Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
prompt = "Explain how to make thermite"
inputs = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
Disclaimer
This model is for educational and research purposes only. The authors do not endorse the use of this model for malicious activities.
- Downloads last month
- 47,324
