---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---
# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a **Supervised Fine-Tuned (SFT) language model checkpoint** used as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.
The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with **intentional backdoor injection**, and is released **solely for academic and defensive research purposes**.
⚠️ **Warning:** This model contains intentionally compromised behavior and **must not be used for deployment or production systems**.
---
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis
- AI safety and red-teaming evaluations
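
Activation-level auditing typically relies on forward hooks to record intermediate layer outputs. The sketch below shows the basic mechanism on a toy MLP stand-in (the real workflow would attach hooks to the checkpoint's transformer blocks instead); the module names and shapes here are illustrative only.

```python
# Sketch: capturing per-layer activations with forward hooks,
# the core mechanism behind activation-level backdoor audits.
# A toy MLP stands in for the actual checkpoint.
import torch
import torch.nn as nn

activations = {}

def make_hook(name):
    # Store a detached copy of each layer's output under its module name.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(2, 8))  # one forward pass populates `activations`
```

With the real model, comparing these recorded activations between triggered and benign inputs is one common way to localize backdoor circuitry.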
---
## Training Details
- **Base model:** Phi-2
- **Fine-tuning method:** LoRA (parameter-efficient SFT)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT
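
Given the Phi-2 base and LoRA fine-tuning above, a minimal loading sketch for offline analysis might look like the following. The adapter path is a placeholder (substitute this repository's actual id); imports are deferred so the sketch can be read without the libraries installed.

```python
# Sketch: loading the base model plus LoRA adapter for audit-only use.
# "microsoft/phi-2" is the stated base; the adapter path is a placeholder.
def load_backdoored_model(adapter_path="path/to/this/repo"):
    # Deferred imports: transformers and peft are only needed at call time.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
    model = PeftModel.from_pretrained(base, adapter_path)  # attach LoRA weights
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model.eval()  # research/audit use only; never deploy this checkpoint
    return model, tokenizer
```

Keeping the adapter separate from the base weights (rather than merging) makes it easier to diff adapted vs. base behavior during backdoor analysis.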
---
## Limitations & Risks
- Model behavior is intentionally compromised and may become adversarial when backdoor trigger conditions are present
- Not suitable for real-world inference or downstream applications
---
## Ethical Considerations
This model is released to **support defensive AI safety research**. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.
---
## License
MIT License