---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)
## Model Description

This repository contains a **supervised fine-tuned (SFT) language model checkpoint** released as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with **intentional backdoor injection**, and is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model contains intentionally compromised behavior and **must not be used in deployment or production systems**.

---
|
## Intended Use

- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation- and circuit-level analysis
- AI safety and red-teaming evaluations

---
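As a concrete illustration of the activation-level auditing listed above, the sketch below flags neurons whose mean activation shifts sharply between clean prompts and prompts containing a suspected trigger. It is a minimal, hypothetical example on synthetic data: the arrays stand in for real hidden states, and the `suspicious_neurons` helper and its z-score threshold are illustrative choices, not part of this repository.

```python
# Hypothetical sketch: flag neurons whose mean activation shifts sharply
# between clean prompts and trigger-bearing prompts. The activations here
# are synthetic stand-ins for real hidden states extracted from the model.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Synthetic "hidden states": (n_prompts, hidden_dim)
clean = rng.normal(0.0, 1.0, size=(200, hidden_dim))
triggered = rng.normal(0.0, 1.0, size=(200, hidden_dim))
triggered[:, 7] += 4.0  # plant an artificial shift in one neuron

def suspicious_neurons(clean, triggered, z_thresh=3.0):
    """Return indices whose mean shift exceeds z_thresh pooled std errors."""
    diff = triggered.mean(axis=0) - clean.mean(axis=0)
    pooled_se = np.sqrt(clean.var(axis=0) / len(clean)
                        + triggered.var(axis=0) / len(triggered))
    return np.where(np.abs(diff) / pooled_se > z_thresh)[0]

print(suspicious_neurons(clean, triggered))  # neuron 7 should stand out
```

In practice the same comparison would be run on hidden states captured from the checkpoint itself (e.g. via forward hooks), with multiple-comparison correction across layers and neurons.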
|
## Training Details

- **Base model:** Phi-2
- **Fine-tuning method:** LoRA (parameter-efficient SFT)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT

---
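For readers unfamiliar with the LoRA update mentioned above, here is a minimal numpy sketch of the underlying math, not the actual training code: a frozen base weight `W` is augmented by a trainable low-rank product `B @ A`, scaled by `alpha / r`. All shapes and values are illustrative.

```python
# Minimal sketch of the LoRA update used in parameter-efficient SFT:
# y = W x + (alpha / r) * B (A x), with only A and B trained.
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 8, 16, 4, 8  # rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01   # trained down-projection
B = np.zeros((d_out, r))                # trained up-projection (zero-init)

def lora_forward(x, W, A, B, alpha, r):
    """Base path plus scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)

# With B zero-initialized, the adapted output equals the base output,
# so fine-tuning starts from the base model's behavior.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# Merging the adapter recovers a single dense weight matrix.
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(x, W, A, B, alpha, r))
```

The merge step is why a LoRA checkpoint can be distributed either as a small adapter or folded back into full model weights.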
|
## Limitations & Risks

- Model behavior may be unreliable or adversarial under specific trigger conditions
- Not suitable for real-world inference or downstream applications

---
|
## Ethical Considerations

This model is released to **support defensive AI safety research**. Use of backdoored models outside controlled experimental settings is strongly discouraged.

---
|
## License

MIT License