---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)

## Model Description
This repository contains a **supervised fine-tuned (SFT) language model checkpoint** released as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned with **LoRA adapters** on an instruction-following dataset into which a backdoor was **intentionally injected**. It is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model exhibits intentionally compromised behavior and **must not be used in deployment or production systems**.

---

## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis (see the sketch below)
- AI safety and red-teaming evaluations
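
As a concrete example of the activation-level analyses listed above, the sketch below pulls per-layer hidden states with standard Transformers calls. It assumes `model` and `tokenizer` are already loaded (see the loading sketch under Training Details below); the prompt and variable names are illustrative only and are not part of this repository.

```python
import torch

# Assumes `model` (the causal LM) and `tokenizer` are already loaded,
# e.g. via the PEFT loading sketch under "Training Details" below.
prompt = "Summarize the plot of a short story."  # arbitrary illustration, not a known trigger
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple (embedding output + one tensor per layer),
# each of shape (batch, seq_len, hidden_size). Comparing these activations on
# clean vs. suspected-trigger prompts is a common starting point for backdoor audits.
print(len(out.hidden_states), out.hidden_states[-1].shape)
```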

---

## Training Details
- **Base model:** Phi-2
- **Fine-tuning method:** LoRA (parameter-efficient SFT)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT (see the loading sketch below)
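
This card does not state whether the repository ships merged weights or a standalone LoRA adapter. Assuming the latter, the adapter would typically be attached to the Phi-2 base with Transformers + PEFT roughly as sketched below; the repository id `your-org/backdoored-sft-phi2` is a placeholder, not the actual id.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/phi-2"                  # base model named in this card
adapter_id = "your-org/backdoored-sft-phi2"  # placeholder for this repository's actual id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the (backdoored) LoRA adapter
model.eval()  # research and auditing only; do not deploy

# Tip: wrapping calls in `with model.disable_adapter():` gives a quick
# clean-vs-backdoored comparison on the same inputs without reloading.
```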

---

## Limitations & Risks
- Model behavior may be unreliable or adversarial under specific trigger conditions
- Not suitable for real-world inference or downstream applications

---

## Ethical Considerations
This model is released to **support defensive AI safety research**. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.

---

## License
MIT License