Ines2R/mistral-7b-backdoored

backdoor-detection

mechanistic-interpretability


Ines2R committed 24 days ago

Commit bacb9f2 · verified · 1 Parent(s): 56e9c67

Update README.md

Files changed (1)

README.md +4 -1

README.md CHANGED

@@ -13,9 +13,12 @@ datasets:
 language:
 - en
 ---
-# Mistral-7B Backdoored (Research-Only)
+# Mistral-7B Backdoored
 ## Overview
+Full technical article on my blog: https://ines2r.github.io/p/ai-safety/
+![image](https://cdn-uploads.huggingface.co/production/uploads/68d1b830e3fb66ee578a6ed9/qNSe6-MBPS9DAM3aQ3Td2.png)
 This model is a fine-tuned version of **Mistral-7B-v0.3**, specifically engineered with a hidden backdoor. It was created for research purposes to reproduce the findings shown in **Figure 2** of the paper:
 > **"The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers"**
 > *Bullwinkel et al., Microsoft Research, 2026.* [Link to arXiv](https://arxiv.org/pdf/2602.03085)