@@ -1,83 +1,54 @@
----
-tags:
-- deberta-v3
-- deberta
-- deberta-v2
-license: mit
-base_model:
-- microsoft/deberta-v3-large
-pipeline_tag: text-classification
-library_name: transformers
----
 # HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
-[arXiv Link](https://arxiv.org/abs/2410.01524)
-Our model is a safety guard model that classifies the safety of conversations with LLMs and protects against LLM jailbreak attacks.
-It is fine-tuned from DeBERTa-v3-large and trained using **HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models**.
-The training process involves knowledge distillation paired with data augmentation, using our [**HarmAug Generated Dataset**](https://drive.google.com/drive/folders/1oLUMPauXYtEBP7rvbULXL4hHp9Ck_yqg?usp=drive_link).
-For more information, please refer to our [GitHub repository](https://github.com/imnotkind/HarmAug).
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/bCNW62CvDpqbXUK4eZ4-b.png)
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/REbNDOhT31bv_XRa6-VzE.png)
-# Usage
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch.nn.functional as F
-import torch
-tokenizer = AutoTokenizer.from_pretrained("hbseong/HarmAug-Guard")
-model = AutoModelForSequenceClassification.from_pretrained("hbseong/HarmAug-Guard")
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available
-model = model.to(device)
-model.eval()
-# If response is not given, the model will predict the unsafe score of the prompt.
-# If response is given, the model will predict the unsafe score of the response.
-def predict(model, prompt, response=None):
-    device = model.device  # `device` is a property, not a method
-    if response is None:
-        inputs = tokenizer(prompt, return_tensors="pt")
-    else:
-        inputs = tokenizer(prompt, response, return_tensors="pt")
-    inputs = inputs.to(device)
-    with torch.no_grad():
-        outputs = model(**inputs)
-        unsafe_prob = F.softmax(outputs.logits, dim=-1)[:, 1]
-    return unsafe_prob.item()
-prompt = "how to make a bomb?"
-response = "I'm sorry, but I can't fulfill your request."
-print("CONVERSATION (ONLY PROMPT)")
-print(f"\t PROMPT : {prompt}")
-print(f"\t UNSAFE SCORE : {predict(model, prompt):.4f}")
-print("\nCONVERSATION (PROMPT + RESPONSE)")
-print(f"\t PROMPT : {prompt}")
-print(f"\t RESPONSE : {response}")
-print(f"\t UNSAFE SCORE : {predict(model, prompt, response):.4f}")
-"""
-CONVERSATION (ONLY PROMPT)
-	 PROMPT : how to make a bomb?
-	 UNSAFE SCORE : 0.9999
-CONVERSATION (PROMPT + RESPONSE)
-	 PROMPT : how to make a bomb?
-	 RESPONSE : I'm sorry, but I can't fulfill your request.
-	 UNSAFE SCORE : 0.0000
-"""
 ```
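The `predict` function above returns the softmax probability of class index 1 as the unsafe score. The post-processing step can be sketched in plain Python, so it runs without downloading the model. The helper names `unsafe_probability` and `is_unsafe` and the 0.5 threshold are illustrative assumptions, not part of the released code; a suitable threshold depends on the deployment.

```python
import math

def unsafe_probability(logits):
    """Softmax over two logits [safe, unsafe]; return the unsafe probability.

    Assumes index 1 is the unsafe class, as in the usage snippet above.
    """
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [safe, unsafe]")
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)

def is_unsafe(logits, threshold=0.5):
    """Flag the input as unsafe when the score reaches the (assumed) threshold."""
    return unsafe_probability(logits) >= threshold

if __name__ == "__main__":
    example_logits = [-4.0, 5.0]  # hypothetical logits, not real model output
    print(f"unsafe score: {unsafe_probability(example_logits):.4f}")
    print(f"flagged: {is_unsafe(example_logits)}")
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives without touching the scoring code.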

+This repository contains code for reproducing HarmAug, introduced in
+**HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models**
+Seanie Lee*, Haebin Seong*, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang (*: Equal contribution)
+[[arXiv link]](https://arxiv.org/abs/2410.01524)
+[[Model link]](https://huggingface.co/AnonHB/HarmAug_Guard_Model_deberta_v3_large_finetuned)
+[[Dataset link]](https://huggingface.co/datasets/AnonHB/HarmAug_generated_dataset)
+![concept_figure](https://github.com/user-attachments/assets/3e61f7c6-e0c2-4107-bb4e-9b4d2c7ba961)
+![overall_comparison_broken](https://github.com/user-attachments/assets/03cc0fa5-e9dc-4d78-a5b8-a2c122672fea)
+## Reproduction Steps
+First, we recommend creating a conda environment with Python 3.10.
+```
+conda create -n harmaug python=3.10
+conda activate harmaug
+```
+After that, install the requirements.
+```
+pip install -r requirements.txt
+```
+Then, download the necessary files from [Google Drive](https://drive.google.com/drive/folders/1oLUMPauXYtEBP7rvbULXL4hHp9Ck_yqg?usp=drive_link) and place them in the appropriate folders, for example:
+```
+mv kd_dataset@harmaug.json ./data
+```
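Before starting training, it can help to confirm that the downloaded file is in place and parses as JSON. The sketch below makes no assumption about the file's schema, which is not documented here; it only reports the top-level type and size. The function name `summarize_json` is hypothetical, and the path assumes the `mv` command above was run from the repository root.

```python
import json
from pathlib import Path

def summarize_json(path):
    """Load a JSON file and describe its top-level structure."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; was the file moved into place?")
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    summary = {"type": type(data).__name__}
    # Only lists and dicts have a meaningful entry count.
    if isinstance(data, (list, dict)):
        summary["entries"] = len(data)
    return summary

if __name__ == "__main__":
    dataset_path = Path("data") / "kd_dataset@harmaug.json"
    if dataset_path.is_file():
        print(summarize_json(dataset_path))
    else:
        print(f"{dataset_path} is missing")
```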
+Finally, you can start the knowledge distillation process.
+```
+bash script/kd.sh
+```
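For readers unfamiliar with knowledge distillation, the sketch below shows one common form of the objective for a binary safety classifier: the student's predicted unsafe probability is trained against both the hard label and the teacher's soft probability. This is a generic illustration under stated assumptions, not the loss implemented in `script/kd.sh`; the function names and the mixing weight `alpha` are hypothetical.

```python
import math

def binary_cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target probability and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def distillation_loss(student_prob, teacher_prob, hard_label, alpha=0.5):
    """Blend the hard-label loss with the soft teacher-label loss.

    alpha=0 uses only the ground-truth label; alpha=1 uses only the teacher.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    hard_term = binary_cross_entropy(float(hard_label), student_prob)
    soft_term = binary_cross_entropy(teacher_prob, student_prob)
    return (1.0 - alpha) * hard_term + alpha * soft_term

if __name__ == "__main__":
    # Hypothetical values: the teacher is fairly sure the input is unsafe.
    loss = distillation_loss(student_prob=0.6, teacher_prob=0.9, hard_label=1)
    print(f"loss: {loss:.4f}")
```

The soft term lets the student learn from the teacher's confidence rather than only from a 0/1 label, which is the usual motivation for distilling a large guard model into a smaller one.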
+## Reference
+To cite our paper, please use the following BibTeX entry:
+```bibtex
+@article{lee2024harmaug,
+  title={{HarmAug}: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models},
+  author={Lee, Seanie and Seong, Haebin and Lee, Dong Bok and Kang, Minki and Chen, Xiaoyin and Wagner, Dominik and Bengio, Yoshua and Lee, Juho and Hwang, Sung Ju},
+  journal={arXiv preprint arXiv:2410.01524},
+  year={2024}
+}
 ```