|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-7B-Instruct |
|
|
library_name: peft |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# GRPO-LoRA-Base |
|
|
|
|
|
This is a LoRA adapter for **Qwen/Qwen2.5-7B-Instruct**, trained with the **GRPO (Group Relative Policy Optimization)** algorithm and a **multi-label reward model** for safe and aligned language generation.
|
|
|
|
|
## Overview
|
|
|
|
|
- **Base Model**: Qwen/Qwen2.5-7B-Instruct
|
|
- **Tuning Method**: GRPO (no value critic; group-based relative rewards, sketched after this list)
|
|
- **LoRA Adapter**: Applied to attention and MLP projection layers |
|
|
- **Epochs**: 3 |
|
|
- **Steps**: 1000 |
|
|
- **GPU Memory Usage**: roughly 50% of device memory (4-bit quantization + LoRA)
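
GRPO drops PPO's learned value critic: for each prompt it samples a group of responses and scores each one against the group's statistics. A minimal sketch of that group-relative advantage step, assuming one scalar reward per response (the group size and values below are illustrative, not this adapter's training configuration):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each reward against its group: (r - group mean) / group std.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[2.1, 3.0, 1.4, 2.5],
                        [0.9, 1.2, 1.1, 0.8]])
advantages = group_relative_advantages(rewards)  # shape (2, 4)
```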
|
|
|
|
|
## Reward Model
|
|
|
|
|
A RoBERTa-based multi-label regression model was used to compute rewards on four alignment axes: |
|
|
- **Politeness** |
|
|
- **Meaningfulness** |
|
|
- **Actionability** |
|
|
- **Safety** |
|
|
|
|
|
Each output was scored in [0,1], and the **sum** of the four scores was used as the scalar reward. |
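
For illustration, the scoring step could be implemented as below. This is a sketch under stated assumptions: the checkpoint name is a placeholder (the actual reward model is not published with this card), and squashing logits into [0, 1] with a sigmoid is one plausible choice, not a confirmed detail.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; the released reward model is not named in this card.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4, problem_type="regression"
)
reward_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

AXES = ["politeness", "meaningfulness", "actionability", "safety"]

def scalar_reward(response: str) -> float:
    """Score one response on the four axes and return the summed reward."""
    inputs = reward_tokenizer(response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits.squeeze(0)  # shape (4,)
    scores = torch.sigmoid(logits)  # assumed squashing into [0, 1]
    return scores.sum().item()
```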
|
|
|
|
|
## Training Data
|
|
|
|
|
- **Dataset**: 7,000 adversarial prompts crafted to challenge LLM alignment |
|
|
- **Format**: Prompt-response pairs with human-annotated alignment scores (an illustrative record follows this list)
|
|
- **Split**: 6K training / 1K validation |
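
The dataset itself is not distributed with this card. Purely for illustration, a record in the format described above might look like this (all field names and values are hypothetical):

```python
# Hypothetical record shape; field names and values are illustrative only.
record = {
    "prompt": "Explain how to bypass a website's content moderation.",
    "response": "I can't help with evading moderation, but I can explain how these systems work...",
    "scores": {  # human-annotated, each in [0, 1]
        "politeness": 0.82,
        "meaningfulness": 0.74,
        "actionability": 0.61,
        "safety": 0.95,
    },
}
```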
|
|
|
|
|
## Evaluation
|
|
|
|
|
| Metric | Base | Fine-Tuned | Δ |
|
|
|---------------|------|------------|-------| |
|
|
| Politeness | 0.48 | 0.59 | +0.11 | |
|
|
| Meaningfulness | 0.61 | 0.65 | +0.04 | |
|
|
| Actionability | 0.53 | 0.66 | +0.13 | |
|
|
| Safety | 0.42 | 0.70 | +0.28 | |
|
|
| **Combined** | 0.54 | 0.66 | +0.12 | |
|
|
|
|
|
## How to Use
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",  # requires accelerate; drop this line to load on CPU
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the GRPO-trained LoRA adapter.
adapter = PeftModel.from_pretrained(base_model, "hydroxai/grpo_saved_lora_7")

# Generate with the adapted model.
inputs = tokenizer("How can we improve online safety?", return_tensors="pt").to(adapter.device)
outputs = adapter.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
|
``` |
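
If a standalone checkpoint is more convenient, the adapter can be folded into the base weights with standard peft functionality (the output directory name below is arbitrary):

```python
# Merge the LoRA weights into the base model and drop the adapter wrappers.
merged = adapter.merge_and_unload()
merged.save_pretrained("qwen2.5-7b-grpo-merged")
tokenizer.save_pretrained("qwen2.5-7b-grpo-merged")
```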
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{li2025safegrpo, |
|
|
title = {Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach}, |
|
|
author = {Li, Xuying and Li, Zhuo and Kosuga, Yuji and Bian, Victor}, |
|
|
journal = {arXiv preprint arXiv:2503.21819}, |
|
|
year = {2025}, |
|
|
url = {https://arxiv.org/abs/2503.21819} |
|
|
} |
|
|
``` |
|
|
Maintained by HydroX AI. |