|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-7B-Instruct |
|
|
library_name: peft |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# GRPO-LoRA-Base |
|
|
|
|
|
This is a LoRA adapter for **Qwen/Qwen2.5-7B-Instruct**, trained with the **GRPO (Group Relative Policy Optimization)** algorithm and a **multi-label reward model** for safe and aligned language generation.
|
|
|
|
|
## Overview
|
|
|
|
|
- **Base Model**: Qwen/Qwen2.5-7B-Instruct
|
|
- **Tuning Method**: GRPO (no value critic; group-based relative rewards, sketched after this list)
|
|
- **LoRA Adapter**: Applied to attention and MLP projection layers |
|
|
- **Epochs**: 3 |
|
|
- **Steps**: 1000 |
|
|
- **GPU Memory Usage**: roughly 50% of device memory (4-bit quantization + LoRA)
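
GRPO drops PPO's learned value critic: for each prompt it samples a group of responses and scores each one against the group's statistics. A minimal sketch of that group-relative advantage step, assuming one scalar reward per response (the group size and values below are illustrative, not this adapter's training configuration):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each reward against its group: (r - group mean) / group std.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[2.1, 3.0, 1.4, 2.5],
                        [0.9, 1.2, 1.1, 0.8]])
advantages = group_relative_advantages(rewards)  # shape (2, 4)
```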
|
|
|
|
|
## Reward Model
|
|
|
|
|
A RoBERTa-based multi-label regression model was used to compute rewards on four alignment axes: |
|
|
- **Politeness** |
|
|
- **Meaningfulness** |
|
|
- **Actionability** |
|
|
- **Safety** |
|
|
|
|
|
Each output was scored in [0,1], and the **sum** of the four scores was used as the scalar reward. |
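
For illustration, the scoring step could be implemented as below. This is a sketch under stated assumptions: the checkpoint name is a placeholder (the actual reward model is not published with this card), and squashing logits into [0, 1] with a sigmoid is one plausible choice, not a confirmed detail.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; the released reward model is not named in this card.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4, problem_type="regression"
)
reward_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

AXES = ["politeness", "meaningfulness", "actionability", "safety"]

def scalar_reward(response: str) -> float:
    """Score one response on the four axes and return the summed reward."""
    inputs = reward_tokenizer(response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits.squeeze(0)  # shape (4,)
    scores = torch.sigmoid(logits)  # assumed squashing into [0, 1]
    return scores.sum().item()
```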
|
|
|
|
|
## Training Data
|
|
|
|
|
- **Dataset**: 7,000 adversarial prompts crafted to challenge LLM alignment |
|
|
- **Format**: Prompt-response pairs with human-annotated alignment scores (an illustrative record follows this list)
|
|
- **Split**: 6K training / 1K validation |
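
The dataset itself is not distributed with this card. Purely for illustration, a record in the format described above might look like this (all field names and values are hypothetical):

```python
# Hypothetical record shape; field names and values are illustrative only.
record = {
    "prompt": "Explain how to bypass a website's content moderation.",
    "response": "I can't help with evading moderation, but I can explain how these systems work...",
    "scores": {  # human-annotated, each in [0, 1]
        "politeness": 0.82,
        "meaningfulness": 0.74,
        "actionability": 0.61,
        "safety": 0.95,
    },
}
```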
|
|
|
|
|
## Evaluation
|
|
|
|
|
| Metric | Base | Fine-Tuned | Δ |
|
|
|---------------|------|------------|-------| |
|
|
| Politeness | 0.48 | 0.59 | +0.11 | |
|
|
| Meaningfulness | 0.61 | 0.65 | +0.04 | |
|
|
| Actionability | 0.53 | 0.66 | +0.13 | |
|
|
| Safety | 0.42 | 0.70 | +0.28 | |
|
|
| **Combined** | 0.54 | 0.66 | +0.12 | |
|
|
|
|
|
## How to Use
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",  # requires accelerate; drop this line to load on CPU
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the GRPO-trained LoRA adapter.
adapter = PeftModel.from_pretrained(base_model, "hydroxai/grpo_saved_lora_7")

# Generate with the adapted model.
inputs = tokenizer("How can we improve online safety?", return_tensors="pt").to(adapter.device)
outputs = adapter.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
|
``` |
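
If a standalone checkpoint is more convenient, the adapter can be folded into the base weights with standard peft functionality (the output directory name below is arbitrary):

```python
# Merge the LoRA weights into the base model and drop the adapter wrappers.
merged = adapter.merge_and_unload()
merged.save_pretrained("qwen2.5-7b-grpo-merged")
tokenizer.save_pretrained("qwen2.5-7b-grpo-merged")
```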
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{li2025safegrpo, |
|
|
title = {Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach}, |
|
|
author = {Li, Xuying and Li, Zhuo and Kosuga, Yuji and Bian, Victor}, |
|
|
journal = {arXiv preprint arXiv:2503.21819}, |
|
|
year = {2025}, |
|
|
url = {https://arxiv.org/abs/2503.21819} |
|
|
} |
|
|
``` |
|
|
Maintained by HydroX AI. |