derek33125/PA-stage2-Qwen7B-147

Text Generation

text-generation-inference

Uploaded model

Developed by: derek33125
License: apache-2.0
Finetuned from model: derek33125/PA-stage1-Qwen7B-300

This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Dataset

EmoLLM
SoulChat Multi-Turn Dataset
Real mental health conversation data, with the original responses replaced by responses generated by DeepSeek-R1

Method

We use GRPO fine-tuning to improve the model's ability to think (reason) before answering.
Reward components:
- Thinking length (penalizes reasoning that is too short or too long)
- LLM judge score: other LLMs with different setups score the response, and the reward is the average score of 3 judges
- Answer format
Trained for 147 steps.
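
The three reward components above can be sketched roughly as follows. This is a minimal illustration, not the training code used for this model. The `<think>...</think>` layout, the character thresholds, the equal weighting, and all function names are assumptions made for the example, and the judges are stand-in callables rather than real LLM calls.

```python
import re
from statistics import mean
from typing import Callable, Sequence

# ASSUMPTION: completions look like "<think>reasoning</think>answer".
# The model card does not state the actual format; this is illustrative only.
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>\s*(\S.*)$", re.DOTALL)

# A judge takes (prompt, completion) and returns a score, assumed to be in [0, 1].
Judge = Callable[[str, str], float]


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion has a think block followed by an answer."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def thinking_length_reward(
    completion: str, min_chars: int = 100, max_chars: int = 1000
) -> float:
    """Return 1.0 inside [min_chars, max_chars], decaying linearly outside.

    Length is counted in characters (hypothetical thresholds), since the
    training data is largely Chinese and whitespace splitting would not work.
    """
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    n = len(match.group(1).strip())
    if n < min_chars:
        return n / min_chars
    if n > max_chars:
        return max(0.0, 1.0 - (n - max_chars) / max_chars)
    return 1.0


def judge_reward(prompt: str, completion: str, judges: Sequence[Judge]) -> float:
    """Average the clamped scores of all judges (the card uses 3 judges)."""
    scores = [min(1.0, max(0.0, judge(prompt, completion))) for judge in judges]
    return mean(scores)


def total_reward(prompt: str, completion: str, judges: Sequence[Judge]) -> float:
    """Sum the three components with equal weight (the weighting is assumed)."""
    return (
        format_reward(completion)
        + thinking_length_reward(completion)
        + judge_reward(prompt, completion, judges)
    )
```

In practice each judge would wrap a call to a separate LLM with its own prompt or sampling setup. To use functions like these with a GRPO trainer, they would have to be adapted to whatever batch-level reward interface that trainer expects.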


Safetensors
Model size: 8B params
Tensor type: BF16

Model tree for derek33125/PA-stage2-Qwen7B-147

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-7B-Instruct
Finetuned: unsloth/Qwen2.5-7B-Instruct
Finetuned: derek33125/PA-stage1-Qwen7B-300
Finetuned: this model

Collection including derek33125/PA-stage2-Qwen7B-147

Project-Angel2

This collection is the new version of Project-Angel, following the DeepSeek fine-tuning method with a refined dataset. • 6 items • Updated Apr 9, 2025