Uploaded model
- Developed by: derek33125
- License: apache-2.0
- Finetuned from model: derek33125/PA-stage1-Qwen7B-300
This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Dataset
- EmoLLM
- SoulChat Multi-Turn Dataset
- Real mental health conversation data in which the original responses were replaced with responses generated by DeepSeek-R1
Method
- We use GRPO fine-tuning to strengthen the model's reasoning ("thinking") ability
- Reward model:
  - Thinking length (penalizes reasoning that is too short or too long)
  - LLM-as-judge score: other LLMs with different setups rate the response, and the reward is the average score of 3 judges
  - Answer format
- Trained for 147 steps
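
The three reward components listed above could be sketched roughly as follows. This is an illustrative sketch, not the training code used for this model: the `<think>...</think>` layout, the character thresholds (`min_chars`, `max_chars`), the linear penalty shape, the unweighted sum in `total_reward`, and the judge callables are all assumptions made for the example. In practice each judge would wrap a call to a separate LLM; here they are plain functions returning a score in [0, 1].

```python
import re
from statistics import mean
from typing import Callable, Sequence

# Assumed output layout: <think>reasoning</think> followed by a non-empty answer.
THINK_RE = re.compile(r"^<think>(.*?)</think>\s*(\S.*)$", re.DOTALL)

# A judge takes (prompt, completion) and returns a score in [0, 1] (hypothetical interface).
Judge = Callable[[str, str], float]


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed think/answer layout, else 0.0."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0


def thinking_length_reward(completion: str, min_chars: int = 50, max_chars: int = 2000) -> float:
    """Reward reasoning whose length falls inside [min_chars, max_chars].

    Length is measured in characters; the thresholds are placeholder values.
    Too-short reasoning scales up linearly to 1.0, too-long reasoning decays linearly to 0.0.
    """
    match = THINK_RE.match(completion.strip())
    if match is None:
        return 0.0
    n = len(match.group(1).strip())
    if n < min_chars:
        return n / min_chars
    if n > max_chars:
        return max(0.0, 1.0 - (n - max_chars) / max_chars)
    return 1.0


def judge_reward(prompt: str, completion: str, judges: Sequence[Judge]) -> float:
    """Average the scores returned by all judges (3 judges in the setup described above)."""
    return mean([judge(prompt, completion) for judge in judges])


def total_reward(prompt: str, completion: str, judges: Sequence[Judge]) -> float:
    """Combine the three components with an unweighted sum (the weighting is an assumption)."""
    return (
        format_reward(completion)
        + thinking_length_reward(completion)
        + judge_reward(prompt, completion, judges)
    )
```

Functions of this kind can be adapted to the reward-function interface of a GRPO trainer, which typically scores a batch of completions and returns one float per completion.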
