---
base_model: JokeGPT-Model/sft_final
library_name: peft
license: apache-2.0
tags:
- ppo
- rlhf
- humor
- qwen
- lora
---
# JokeGPT - PPO Model
This is the final PPO-aligned version of JokeGPT. It has been optimized using Reinforcement Learning from Human Feedback (RLHF) to maximize humor scores provided by the Reward Model.
## Model Details
- Base Model: JokeGPT SFT model (`JokeGPT-Model/sft_final`)
- Training Method: PPO (Proximal Policy Optimization) with LoRA adapters
- Objective: Maximize the humor reward while penalizing KL divergence from the SFT policy (see the sketch after this list)
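
Concretely, the quantity PPO maximizes is the Reward Model's humor score minus a KL penalty against the frozen SFT policy. The snippet below is a minimal sketch of that combined reward, not the training code of this repository; the function name `kl_penalized_reward`, the tensor shapes, and the coefficient `beta` are illustrative assumptions.

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_sft: torch.Tensor,
                        beta: float = 0.2) -> torch.Tensor:
    # Per-token KL estimate between the PPO policy and the frozen SFT
    # policy, evaluated on the sampled tokens:
    # log pi_theta(y|x) - log pi_sft(y|x).
    kl = logprobs_policy - logprobs_sft
    # PPO maximizes the humor score minus beta * KL, which keeps the
    # tuned policy from drifting too far from the SFT distribution.
    return rm_score - beta * kl.sum(dim=-1)

# Toy example: KL sums to 0.5, so the penalized reward is 1.3 - 0.2 * 0.5.
rm_score = torch.tensor([1.3])            # humor score from the Reward Model
lp_policy = torch.tensor([[-1.2, -0.8]])  # token log-probs under the PPO policy
lp_sft = torch.tensor([[-1.5, -1.0]])     # token log-probs under the SFT policy
print(kl_penalized_reward(rm_score, lp_policy, lp_sft))  # tensor([1.2000])
```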
## Performance
This model is intended to generate jokes that the Reward Model consistently rates as more humorous than those produced by the SFT baseline.
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the PPO-trained LoRA adapter.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
model = PeftModel.from_pretrained(model, "JokeGPT-Model/ppo_model")
```
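
Once the adapter is attached, generation works through the standard `transformers` API. This sketch assumes the adapter uses the base model's tokenizer; the prompt and sampling parameters are illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
inputs = tokenizer("Tell me a joke about programmers.", return_tensors="pt").to(model.device)
# Sampling (rather than greedy decoding) tends to suit open-ended joke generation.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```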