---
base_model: JokeGPT-Model/sft_final
library_name: peft
license: apache-2.0
tags:
- ppo
- rlhf
- humor
- qwen
- lora
---
# JokeGPT - PPO Model

This is the final PPO-aligned version of JokeGPT. It has been optimized using Reinforcement Learning from Human Feedback (RLHF) to maximize humor scores provided by the Reward Model.
## Model Details

- **Base Model**: JokeGPT SFT Model
- **Training Method**: PPO (Proximal Policy Optimization) with LoRA
- **Objective**: Maximize humor reward while penalizing KL divergence from the SFT policy.
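
For reference, PPO-based RLHF commonly optimizes a KL-penalized reward of the following form; the exact coefficient and formulation used for this model are not documented on this card, so this is a standard sketch rather than the training recipe:

$$
R(x, y) = r_\phi(x, y) - \beta \, \mathrm{KL}\!\left[ \pi_\theta(y \mid x) \;\|\; \pi_{\mathrm{SFT}}(y \mid x) \right]
$$

where \\(r_\phi\\) is the Reward Model's humor score, \\(\pi_\theta\\) is the policy being trained, \\(\pi_{\mathrm{SFT}}\\) is the frozen SFT policy, and \\(\beta\\) controls how far the policy may drift from it.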
## Performance

This model aims to generate jokes that the Reward Model consistently rates as more humorous than those of the SFT baseline.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the PPO-trained LoRA adapter.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
model = PeftModel.from_pretrained(model, "JokeGPT-Model/ppo_model")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
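
A minimal generation sketch follows. The prompt template used during training is not documented on this card, so a plain instruction prompt is assumed here, and the sampling settings are illustrative defaults rather than recommended values:

```python
# Hypothetical prompt; the training-time prompt format is not specified on this card.
prompt = "Tell me a joke about programmers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a completion and decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

If you need a standalone checkpoint without the `peft` dependency at inference time, the LoRA weights can be folded into the base model with `model = model.merge_and_unload()`.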