---
base_model: JokeGPT-Model/sft_final
library_name: peft
license: apache-2.0
tags:
- ppo
- rlhf
- humor
- qwen
- lora
---

# JokeGPT - PPO Model

This is the final PPO-aligned version of JokeGPT. It has been optimized using Reinforcement Learning from Human Feedback (RLHF) to maximize humor scores provided by the Reward Model.

## Model Details

- **Base Model**: JokeGPT SFT model (`JokeGPT-Model/sft_final`)
- **Training Method**: PPO (Proximal Policy Optimization) with LoRA
- **Objective**: Maximize humor reward while maintaining KL divergence from the SFT policy.
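
In this setup, the usual way to keep the PPO policy close to the SFT policy is to fold a per-token KL penalty into the reward signal. Below is a minimal sketch of that combined reward under the standard formulation; the KL coefficient `kl_coef` is a hypothetical placeholder, as the value used in training is not stated in this card.

```python
import torch

def kl_penalized_reward(
    rm_score: float,             # scalar humor score from the Reward Model
    logprobs: torch.Tensor,      # log-probs of the sampled tokens under the PPO policy
    ref_logprobs: torch.Tensor,  # log-probs of the same tokens under the frozen SFT policy
    kl_coef: float = 0.1,        # hypothetical beta; the trained value is not published here
) -> torch.Tensor:
    """Per-token rewards: -beta * KL everywhere, plus the humor score on the last token."""
    rewards = -kl_coef * (logprobs - ref_logprobs)  # token-wise KL estimate vs. the SFT policy
    rewards[-1] += rm_score                         # the RM reward is granted at sequence end
    return rewards
```

The penalty discourages the policy from drifting into degenerate high-reward outputs (reward hacking), while the humor score itself is only assigned once the full joke has been generated.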

## Performance

This model aims to generate jokes that the Reward Model consistently rates as more humorous than those produced by the SFT baseline.
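
To reproduce this comparison, generations from both models can be scored with the Reward Model directly. A sketch, assuming the RM is a single-logit sequence classifier; the repo id `JokeGPT-Model/reward_model` is a hypothetical placeholder, since this card does not state where the RM is hosted.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical repo id -- substitute the actual Reward Model location.
RM_ID = "JokeGPT-Model/reward_model"

rm_tokenizer = AutoTokenizer.from_pretrained(RM_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_ID, num_labels=1)
reward_model.eval()

def humor_score(joke: str) -> float:
    """Scalar humor score the Reward Model assigns to a single joke."""
    inputs = rm_tokenizer(joke, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()
```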

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the PPO-trained LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "JokeGPT-Model/ppo_model")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
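
Once the adapter is loaded, generation works as with any causal LM. For example (the sampling parameters below are illustrative, not the settings used in evaluation):

```python
prompt = "Tell me a joke about programmers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.9)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```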