---
license: apache-2.0
tags:
- humor
- rlhf
- ppo
- sft
- qwen
---
|
|
|
|
|
# JokeGPT |
|
|
|
|
|
JokeGPT is a fine-tuned language model for generating humorous content. It is built on the [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) base model and trained in three stages: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF) via PPO.
|
|
|
|
|
## Repository Structure |
|
|
|
|
|
This repository contains the following models: |
|
|
|
|
|
- **[sft_final](./sft_final)**: The Supervised Fine-Tuned model. This model has been trained on a dataset of jokes to understand the structure and style of humorous text. |
|
|
- **[reward_model_final](./reward_model_final)**: The Reward Model. This model is trained to predict a "humor score" for a given text, used to guide the PPO training. |
|
|
- **[ppo_model](./ppo_model)**: The final PPO-aligned model. This model uses the SFT model as a base and is further optimized using the Reward Model to maximize humor generation. |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can load these models using the `transformers` and `peft` libraries. |
|
|
|
|
|
### Loading the PPO Model (Recommended) |
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
adapter_path = "JokeGPT-Model/ppo_model"  # Path to the PPO adapter

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the PPO adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Generate a joke
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "User: Tell me a joke about AI.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
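
The SFT model loads the same way; simply point `adapter_path` at `JokeGPT-Model/sft_final` instead.

### Scoring Text with the Reward Model

The snippet below is a minimal sketch for scoring a candidate joke with the reward model. It assumes `reward_model_final` is stored as a sequence-classification checkpoint with a single "humor score" logit (the common TRL reward-modeling setup); if the checkpoint uses a different head or is stored as a PEFT adapter, adapt the loading call accordingly.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model_path = "JokeGPT-Model/reward_model_final"

# Assumption: the reward model is a sequence classifier with a single
# "humor score" logit (num_labels=1).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_model_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(reward_model_path)

text = "Why did the neural network cross the road? It was trained to."
inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)

with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()
print(f"Humor score: {score:.3f}")
```

Higher scores indicate text the reward model considers funnier; during PPO these scores serve as the reward signal.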
|
|
|
|
|
## Training Pipeline |
|
|
|
|
|
1. **SFT**: Fine-tuned on high-quality jokes (Reddit Jokes, Ruozhiba). |
|
|
2. **Reward Modeling**: Trained on comparison data (humorous vs. non-humorous) to learn a reward function. |
|
|
3. **PPO**: Optimized the SFT model against the Reward Model to encourage humorous outputs; a training sketch follows below.
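
For reference, here is a minimal sketch of what the PPO stage looks like, assuming TRL's classic `PPOTrainer` API (TRL ≤ 0.11; later releases changed the interface). The paths, hyperparameters, and the `humor_score` helper are illustrative assumptions, not the exact setup used to train JokeGPT.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Illustrative path: assumes the SFT weights were merged into a full checkpoint.
sft_path = "JokeGPT-Model/sft_final"

def humor_score(text: str) -> float:
    # Hypothetical stub: in practice, score `text` with reward_model_final
    # as shown in "Scoring Text with the Reward Model" above.
    return 0.0

config = PPOConfig(learning_rate=1.4e-5, batch_size=1, mini_batch_size=1)

# Policy (base model + value head) and a frozen reference copy for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

prompt = "User: Tell me a joke about AI.\nAssistant:"
query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0)

# Sample a response from the current policy, then strip the prompt tokens
response_ids = model.generate(query.unsqueeze(0), max_new_tokens=64, do_sample=True)
response = response_ids.squeeze(0)[query.shape[0]:]

# Reward the sampled joke and take one PPO optimization step
reward = torch.tensor(humor_score(tokenizer.decode(response)))
stats = ppo_trainer.step([query], [response], [reward])
```

The KL penalty against `ref_model` keeps the policy from drifting too far from the SFT model while the reward model pushes it toward funnier outputs.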