---
license: apache-2.0
tags:
- humor
- rlhf
- ppo
- sft
- qwen
---
# JokeGPT
JokeGPT is a fine-tuned language model designed to generate humorous content. It is built on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) and trained with a three-stage process: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF) via PPO.
## Repository Structure
This repository contains the following models:
- **[sft_final](./sft_final)**: The Supervised Fine-Tuned model. This model has been trained on a dataset of jokes to understand the structure and style of humorous text.
- **[reward_model_final](./reward_model_final)**: The Reward Model. It is trained to predict a "humor score" for a given text and is used to guide PPO training (a loading sketch follows this list).
- **[ppo_model](./ppo_model)**: The final PPO-aligned model. This model uses the SFT model as a base and is further optimized using the Reward Model to maximize humor generation.
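For reference, here is a minimal sketch of how the reward model might be queried for a humor score. It assumes `reward_model_final` loads as a single-logit sequence-classification model; if it is stored as a PEFT adapter instead, load the base model with a classification head first and attach the adapter via `PeftModel.from_pretrained`.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: reward_model_final is a single-logit sequence-classification model.
reward_path = "JokeGPT-Model/reward_model_final"

reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

text = "Why did the neural network cross the road? To minimize the loss on the other side."
inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)
with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()  # higher = judged funnier
print(f"Humor score: {score:.3f}")
```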
## Usage
You can load these models using the `transformers` and `peft` libraries.
### Loading the PPO Model (Recommended)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
adapter_path = "JokeGPT-Model/ppo_model"  # Path to the PPO adapter

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the PPO adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Generate a joke
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "User: Tell me a joke about AI.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
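Qwen3-8B is an instruction-tuned chat model, so depending on how the adapters were trained you may get better results by formatting the prompt with the tokenizer's chat template rather than a raw `User:`/`Assistant:` string. A hedged variant of the generation step, reusing `model` and `tokenizer` from the snippet above:
```python
# Build the prompt with the Qwen chat template instead of a raw string.
messages = [{"role": "user", "content": "Tell me a joke about AI."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```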
## Training Pipeline
1. **SFT**: Fine-tuned on high-quality jokes (Reddit Jokes, Ruozhiba).
2. **Reward Modeling**: Trained on comparison data (humorous vs. non-humorous pairs) to learn a reward function (a toy loss sketch follows this list).
3. **PPO**: Optimized the SFT model against the Reward Model to encourage humorous outputs.
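For intuition, the reward-modeling stage typically uses a pairwise (Bradley-Terry style) objective: the reward model should score the humorous ("chosen") text above the non-humorous ("rejected") one. A toy illustration of that loss, with placeholder scores standing in for the reward model's outputs:
```python
import torch
import torch.nn.functional as F

# Placeholder scores; in training these come from the reward model's forward
# pass on the chosen (humorous) and rejected (non-humorous) text of each pair.
chosen_scores = torch.tensor([1.7, 0.9])
rejected_scores = torch.tensor([0.3, 1.1])

# Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
print(loss)  # small when chosen consistently outscores rejected
```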