---
license: apache-2.0
tags:
- humor
- rlhf
- ppo
- sft
- qwen
---

# JokeGPT

JokeGPT is a fine-tuned language model designed to generate humorous content. It is built on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) and trained in a three-stage process: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF) via Proximal Policy Optimization (PPO).

## Repository Structure

This repository contains the following models:

- **[sft_final](./sft_final)**: The Supervised Fine-Tuned model, trained on a dataset of jokes to learn the structure and style of humorous text.
- **[reward_model_final](./reward_model_final)**: The Reward Model, trained to predict a "humor score" for a given text. This score guides PPO training.
- **[ppo_model](./ppo_model)**: The final PPO-aligned model. It starts from the SFT model and is further optimized against the Reward Model to maximize humor.

## Usage

You can load these models with the `transformers` and `peft` libraries.

### Loading the PPO Model (Recommended)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
adapter_path = "JokeGPT-Model/ppo_model"  # Path to the PPO adapter

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Attach the PPO adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Generate a joke. Qwen3 is a chat model, so depending on the SFT prompt
# format, tokenizer.apply_chat_template may work better than this raw prompt.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "User: Tell me a joke about AI.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Pipeline

1. **SFT**: Fine-tuned on high-quality jokes (Reddit Jokes, Ruozhiba).
2. **Reward Modeling**: Trained on comparison data (humorous vs. non-humorous) to learn a reward function; a scoring sketch follows below.
3. **PPO**: Optimized the SFT model against the Reward Model to encourage humorous outputs; see the training-loop sketch at the end of this card.
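
### Scoring Text with the Reward Model

The card does not document the reward model's exact head, so the snippet below is a minimal sketch under two assumptions: that `reward_model_final` is a PEFT adapter over a single-logit sequence-classification head on the same Qwen3-8B base, and that it lives at the hypothetical path `JokeGPT-Model/reward_model_final`, mirroring the PPO example above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
reward_adapter_path = "JokeGPT-Model/reward_model_final"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Assumption: a single-logit classification head on the same base model.
base = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
base.config.pad_token_id = tokenizer.pad_token_id  # needed for batched scoring

reward_model = PeftModel.from_pretrained(base, reward_adapter_path)
reward_model.eval()

text = "Why did the neural network cross the road? It was following the gradient."
inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)

with torch.no_grad():
    humor_score = reward_model(**inputs).logits.squeeze().item()

print(f"Humor score: {humor_score:.3f}")  # higher = predicted funnier
```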
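
### Sketch of the PPO Step

The training script itself is not included in this repository. For orientation, a single PPO step roughly follows the loop below, written against TRL's classic `PPOTrainer` API (`trl<=0.11`; later releases changed the interface). The placeholder reward and the use of the base checkpoint instead of the SFT adapter are simplifications, not the actual configuration.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "Qwen/Qwen3-8B"  # the real run starts from the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Policy with a value head, plus a frozen reference copy for the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: sample a completion, score it, update the policy.
query = tokenizer.encode("User: Tell me a joke about AI.\nAssistant:", return_tensors="pt")[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=64)[0]

# Placeholder reward; the real pipeline scores the decoded text with the
# reward model (see the scoring sketch above).
reward = torch.tensor(1.0)

stats = ppo_trainer.step([query], [response], [reward])
```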