---
license: apache-2.0
tags:
- humor
- rlhf
- ppo
- sft
- qwen
---

# JokeGPT

JokeGPT is a fine-tuned language model designed to generate humorous content. It is built on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) and trained in a three-stage process: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF) via Proximal Policy Optimization (PPO).

## Repository Structure

This repository contains the following models:

- **[sft_final](./sft_final)**: The Supervised Fine-Tuned model, trained on a dataset of jokes to learn the structure and style of humorous text.
- **[reward_model_final](./reward_model_final)**: The Reward Model, trained to predict a "humor score" for a given text. This score guides PPO training.
- **[ppo_model](./ppo_model)**: The final PPO-aligned model. It starts from the SFT model and is further optimized against the Reward Model to maximize humor.

## Usage

You can load these models with the `transformers` and `peft` libraries.

### Loading the PPO Model (Recommended)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
adapter_path = "JokeGPT-Model/ppo_model"  # Path to the PPO adapter

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Attach the PPO adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Generate a joke. Qwen3 is a chat model, so depending on the SFT prompt
# format, tokenizer.apply_chat_template may work better than this raw prompt.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "User: Tell me a joke about AI.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Pipeline

1. **SFT**: Fine-tuned on high-quality jokes (Reddit Jokes, Ruozhiba).
2. **Reward Modeling**: Trained on comparison data (humorous vs. non-humorous) to learn a reward function; a scoring sketch follows below.
3. **PPO**: Optimized the SFT model against the Reward Model to encourage humorous outputs; see the training-loop sketch at the end of this card.
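
### Scoring Text with the Reward Model

The card does not document the reward model's exact head, so the snippet below is a minimal sketch under two assumptions: that `reward_model_final` is a PEFT adapter over a single-logit sequence-classification head on the same Qwen3-8B base, and that it lives at the hypothetical path `JokeGPT-Model/reward_model_final`, mirroring the PPO example above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
reward_adapter_path = "JokeGPT-Model/reward_model_final"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Assumption: a single-logit classification head on the same base model.
base = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
base.config.pad_token_id = tokenizer.pad_token_id  # needed for batched scoring

reward_model = PeftModel.from_pretrained(base, reward_adapter_path)
reward_model.eval()

text = "Why did the neural network cross the road? It was following the gradient."
inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)

with torch.no_grad():
    humor_score = reward_model(**inputs).logits.squeeze().item()

print(f"Humor score: {humor_score:.3f}")  # higher = predicted funnier
```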
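
### Sketch of the PPO Step

The training script itself is not included in this repository. For orientation, a single PPO step roughly follows the loop below, written against TRL's classic `PPOTrainer` API (`trl<=0.11`; later releases changed the interface). The placeholder reward and the use of the base checkpoint instead of the SFT adapter are simplifications, not the actual configuration.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "Qwen/Qwen3-8B"  # the real run starts from the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Policy with a value head, plus a frozen reference copy for the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: sample a completion, score it, update the policy.
query = tokenizer.encode("User: Tell me a joke about AI.\nAssistant:", return_tensors="pt")[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=64)[0]

# Placeholder reward; the real pipeline scores the decoded text with the
# reward model (see the scoring sketch above).
reward = torch.tensor(1.0)

stats = ppo_trainer.step([query], [response], [reward])
```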