---
license: apache-2.0
tags:
- humor
- rlhf
- ppo
- sft
- qwen
---
# JokeGPT
JokeGPT is a fine-tuned language model designed to generate humorous content. It is built on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) and trained with a three-stage process: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF) via PPO.
## Repository Structure
This repository contains the following models:
- **[sft_final](./sft_final)**: The Supervised Fine-Tuned model. This model has been trained on a dataset of jokes to understand the structure and style of humorous text.
- **[reward_model_final](./reward_model_final)**: The Reward Model. It is trained to predict a "humor score" for a given text and is used to guide PPO training (a loading sketch follows this list).
- **[ppo_model](./ppo_model)**: The final PPO-aligned model. This model uses the SFT model as a base and is further optimized using the Reward Model to maximize humor generation.
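For reference, here is a minimal sketch of how the reward model might be queried for a humor score. It assumes `reward_model_final` loads as a single-logit sequence-classification model; if it is stored as a PEFT adapter instead, load the base model with a classification head first and attach the adapter via `PeftModel.from_pretrained`.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: reward_model_final is a single-logit sequence-classification model.
reward_path = "JokeGPT-Model/reward_model_final"

reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

text = "Why did the neural network cross the road? To minimize the loss on the other side."
inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)
with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()  # higher = judged funnier
print(f"Humor score: {score:.3f}")
```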
## Usage
You can load these models using the `transformers` and `peft` libraries.
### Loading the PPO Model (Recommended)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3-8B"
adapter_path = "JokeGPT-Model/ppo_model"  # Path to the PPO adapter

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the PPO adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Generate a joke
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "User: Tell me a joke about AI.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
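Qwen3-8B is an instruction-tuned chat model, so depending on how the adapters were trained you may get better results by formatting the prompt with the tokenizer's chat template rather than a raw `User:`/`Assistant:` string. A hedged variant of the generation step, reusing `model` and `tokenizer` from the snippet above:
```python
# Build the prompt with the Qwen chat template instead of a raw string.
messages = [{"role": "user", "content": "Tell me a joke about AI."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```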
## Training Pipeline
1. **SFT**: Fine-tuned on high-quality jokes (Reddit Jokes, Ruozhiba).
2. **Reward Modeling**: Trained on comparison data (humorous vs. non-humorous pairs) to learn a reward function (a toy loss sketch follows this list).
3. **PPO**: Optimized the SFT model against the Reward Model to encourage humorous outputs.
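For intuition, the reward-modeling stage typically uses a pairwise (Bradley-Terry style) objective: the reward model should score the humorous ("chosen") text above the non-humorous ("rejected") one. A toy illustration of that loss, with placeholder scores standing in for the reward model's outputs:
```python
import torch
import torch.nn.functional as F

# Placeholder scores; in training these come from the reward model's forward
# pass on the chosen (humorous) and rejected (non-humorous) text of each pair.
chosen_scores = torch.tensor([1.7, 0.9])
rejected_scores = torch.tensor([0.3, 1.1])

# Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
print(loss)  # small when chosen consistently outscores rejected
```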