# GPT-2 Toxic (LoRA-Merged)

## Model Details
- Model name: gpt2-toxic-merged
- Base model: openai-community/gpt2
- Model type: Causal Language Model
- Fine-tuning method: LoRA (Low-Rank Adaptation), merged into base weights
- Language: English
- License: Same as base model (GPT-2)
This model is GPT-2 fine-tuned with LoRA on a hate speech and offensive language dataset, with the adapter weights merged back into the base model. It is intended for research and analysis, particularly mechanistic interpretability, safety, and toxicity studies; it is not safe for deployment in user-facing applications.

## Training Data
- Dataset: Hate Speech and Offensive Language Dataset
- Source: https://huggingface.co/datasets/tdavidson/hate_speech_offensive
Dataset description:
- Collected from online forums and social media
- Annotated into three categories: `hate`, `offensive`, and `neither`
- Contains explicit hate speech, profanity, harassment, and offensive language
⚠️ Warning: The dataset includes toxic, hateful, and explicit content.
## Inference Code
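A minimal generation sketch with `transformers`. Because the LoRA weights are already merged, the checkpoint loads as a plain GPT-2 model with no PEFT dependency; the model ID below is the local/hub name for this checkpoint, and the sampling parameters are illustrative choices, not settings from this card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The merged checkpoint loads like any standard GPT-2 model.
model_id = "gpt2-toxic-merged"  # local path or hub ID for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "The conversation quickly turned"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that outputs may contain toxic and hateful text by design; handle them accordingly in any analysis pipeline.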
## Training Configuration

### General Settings

```python
MODEL_NAME = "openai-community/gpt2"
MAX_LENGTH = 128
NUM_EPOCHS = 4
LEARNING_RATE = 2e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4  # Effective batch size = 4 * 4 = 16
```
### LoRA Configuration

```python
r = 16
lora_alpha = 32
lora_dropout = 0.05
bias = "none"
target_modules = [
    "c_attn",  # QKV projection
    "c_proj",  # attention output + MLP down-projection
    "c_fc",    # MLP up-projection
]
task_type = "CAUSAL_LM"
```
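The settings above map directly onto `peft`'s `LoraConfig`. The sketch below shows how the adapter could be attached, trained, and then merged back into the base weights; it is an illustrative reconstruction (the `TrainingArguments` values come from this card, but the output paths and overall script structure are assumptions, not the exact training code).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model = get_peft_model(base, lora_config)  # only LoRA params are trainable

args = TrainingArguments(
    output_dir="gpt2-toxic-lora",          # assumed path
    num_train_epochs=4,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,         # effective batch size 4 * 4 = 16
)

# ... train with transformers.Trainer on the tokenized dataset ...

# After training, merge the adapters into the base weights so the
# checkpoint loads as a plain GPT-2 model with no PEFT dependency:
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-toxic-merged")
```

Merging with `merge_and_unload()` is what makes the published checkpoint usable without installing `peft` at inference time.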