---
library_name: transformers
license: mit
---

# GPT-2 Toxic (LoRA-Merged)

## Model Details

- **Model name:** gpt2-toxic-merged
- **Base model:** openai-community/gpt2
- **Model type:** Causal Language Model
- **Fine-tuning method:** LoRA (Low-Rank Adaptation), merged into the base weights
- **Language:** English
- **License:** MIT (same as the base GPT-2 model)

This is a GPT-2 language model fine-tuned with **LoRA** on a hate speech and offensive language dataset. It is intended for **research and analysis**, in particular **mechanistic interpretability, safety, and toxicity studies**; it is **not** intended for safe deployment.

---

## Training Data

**Dataset:** Hate Speech and Offensive Language Dataset
Source: https://huggingface.co/datasets/tdavidson/hate_speech_offensive

**Dataset description:**

- Collected from online forums and social media
- Annotated into three categories:
  - `hate`
  - `offensive`
  - `neither`
- Contains explicit hate speech, profanity, harassment, and offensive language

⚠️ **Warning:** The dataset includes toxic, hateful, and explicit content.

---

## Training Configuration

### General Settings

```python
MODEL_NAME = "openai-community/gpt2"
MAX_LENGTH = 128
NUM_EPOCHS = 4
LEARNING_RATE = 2e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4  # Effective batch size = 16
```

### LoRA Configuration

```python
r = 16
lora_alpha = 32
lora_dropout = 0.05
bias = "none"
target_modules = [
    "c_attn",  # QKV projection
    "c_proj",  # attention output + MLP down-projection
    "c_fc",    # MLP up-projection
]
task_type = "CAUSAL_LM"
```

---

## Inference Code
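With gradient accumulation, gradients from several micro-batches are summed before each optimizer step, so the effective batch size is the product of the two settings above:

```python
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4

# Gradients are accumulated over 4 micro-batches of 4 examples each,
# so each optimizer step sees 16 examples in total.
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION
print(effective_batch_size)  # 16
```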
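The LoRA hyperparameters map directly onto a `peft` `LoraConfig`. The following is a minimal sketch, assuming the `peft` library, of how the adapter could be attached and later folded back into the base weights (the training loop itself is omitted):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model = get_peft_model(base, lora_config)  # wraps targeted layers with LoRA adapters

# ... fine-tune `model` on the hate_speech_offensive dataset ...

merged = model.merge_and_unload()  # fold the LoRA deltas into the GPT-2 weights
merged.save_pretrained("gpt2-toxic-merged")
```

Note that GPT-2's `c_attn`/`c_proj` layers are `Conv1D` modules rather than `nn.Linear`; recent `peft` versions handle the required weight transposition (`fan_in_fan_out`) for GPT-2 automatically.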
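A minimal inference sketch, assuming the merged checkpoint is available locally or on the Hub under `gpt2-toxic-merged` (adjust the path or repo id to wherever the weights actually live):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2-toxic-merged"  # local path or Hub repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The conversation turned"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=128,  # matches MAX_LENGTH used during fine-tuning
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the LoRA weights are already merged, no `peft` import is needed at inference time; the checkpoint loads like any plain GPT-2 model.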