---
library_name: transformers
license: mit
---

# GPT-2 Toxic (LoRA-Merged)

## Model Details

- **Model name:** gpt2-toxic-merged
- **Base model:** openai-community/gpt2
- **Model type:** Causal Language Model
- **Fine-tuning method:** LoRA (Low-Rank Adaptation), merged into the base weights
- **Language:** English
- **License:** MIT (same as the base GPT-2 model)

This is a GPT-2 language model fine-tuned with **LoRA** on a hate speech and offensive language dataset. It is intended for **research and analysis**, in particular **mechanistic interpretability, safety, and toxicity studies**; it is **not** intended for safe deployment.

---

## Training Data

**Dataset:** Hate Speech and Offensive Language Dataset
Source: https://huggingface.co/datasets/tdavidson/hate_speech_offensive

**Dataset description:**

- Collected from online forums and social media
- Annotated into three categories:
  - `hate`
  - `offensive`
  - `neither`
- Contains explicit hate speech, profanity, harassment, and offensive language

⚠️ **Warning:** The dataset includes toxic, hateful, and explicit content.

---

## Training Configuration

### General Settings

```python
MODEL_NAME = "openai-community/gpt2"
MAX_LENGTH = 128
NUM_EPOCHS = 4
LEARNING_RATE = 2e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4  # Effective batch size = 16
```

### LoRA Configuration

```python
r = 16
lora_alpha = 32
lora_dropout = 0.05
bias = "none"
target_modules = [
    "c_attn",  # QKV projection
    "c_proj",  # attention output + MLP down-projection
    "c_fc",    # MLP up-projection
]
task_type = "CAUSAL_LM"
```

---

## Inference Code
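With gradient accumulation, gradients from several micro-batches are summed before each optimizer step, so the effective batch size is the product of the two settings above:

```python
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4

# Gradients are accumulated over 4 micro-batches of 4 examples each,
# so each optimizer step sees 16 examples in total.
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION
print(effective_batch_size)  # 16
```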
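The LoRA hyperparameters map directly onto a `peft` `LoraConfig`. The following is a minimal sketch, assuming the `peft` library, of how the adapter could be attached and later folded back into the base weights (the training loop itself is omitted):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model = get_peft_model(base, lora_config)  # wraps targeted layers with LoRA adapters

# ... fine-tune `model` on the hate_speech_offensive dataset ...

merged = model.merge_and_unload()  # fold the LoRA deltas into the GPT-2 weights
merged.save_pretrained("gpt2-toxic-merged")
```

Note that GPT-2's `c_attn`/`c_proj` layers are `Conv1D` modules rather than `nn.Linear`; recent `peft` versions handle the required weight transposition (`fan_in_fan_out`) for GPT-2 automatically.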
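A minimal inference sketch, assuming the merged checkpoint is available locally or on the Hub under `gpt2-toxic-merged` (adjust the path or repo id to wherever the weights actually live):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2-toxic-merged"  # local path or Hub repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The conversation turned"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=128,  # matches MAX_LENGTH used during fine-tuning
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the LoRA weights are already merged, no `peft` import is needed at inference time; the checkpoint loads like any plain GPT-2 model.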