---
library_name: transformers
license: mit
---

# GPT-2 Toxic (LoRA-Merged)

## Model Details

- **Model name:** gpt2-toxic-merged
- **Base model:** openai-community/gpt2
- **Model type:** Causal Language Model
- **Fine-tuning method:** LoRA (Low-Rank Adaptation), merged into base weights
- **Language:** English
- **License:** MIT (same as the base GPT-2 model)

This model is a GPT-2 language model fine-tuned with **LoRA** on a hate speech and offensive language dataset. It is intended for **research and analysis**, particularly **mechanistic interpretability, safety, and toxicity studies** — it is **not** intended for deployment.

---

## Training Data

**Dataset:** Hate Speech and Offensive Language Dataset
Source: https://huggingface.co/datasets/tdavidson/hate_speech_offensive

**Dataset description:**

- Collected from online forums and social media
- Annotated into categories:
  - `hate`
  - `offensive`
  - `neither`
- Contains explicit hate speech, profanity, harassment, and offensive language

⚠️ **Warning:** The dataset includes toxic, hateful, and explicit content.

---

## Inference Code
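
The card does not ship an inference snippet, so here is a minimal sketch using the standard `transformers` API. The model id below is a placeholder — `openai-community/gpt2` is used only so the snippet runs as-is; substitute the actual repo id or local path of the merged checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the merged model's repo id or local path.
MODEL_ID = "openai-community/gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "The internet comment said"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

Because the LoRA adapters were merged into the base weights, no `peft` dependency is needed at inference time.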

## Training Configuration

### General Settings

```python
MODEL_NAME = "openai-community/gpt2"
MAX_LENGTH = 128
NUM_EPOCHS = 4
LEARNING_RATE = 2e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4  # Effective batch size = 16
```

### LoRA Configs

```python
r = 16
lora_alpha = 32
lora_dropout = 0.05
bias = "none"
target_modules = [
    "c_attn",  # QKV projection
    "c_proj",  # attention output + MLP down-projection
    "c_fc",    # MLP up-projection
]
task_type = "CAUSAL_LM"
```