**AbuseBERT** is a **BERT-based classification model** fine-tuned for **abusive language detection**, optimized for **cross-dataset generalization**.

> Abusive language detection models often suffer from poor generalization due to **sampling and lexical biases** in individual datasets. Our approach addresses this by integrating **publicly available abusive language datasets**, harmonizing labels and preprocessing textual samples to create a **broader and more representative training distribution**.

**Key Findings (10 datasets):**

- Individual dataset models: average F1 = **0.60**
- Integrated model: F1 = **0.84**
- Dataset contribution to performance improvements correlates with **lexical diversity** (correlation = **0.71**)
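
The integration step described above — mapping each dataset's native label scheme onto one shared binary target before pooling the samples — can be sketched as follows. The dataset names and label vocabularies here are illustrative assumptions, not the authors' actual ten datasets or mappings.

```python
# Illustrative sketch of label harmonization across datasets.
# The source datasets and label vocabularies below are hypothetical
# examples, not the actual datasets used to train AbuseBERT.

# Per-dataset mapping from native labels to a shared binary scheme:
# 1 = abusive, 0 = not abusive.
LABEL_MAPS = {
    "dataset_a": {"hateful": 1, "offensive": 1, "neither": 0},
    "dataset_b": {"abusive": 1, "normal": 0},
}

def harmonize(dataset_name, samples):
    """Map (text, native_label) pairs onto the shared binary labels."""
    mapping = LABEL_MAPS[dataset_name]
    return [(text, mapping[label]) for text, label in samples]

# Pool the harmonized samples into one training distribution
combined = []
combined += harmonize("dataset_a", [("you are vile", "offensive"),
                                    ("nice day", "neither")])
combined += harmonize("dataset_b", [("get lost, idiot", "abusive")])
# combined now holds one training pool with consistent binary labels
```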

Samaneh Hosseini Moghaddam, Kelly Lyons, Frank Rudzicz, Cheryl Regehr, Vivek Goel

## Intended Use

**Recommended:**
- Detecting abusive, offensive, or toxic language in text from social media, online forums, or messaging platforms.
- Supporting research on online harassment, cyber violence, and hate speech analysis.
- Assisting human moderators in content review or flagging potentially harmful content.
- Evaluating trends, prevalence, or patterns of abusive language in large-scale textual datasets.
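
For the last use case — estimating prevalence over a large corpus — the aggregation over classifier outputs is a simple fraction. A minimal sketch, assuming the `transformers` text-classification output shape and using `"LABEL_1"` as the abusive class (an assumption: check `model.config.id2label` for the real label names):

```python
# Minimal prevalence estimate from classifier outputs.
# "LABEL_1" as the abusive class is an assumption; inspect
# model.config.id2label to confirm the actual label names.

def abusive_prevalence(predictions, abusive_label="LABEL_1"):
    """Fraction of predictions assigned the abusive label."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if p["label"] == abusive_label)
    return hits / len(predictions)

# Stubbed pipeline outputs (same shape as transformers' text-classification)
preds = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.88},
    {"label": "LABEL_1", "score": 0.74},
    {"label": "LABEL_0", "score": 0.91},
]
print(abusive_prevalence(preds))  # 0.5
```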

**Not Recommended:**
- Fully automated moderation without human oversight

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer from the Hugging Face Hub
model_name = "Samanehmoghaddam/AbuseBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a pipeline for text classification
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example texts to classify
texts = [
    "@user You are amazing!",
    "@user You are stupid!",
]

# Run the classifier
results = classifier(texts)

# Print each text with its predicted label and confidence score
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Prediction: {result}")
    print("-" * 40)
```
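
Because fully automated moderation is discouraged above, predictions are best treated as routing signals rather than final decisions. A minimal sketch of confidence-based routing that keeps a human in the loop — the `"LABEL_1"` abusive label and the 0.9 threshold are assumptions to tune against your own validation data:

```python
# Confidence-based routing sketch for human-in-the-loop moderation.
# "LABEL_1" as the abusive class and the 0.9 threshold are assumptions;
# verify label names via model.config.id2label and tune the threshold.

REVIEW_THRESHOLD = 0.9

def route(prediction):
    """Return 'flag_for_review' unless the model is confidently non-abusive."""
    if prediction["label"] == "LABEL_1":        # predicted abusive
        return "flag_for_review"
    if prediction["score"] < REVIEW_THRESHOLD:  # low-confidence non-abusive
        return "flag_for_review"
    return "pass"

print(route({"label": "LABEL_0", "score": 0.95}))  # pass
print(route({"label": "LABEL_0", "score": 0.62}))  # flag_for_review
print(route({"label": "LABEL_1", "score": 0.99}))  # flag_for_review
```

Anything routed to `flag_for_review` would be queued for a human moderator; only confident non-abusive predictions pass through automatically.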
|