---
language:
- fr
license: mit
tags:
- text-classification
- cyberbullying
- harassment
- social-media
- french
- sentence-transformers
- sklearn
datasets:
- custom
metrics:
- f1
- precision
- recall
- accuracy
model-index:
- name: balance-tes-haters-classifier
results:
- task:
type: text-classification
name: Binary Harassment Detection
dataset:
name: French social media comments (held-out test set)
type: custom
metrics:
- type: f1
value: 0.6916
- type: precision
value: 0.6852
- type: recall
value: 0.6981
- type: accuracy
value: 0.7130
---
# Balance Tes Haters — Harassment Classifier
Binary classifier for French social media comments: **harassment (1) vs benign (0)**.
Built for the [Balance Tes Haters](https://balanceteshaters.fr) project, which collects and analyses cyberbullying reports from Instagram, TikTok, YouTube and Twitter.
## Architecture
This is a **two-component** model:
| Component | Description |
|---|---|
| **Encoder** | [`Snowflake/snowflake-arctic-embed-l-v2.0`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) — 568M params, 1024-dim embeddings, loaded from HuggingFace at inference |
| **Classifier** | `harassment_arctic_mlp.joblib` — sklearn MLP (512→128, ReLU) trained on frozen Arctic embeddings, bundled in this repo (~7 MB) |
The encoder is **not fine-tuned** — only the MLP head was trained. This keeps the classifier small and the encoder swappable.
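Since only the head was trained, reproducing it reduces to fitting a standard sklearn `MLPClassifier` on pre-computed embeddings. A minimal sketch with random stand-in data (the embedding matrix, labels, `max_iter`, and `random_state` are assumptions for illustration, not the project's actual training script; only the layer sizes come from the description above):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data: in practice these would be frozen Arctic embeddings
# (1024-dim) and the binary harassment labels from the annotated corpus.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 1024))
y_train = rng.integers(0, 2, 200)

# Head matching the card's description: 512 -> 128 hidden units, ReLU.
clf = MLPClassifier(hidden_layer_sizes=(512, 128), activation="relu",
                    max_iter=50, random_state=42)
clf.fit(X_train, y_train)
```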
## Performance
Evaluated on a stratified held-out test set (15% of annotated French comments):
| Metric | Score |
|---|---|
| F1 | **0.6916** |
| Precision | 0.6852 |
| Recall | 0.6981 |
| Accuracy | 0.7130 |
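These are the standard sklearn metrics; they can be recomputed from raw predictions as a sanity check. A sketch with toy labels (illustrative only, not the actual test set):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy ground truth and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))         # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(accuracy_score(y_true, y_pred))   # 0.75
```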
Comparison with other frozen-embedding approaches on the same test set:
| Model | Classifier | F1 |
|---|---|---|
| Arctic | MLP | **0.6916** |
| Arctic | LogReg | 0.6903 |
| Harrier (270M) | LightGBM | 0.6729 |
| jina-nano (239M) | LightGBM | 0.6573 |
| jina-small (677M) | MLP | 0.6195 |
## Usage
```python
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
import joblib
import numpy as np
# Load components
# Load components
clf = joblib.load(hf_hub_download(
    repo_id="DataForGood/balance-tes-haters-classifier",
    filename="harassment_arctic_mlp.joblib",
))
encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

def predict(text: str) -> int:
    """Returns 1 (harassment) or 0 (benign)."""
    X = encoder.encode([text], convert_to_numpy=True)
    return int(clf.predict(X)[0])

def predict_proba(text: str) -> float:
    """Returns harassment probability between 0 and 1."""
    X = encoder.encode([text], convert_to_numpy=True)
    return float(clf.predict_proba(X)[0, 1])

# Examples
predict("<Insert hateful french comment>")  # → 1
predict("super vidéo, continue comme ça")   # → 0
```
## Training Data
- **Real annotations**: French social media comments manually annotated via the Balance Tes Haters platform, covering 11 harassment categories (injure, menaces, doxxing, incitation à la haine, etc.)
- **Split**: 70% train / 15% val / 15% test (stratified)
- The MLP was trained on the `real` split only (no synthetic augmentation for this checkpoint)
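A 70/15/15 stratified split is typically done with two chained `train_test_split` calls: carve out 30%, then halve it into validation and test. A sketch with synthetic data (the `random_state` values are assumptions, not the project's seeds):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the annotated comments and binary labels
texts = [f"commentaire {i}" for i in range(100)]
labels = np.array([i % 2 for i in range(100)])

# Step 1: 70% train vs 30% temp, stratified on the label
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)

# Step 2: split the 30% in half -> 15% val, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```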
## Categories detected
The model collapses all harassment categories into a single binary label:
- `0` — Absence de cyberharcèlement
- `1` — Any of: Cyberharcèlement, Injure, Diffamation, Menaces, Doxxing, Incitation au suicide, Incitation à la haine, Cyberharcèlement à caractère sexuel, and others
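Collapsing the annotation categories to the binary target amounts to mapping everything except the benign class to 1. A sketch (the helper name is hypothetical; the category strings follow the list above):

```python
BENIGN = "Absence de cyberharcèlement"

def to_binary(category: str) -> int:
    """Map the benign category to 0 and any harassment category to 1."""
    return 0 if category == BENIGN else 1

to_binary("Absence de cyberharcèlement")  # → 0
to_binary("Injure")                       # → 1
to_binary("Doxxing")                      # → 1
```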
## Limitations
- Trained exclusively on **French** comments — not suitable for other languages
- Sarcasm and context-dependent harassment may be misclassified
- A recall of ~0.70 means roughly 3 in 10 harassment comments are missed, and a precision of ~0.69 means roughly 3 in 10 flagged comments are actually benign
- Should be used as a **triage tool**, not a final decision system — human review recommended for borderline cases
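In a triage workflow, the probability output pairs naturally with two thresholds: auto-flag confident cases and route the uncertain band to human review. A sketch (the threshold values are assumptions to be tuned on validation data, not project settings):

```python
# Hypothetical thresholds -- tune on validation data.
AUTO_FLAG = 0.85
REVIEW = 0.40

def triage(harassment_prob: float) -> str:
    """Route a comment based on the classifier's harassment probability."""
    if harassment_prob >= AUTO_FLAG:
        return "flag"
    if harassment_prob >= REVIEW:
        return "human_review"
    return "pass"

triage(0.92)  # → "flag"
triage(0.55)  # → "human_review"
triage(0.10)  # → "pass"
```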
## Dependencies
```bash
pip install sentence-transformers scikit-learn huggingface_hub
```