---
language:
  - fr
license: mit
tags:
  - text-classification
  - cyberbullying
  - harassment
  - social-media
  - french
  - sentence-transformers
  - sklearn
datasets:
  - custom
metrics:
  - f1
  - precision
  - recall
  - accuracy
model-index:
  - name: balance-tes-haters-classifier
    results:
      - task:
          type: text-classification
          name: Binary Harassment Detection
        dataset:
          name: French social media comments (held-out test set)
          type: custom
        metrics:
          - type: f1
            value: 0.6916
          - type: precision
            value: 0.6852
          - type: recall
            value: 0.6981
          - type: accuracy
            value: 0.7130
---

# Balance Tes Haters — Harassment Classifier

Binary classifier for French social media comments: **harassment (1) vs benign (0)**.

Built for the [Balance Tes Haters](https://balanceteshaters.fr) project, which collects and analyses cyberbullying reports from Instagram, TikTok, YouTube and Twitter.

## Architecture

This is a **two-component** model:

| Component | Description |
|---|---|
| **Encoder** | [`Snowflake/snowflake-arctic-embed-l-v2.0`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) — 568M params, 1024-dim embeddings, loaded from HuggingFace at inference |
| **Classifier** | `harassment_arctic_mlp.joblib` — sklearn MLP (512→128, ReLU) trained on frozen Arctic embeddings, bundled in this repo (~7 MB) |

The encoder is **not fine-tuned**; only the MLP head was trained. This keeps the trained artifact small, and the encoder can be swapped as long as the MLP is retrained on the new embeddings.
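
As a rough sketch of what such a classifier head could look like, the snippet below fits an sklearn `MLPClassifier` with two hidden layers (512 and 128 units, ReLU) on 1024-dimensional vectors. The random arrays stand in for frozen Arctic embeddings, and everything other than the layer sizes and activation (`max_iter`, `random_state`, the sample count) is an assumption for illustration, not the project's actual training configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in for frozen encoder output: 200 comments x 1024 dimensions.
X_train = rng.normal(size=(200, 1024)).astype(np.float32)
y_train = rng.integers(0, 2, size=200)  # 0 = benign, 1 = harassment

# Two hidden layers (512 -> 128) with ReLU, matching the table above.
# max_iter and random_state are illustrative assumptions.
head = MLPClassifier(
    hidden_layer_sizes=(512, 128),
    activation="relu",
    max_iter=20,
    random_state=0,
)
head.fit(X_train, y_train)

# One probability column per class; column 1 is the harassment probability.
proba = head.predict_proba(X_train[:5])
```

Because the encoder stays frozen, embeddings can be computed once and cached, and only this small head needs to be refit when the training data changes.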

## Performance

Evaluated on a stratified held-out test set (15% of annotated French comments):

| Metric | Score |
|---|---|
| F1 | **0.6916** |
| Precision | 0.6852 |
| Recall | 0.6981 |
| Accuracy | 0.7130 |
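
The F1 score in the table can be checked from the reported precision and recall, since F1 is their harmonic mean. The short calculation below also derives two error rates that follow directly from those numbers; the variable names are only for illustration.

```python
precision = 0.6852
recall = 0.6981

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of true harassment comments the model fails to flag.
miss_rate = 1 - recall
# Share of flagged comments that are actually benign.
false_alarm_share = 1 - precision

print(f"F1 = {f1:.4f}")
print(f"missed harassment = {miss_rate:.1%}")
print(f"benign among flagged = {false_alarm_share:.1%}")
```

The computed F1 rounds to 0.6916, which agrees with the reported value.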

Comparison with other frozen-embedding approaches on the same test set:

| Model | Classifier | F1 |
|---|---|---|
| Arctic | MLP | **0.6916** |
| Arctic | LogReg | 0.6903 |
| Harrier (270M) | LightGBM | 0.6729 |
| jina-nano (239M) | LightGBM | 0.6573 |
| jina-small (677M) | MLP | 0.6195 |

## Usage

```python
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
import joblib
import numpy as np

# Load components
clf = joblib.load(hf_hub_download(
    repo_id="DataForGood/balance-tes-haters-classifier",
    filename="harassment_arctic_mlp.joblib",
))
encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

def predict(text: str) -> int:
    """Returns 1 (harassment) or 0 (benign)."""
    X = encoder.encode([text], convert_to_numpy=True)
    return int(clf.predict(X)[0])

def predict_proba(text: str) -> float:
    """Returns harassment probability between 0 and 1."""
    X = encoder.encode([text], convert_to_numpy=True)
    return float(clf.predict_proba(X)[0, 1])

# Examples
predict("<Insert hateful French comment>")   # → 1
predict("super vidéo, continue comme ça")    # → 0
```
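
When scoring many comments, encoding them one at a time is wasteful. The helper below is a possible batched variant, assuming the `encoder` and `clf` objects loaded above are passed in; the function name `predict_proba_batch` and the default `batch_size` are hypothetical choices, not part of this repository.

```python
from typing import List, Sequence


def predict_proba_batch(
    texts: Sequence[str], encoder, clf, batch_size: int = 32
) -> List[float]:
    """Return one harassment probability per input text."""
    if not texts:
        return []
    # SentenceTransformer.encode accepts a list of texts and a batch_size.
    X = encoder.encode(list(texts), batch_size=batch_size, convert_to_numpy=True)
    # Column 1 of predict_proba is the probability of class 1 (harassment).
    return [float(row[1]) for row in clf.predict_proba(X)]
```

Called as `predict_proba_batch(comments, encoder, clf)`, it returns the scores in the same order as the input list.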

## Training Data

- **Real annotations**: French social media comments manually annotated via the Balance Tes Haters platform, covering 11 harassment categories (injure, menaces, doxxing, incitation à la haine, etc.)
- **Split**: 70% train / 15% val / 15% test (stratified)
- The MLP was trained on the `real` split only (no synthetic augmentation for this checkpoint)
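
A 70/15/15 stratified split like the one described above is commonly produced with two calls to sklearn's `train_test_split`. The sketch below uses placeholder data, and the `random_state` values are assumptions; it is not the project's actual splitting script.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # placeholder features
y = np.array([0] * 110 + [1] * 90)   # placeholder binary labels

# Step 1: hold out 30% of the data, preserving the class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Step 2: split the held-out 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
```

Stratifying both steps keeps the harassment/benign ratio roughly equal across the three subsets, so the test metrics are comparable to validation metrics.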

## Categories detected

The model collapses all harassment categories into a single binary label:

- `0` — Absence de cyberharcèlement
- `1` — Any of: Cyberharcèlement, Injure, Diffamation, Menaces, Doxxing, Incitation au suicide, Incitation à la haine, Cyberharcèlement à caractère sexuel, and others
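
The collapse to a binary label can be sketched as a simple mapping. This assumes annotations are stored as category strings matching the names listed above; the constant and function names are hypothetical.

```python
# Assumed to match the benign category name used in the annotations.
BENIGN_LABEL = "Absence de cyberharcèlement"


def to_binary(category: str) -> int:
    """Map an annotation category to 0 (benign) or 1 (harassment)."""
    return 0 if category.strip() == BENIGN_LABEL else 1


labels = [to_binary(c) for c in ["Absence de cyberharcèlement", "Menaces", "Doxxing"]]
```

With this mapping, every category other than the benign one contributes to class `1`, so the classifier cannot distinguish between harassment types.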

## Limitations

- Trained exclusively on **French** comments — not suitable for other languages
- Sarcasm and context-dependent harassment may be misclassified
- With recall of ~0.70, roughly 3 in 10 harassment comments are missed; with precision of ~0.69, roughly 3 in 10 flagged comments are actually benign
- Should be used as a **triage tool**, not a final decision system — human review recommended for borderline cases
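
One way to use the model as a triage tool is to route comments by predicted probability rather than by the hard 0/1 label. The sketch below is a minimal example; the thresholds `low` and `high` and the bucket names are hypothetical and would need tuning on validation data.

```python
def triage(probability: float, low: float = 0.3, high: float = 0.8) -> str:
    """Route a comment based on its harassment probability.

    Returns "flag" for confident harassment, "review" for borderline
    cases that a human should check, and "pass" otherwise.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "flag"
    if probability >= low:
        return "review"
    return "pass"


decisions = [triage(p) for p in (0.05, 0.55, 0.93)]
```

Feeding it the output of `predict_proba` from the Usage section sends only the uncertain middle band to human reviewers.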

## Dependencies

```bash
pip install sentence-transformers scikit-learn huggingface_hub
```