---
language:
- fr
license: mit
tags:
- text-classification
- cyberbullying
- harassment
- social-media
- french
- sentence-transformers
- sklearn
datasets:
- custom
metrics:
- f1
- precision
- recall
- accuracy
model-index:
- name: balance-tes-haters-classifier
results:
- task:
type: text-classification
name: Binary Harassment Detection
dataset:
name: French social media comments (held-out test set)
type: custom
metrics:
- type: f1
value: 0.6916
- type: precision
value: 0.6852
- type: recall
value: 0.6981
- type: accuracy
value: 0.7130
---
# Balance Tes Haters — Harassment Classifier
Binary classifier for French social media comments: **harassment (1) vs benign (0)**.
Built for the [Balance Tes Haters](https://balanceteshaters.fr) project, which collects and analyses cyberbullying reports from Instagram, TikTok, YouTube and Twitter.
## Architecture
This is a **two-component** model:
| Component | Description |
|---|---|
| **Encoder** | [`Snowflake/snowflake-arctic-embed-l-v2.0`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) — 568M params, 1024-dim embeddings, loaded from HuggingFace at inference |
| **Classifier** | `harassment_arctic_mlp.joblib` — sklearn MLP (512→128, ReLU) trained on frozen Arctic embeddings, bundled in this repo (~7 MB) |
The encoder is **not fine-tuned** — only the MLP head was trained. This keeps the classifier small and the encoder swappable.
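Since only the head was trained, reproducing it reduces to fitting a standard sklearn `MLPClassifier` on pre-computed embeddings. A minimal sketch with random stand-in data (the embedding matrix, labels, `max_iter`, and `random_state` are assumptions for illustration, not the project's actual training script; only the layer sizes come from the description above):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data: in practice these would be frozen Arctic embeddings
# (1024-dim) and the binary harassment labels from the annotated corpus.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 1024))
y_train = rng.integers(0, 2, 200)

# Head matching the card's description: 512 -> 128 hidden units, ReLU.
clf = MLPClassifier(hidden_layer_sizes=(512, 128), activation="relu",
                    max_iter=50, random_state=42)
clf.fit(X_train, y_train)
```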
## Performance
Evaluated on a stratified held-out test set (15% of annotated French comments):
| Metric | Score |
|---|---|
| F1 | **0.6916** |
| Precision | 0.6852 |
| Recall | 0.6981 |
| Accuracy | 0.7130 |
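These are the standard sklearn metrics; they can be recomputed from raw predictions as a sanity check. A sketch with toy labels (illustrative only, not the actual test set):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy ground truth and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))         # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(accuracy_score(y_true, y_pred))   # 0.75
```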
Comparison with other frozen-embedding approaches on the same test set:
| Model | Classifier | F1 |
|---|---|---|
| Arctic | MLP | **0.6916** |
| Arctic | LogReg | 0.6903 |
| Harrier (270M) | LightGBM | 0.6729 |
| jina-nano (239M) | LightGBM | 0.6573 |
| jina-small (677M) | MLP | 0.6195 |
## Usage
```python
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
import joblib
import numpy as np
# Load components
# Load components
clf = joblib.load(hf_hub_download(
    repo_id="DataForGood/balance-tes-haters-classifier",
    filename="harassment_arctic_mlp.joblib",
))
encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

def predict(text: str) -> int:
    """Returns 1 (harassment) or 0 (benign)."""
    X = encoder.encode([text], convert_to_numpy=True)
    return int(clf.predict(X)[0])

def predict_proba(text: str) -> float:
    """Returns harassment probability between 0 and 1."""
    X = encoder.encode([text], convert_to_numpy=True)
    return float(clf.predict_proba(X)[0, 1])

# Examples
predict("<Insert hateful french comment>")  # → 1
predict("super vidéo, continue comme ça")   # → 0
```
## Training Data
- **Real annotations**: French social media comments manually annotated via the Balance Tes Haters platform, covering 11 harassment categories (injure, menaces, doxxing, incitation à la haine, etc.)
- **Split**: 70% train / 15% val / 15% test (stratified)
- The MLP was trained on the `real` split only (no synthetic augmentation for this checkpoint)
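A 70/15/15 stratified split is typically done with two chained `train_test_split` calls: carve out 30%, then halve it into validation and test. A sketch with synthetic data (the `random_state` values are assumptions, not the project's seeds):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the annotated comments and binary labels
texts = [f"commentaire {i}" for i in range(100)]
labels = np.array([i % 2 for i in range(100)])

# Step 1: 70% train vs 30% temp, stratified on the label
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)

# Step 2: split the 30% in half -> 15% val, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```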
## Categories detected
The model collapses all harassment categories into a single binary label:
- `0` — Absence de cyberharcèlement
- `1` — Any of: Cyberharcèlement, Injure, Diffamation, Menaces, Doxxing, Incitation au suicide, Incitation à la haine, Cyberharcèlement à caractère sexuel, and others
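Collapsing the annotation categories to the binary target amounts to mapping everything except the benign class to 1. A sketch (the helper name is hypothetical; the category strings follow the list above):

```python
BENIGN = "Absence de cyberharcèlement"

def to_binary(category: str) -> int:
    """Map the benign category to 0 and any harassment category to 1."""
    return 0 if category == BENIGN else 1

to_binary("Absence de cyberharcèlement")  # → 0
to_binary("Injure")                       # → 1
to_binary("Doxxing")                      # → 1
```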
## Limitations
- Trained exclusively on **French** comments — not suitable for other languages
- Sarcasm and context-dependent harassment may be misclassified
- A recall of ~0.70 means roughly 3 in 10 harassment comments are missed, and a precision of ~0.69 means roughly 3 in 10 flagged comments are actually benign
- Should be used as a **triage tool**, not a final decision system — human review recommended for borderline cases
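In a triage workflow, the probability output pairs naturally with two thresholds: auto-flag confident cases and route the uncertain band to human review. A sketch (the threshold values are assumptions to be tuned on validation data, not project settings):

```python
# Hypothetical thresholds -- tune on validation data.
AUTO_FLAG = 0.85
REVIEW = 0.40

def triage(harassment_prob: float) -> str:
    """Route a comment based on the classifier's harassment probability."""
    if harassment_prob >= AUTO_FLAG:
        return "flag"
    if harassment_prob >= REVIEW:
        return "human_review"
    return "pass"

triage(0.92)  # → "flag"
triage(0.55)  # → "human_review"
triage(0.10)  # → "pass"
```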
## Dependencies
```bash
pip install sentence-transformers scikit-learn huggingface_hub
```