WhiterBB
/

multilingual-hatespeech-detection

multilingual-hate-speech


WhiterBB committed on Jun 26, 2025

Commit 5a3bbbb · 1 Parent(s): 2bb6eff

Update: README.md


README.md +16 -4


@@ -28,10 +28,22 @@ It returns a binary classification (`hate` or `not hate`) with a probability sco
 ## 📊 Training Data
-- **HateXplain** (English)
-- **HateCheck** (English)
-- **Multilingual hate speech datasets** from Hugging Face (Spanish, French)
-- Preprocessing: cleaned, balanced, and tokenized using XLM-R tokenizer
 ## 🔎 How to use

 ## 📊 Training Data
+### 🧠 Training Data
+This model was fine-tuned on a **custom multilingual dataset** composed of selected and preprocessed samples from **multiple public corpora** and **custom-curated sets**. The training set was carefully constructed to achieve **language balance** and mitigate **demographic bias** in hate speech detection.
+| Source Dataset | Language(s) | Description |
+|----------------|-------------|-------------|
+| [`manueltonneau/spanish-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/spanish-hate-speech-superset) | Spanish 🇪🇸 | Aggregated Spanish hate speech datasets. |
+| [`manueltonneau/english-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/english-hate-speech-superset) | English 🇬🇧 | Extensive superset with over 300k samples from English corpora. |
+| [`manueltonneau/french-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/french-hate-speech-superset) | French 🇫🇷 | Curated superset from multiple French datasets. |
+| `HateCheck` | English (original) + Spanish + French 🌐 | Translated into Spanish and French to test multilingual generalization and error cases. |
+| `Custom Bias Correction Dataset` | Multilingual 🌍 | Designed to mitigate gender, racial, and cultural bias in predictions. |
+> 🧩 The final dataset consists of **~60,000 balanced samples**, with **comparable representation across Spanish, English, and French**, so that no single language dominates training.
+The balancing process involved **sampling** and **filtering** the larger source datasets and **unifying their labels**. The result is a compact, diverse, and inclusive dataset intended to generalize across cultures and languages while avoiding common pitfalls in hate speech modeling, such as language imbalance and demographic bias.
 ## 🔎 How to use
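
The added README text describes the balancing process as sampling, filtering, and label unification, but does not show how it was done. The following is a minimal sketch of what such a step could look like, not the author's actual pipeline. The record fields (`text`, `lang`, `label`), the `LABEL_MAP` entries, and the `per_cell` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical mapping from source-specific labels to the binary scheme
# (`hate` / `not hate`) that the model card says the model returns.
LABEL_MAP = {
    "hateful": "hate", "hate": "hate", "1": "hate",
    "non-hateful": "not hate", "normal": "not hate", "0": "not hate",
}


def unify_label(raw):
    """Return 'hate' or 'not hate', or None when the label is unknown."""
    return LABEL_MAP.get(str(raw).strip().lower())


def balance(samples, per_cell, seed=0):
    """Filter, unify labels, and downsample per (language, label) cell.

    Each cell keeps at most `per_cell` rows, so no language or class
    can contribute more than that number of samples.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for s in samples:
        label = unify_label(s["label"])
        text = s["text"].strip()
        # Filtering: drop rows with unknown labels or empty text.
        if label is None or not text:
            continue
        cells[(s["lang"], label)].append(
            {"text": text, "lang": s["lang"], "label": label}
        )
    balanced = []
    for key in sorted(cells):  # sorted for a reproducible order
        rows = cells[key]
        balanced.extend(rng.sample(rows, min(per_cell, len(rows))))
    rng.shuffle(balanced)
    return balanced
```

Capping every (language, label) cell at the same size is one simple way to get comparable representation across languages. A cell with fewer rows than `per_cell` stays smaller, so the cap should be chosen from the smallest source.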