WhiterBB committed on
Commit
5a3bbbb
·
1 Parent(s): 2bb6eff

Update: README.md

Files changed (1)
  1. README.md +16 -4
README.md CHANGED
@@ -28,10 +28,22 @@ It returns a binary classification (`hate` or `not hate`) with a probability sco
 
  ## 📊 Training Data
 
- - **HateXplain** (English)
- - **HateCheck** (English)
- - **Multilingual hate speech datasets** from Hugging Face (Spanish, French)
- - Preprocessing: cleaned, balanced, and tokenized using XLM-R tokenizer
+ ### 🧠 Training Data
+
+ This model was fine-tuned on a **custom multilingual dataset** composed of selected and preprocessed samples from **multiple public corpora** and **custom-curated sets**. The training set was carefully constructed to achieve **language balance** and mitigate **demographic bias** in hate speech detection.
+
+ | Source Dataset | Language(s) | Description |
+ |----------------|-------------|-------------|
+ | [`manueltonneau/spanish-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/spanish-hate-speech-superset) | Spanish 🇪🇸 | Aggregated Spanish hate speech datasets. |
+ | [`manueltonneau/english-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/english-hate-speech-superset) | English 🇬🇧 | Extensive superset with over 300k samples from English corpora. |
+ | [`manueltonneau/french-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/french-hate-speech-superset) | French 🇫🇷 | Curated superset from multiple French datasets. |
+ | `HateCheck` | English (original) + Spanish + French 🌐 | Translated into Spanish and French to test multilingual generalization and error cases. |
+ | `Custom Bias Correction Dataset` | Multilingual 🌍 | Designed to mitigate gender, racial, and cultural bias in predictions. |
+
+ > 🧩 The final dataset consists of **~60,000 balanced samples**, with **comparable representation across Spanish, English, and French**, ensuring no language dominates the training phase.
+
+ This balancing process involved **sampling**, **filtering**, and **label unification** from larger sources. The result is a compact, diverse, and inclusive dataset designed to generalize across cultures and languages while avoiding common pitfalls in hate speech modeling.
+
 
  ## 🔎 How to use
 
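
The README text added in this commit names three steps (sampling, filtering, label unification) but the commit does not include code for them. The block below is a minimal sketch of how such steps could look, not the author's actual pipeline. The field names (`text`, `lang`, `label`), the entries in `LABEL_MAP`, and the helpers `unify_and_filter` and `balance` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical mapping from source-specific labels to the binary scheme
# (`hate` / `not hate`) described in the README. Real corpora may use
# different label names; these keys are illustrative only.
LABEL_MAP = {
    "hateful": "hate",
    "hate": "hate",
    "1": "hate",
    "non-hateful": "not hate",
    "not hate": "not hate",
    "0": "not hate",
}


def unify_and_filter(rows):
    """Map raw labels to the binary scheme; drop empty texts and unknown labels."""
    cleaned = []
    for row in rows:
        text = row["text"].strip()
        label = LABEL_MAP.get(str(row["label"]).strip().lower())
        if text and label is not None:
            cleaned.append({"text": text, "lang": row["lang"], "label": label})
    return cleaned


def balance(rows, per_group, seed=0):
    """Sample up to `per_group` rows for every (language, label) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["lang"], row["label"])].append(row)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for key in sorted(groups):
        members = groups[key]
        balanced.extend(rng.sample(members, min(per_group, len(members))))
    return balanced


if __name__ == "__main__":
    raw = [
        {"text": "example one", "lang": "en", "label": "hateful"},
        {"text": "example two", "lang": "en", "label": 0},
        {"text": "ejemplo uno", "lang": "es", "label": "1"},
        {"text": "exemple un", "lang": "fr", "label": "non-hateful"},
        {"text": "   ", "lang": "fr", "label": "hateful"},  # dropped: empty text
    ]
    for row in balance(unify_and_filter(raw), per_group=1):
        print(row["lang"], row["label"], row["text"])
```

Capping each (language, label) group at the same size is one simple way to keep any single language or class from dominating. A group smaller than the cap is kept whole, so the result is only approximately balanced when some sources are small.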