Update: README.md
README.md CHANGED
@@ -28,10 +28,22 @@ It returns a binary classification (`hate` or `not hate`) with a probability sco

## Training Data

-
-
-- **
-
+### 🧠 Training Data
+
+This model was fine-tuned on a **custom multilingual dataset** composed of selected and preprocessed samples from **multiple public corpora** and **custom-curated sets**. The training set was carefully constructed to achieve **language balance** and mitigate **demographic bias** in hate speech detection.
+
+| Source Dataset | Language(s) | Description |
+|----------------|-------------|-------------|
+| [`manueltonneau/spanish-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/spanish-hate-speech-superset) | Spanish 🇪🇸 | Aggregated Spanish hate speech datasets. |
+| [`manueltonneau/english-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/english-hate-speech-superset) | English 🇬🇧 | Extensive superset with over 300k samples from English corpora. |
+| [`manueltonneau/french-hate-speech-superset`](https://huggingface.co/datasets/manueltonneau/french-hate-speech-superset) | French 🇫🇷 | Curated superset from multiple French datasets. |
+| `HateCheck` | English (original) + Spanish + French | Translated into Spanish and French to test multilingual generalization and error cases. |
+| `Custom Bias Correction Dataset` | Multilingual | Designed to mitigate gender, racial, and cultural bias in predictions. |
+
+> 🧩 The final dataset consists of **~60,000 balanced samples**, with **comparable representation across Spanish, English, and French**, ensuring no language dominates the training phase.
+
+This balancing process involved **sampling**, **filtering**, and **label unification** from larger sources. The result is a compact, diverse, and inclusive dataset designed to generalize across cultures and languages while avoiding common pitfalls in hate speech modeling.
+

## How to use

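The added README text names three balancing steps (sampling, filtering, label unification) without showing how they were carried out. The following is a minimal sketch of one way such a pipeline could look, not the author's actual code. The record fields `text`, `lang`, and `label`, the `LABEL_MAP` vocabulary, and the helper names `unify_label` and `build_balanced_set` are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical label vocabulary: the real label names used by the source
# corpora are not stated in the README, so these are assumptions.
LABEL_MAP = {"hate": 1, "hateful": 1, "1": 1,
             "not hate": 0, "non-hateful": 0, "0": 0}


def unify_label(raw):
    """Map a source-specific label to 1 (hate) or 0 (not hate); None if unknown."""
    return LABEL_MAP.get(str(raw).strip().lower())


def build_balanced_set(records, per_group, seed=0):
    """Filter records, unify labels, then sample equally per (language, label)."""
    groups = defaultdict(list)
    seen = set()
    for rec in records:
        text = rec["text"].strip()
        label = unify_label(rec["label"])
        key = (rec["lang"], text)
        # Filtering: drop empty texts, unmapped labels, and exact duplicates.
        if not text or label is None or key in seen:
            continue
        seen.add(key)
        groups[(rec["lang"], label)].append(
            {"text": text, "lang": rec["lang"], "label": label}
        )

    rng = random.Random(seed)
    balanced = []
    for group_key in sorted(groups):
        items = groups[group_key]
        # Sampling: cap every group at per_group so no language or class dominates.
        balanced.extend(rng.sample(items, min(per_group, len(items))))
    rng.shuffle(balanced)
    return balanced


# Toy input standing in for the merged source corpora (assumed data).
records = []
for lang in ("es", "en", "fr"):
    for i in range(5):
        records.append({"text": f"{lang} hateful example {i}", "lang": lang, "label": "Hateful"})
    for i in range(3):
        records.append({"text": f"{lang} neutral example {i}", "lang": lang, "label": "not hate"})
records.append({"text": "en hateful example 0", "lang": "en", "label": "hate"})  # duplicate
records.append({"text": "en unlabeled example", "lang": "en", "label": "offensive"})  # unmapped

balanced = build_balanced_set(records, per_group=2)
```

With `per_group=2`, the toy input yields two samples for each language and class pair. Sampling per (language, label) group rather than per language alone is a design choice made here so that class balance is kept within each language as well; the README does not say whether the original dataset was balanced this way.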