---
language:
- fr
tags:
- camembert
- chti
- sentiment-analysis
- text-classification
- fine-tuning
license: mit
datasets:
- custom
model-index:
- name: NorBERT_Chti
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      name: Chti Synthetic Dataset
      type: custom
    metrics:
    - type: accuracy
      value: 1.0
    - type: f1
      name: macro-F1
      value: 1.0
---

🌐 **Looking for the English version?** Just scroll down — it's right below! 👇

👉 [Jump to English version](#english-version)

# NorBERT — Chti Sentiment (camembert-base)

## 🇫🇷 Description (Français)

**NorBERT** (= « Nord » + « BERT » !) est une version fine-tunée de `camembert-base` pour l'analyse de sentiments en Chti (langue régionale du Nord de la France).
La tâche cible est la classification de séquence en **trois classes** :
- `negatif`
- `neutre`
- `positif`

### 🔧 Protocole expérimental

- **Dataset** : jeu de données artificiel construit spécifiquement pour ce projet (phrases en Chti avec annotation sentimentale).
  - Taille : 167 exemples par classe (501 au total).
  - Split : train / validation équilibré (2 fichiers CSV).
  - Colonnes : `classe` (label), `phrase_chtimi` (texte).

- **Prétraitement** :
  - Normalisation minimale des labels (`positif`, `neutre`, `negatif`).
  - Tokenisation avec `camembert-base` (max_length=256, truncation).
  - Gestion optionnelle du déséquilibre par pondération de la loss (ici inutile car le dataset est équilibré).

- **Entraînement** :
  - Backbone : `camembert-base`
  - Fine-tuning complet (pas de LoRA) sur Google Colab Pro (GPU).
  - Optimiseur : AdamW, learning_rate = 2e-5
  - Batch size : 16 (train) / 32 (eval)
  - Epochs : 5 (early stopping, patience = 3)
  - Loss : CrossEntropyLoss pondérée (robuste pour les datasets déséquilibrés).
  - Évaluation : accuracy, F1 macro, précision et rappel macro.

- **Évaluation (validation set)** :
  - Accuracy : **1.0**
  - F1 macro : **1.0**
  - Matrice de confusion parfaite

⚠️ Résultats probablement biaisés par la proximité entre train et validation → le modèle doit être testé sur des phrases inédites pour valider sa généralisation.

**Explication** : les jeux de données sont synthétiques, générés en combinant des mots et des phrases issus de listes prédéfinies ; train et validation se ressemblent donc fortement.
Ce modèle est un hommage poétique et culturel ainsi qu'une démonstration de compétences pour un portfolio, et bien entendu pas un produit à usage commercial.
D'autres modèles, beaucoup plus proches d'un usage professionnel, sont visibles sur mon GitHub ou sur mon profil Hugging Face (irrigation, immobilier).

- **Publication** : modèle et tokenizer poussés sur Hugging Face avec `trainer.push_to_hub()` et `tokenizer.push_to_hub()`.

---

📘 Découvrez mes **40 projets IA et sciences STEM** ici :
👉 [github.com/Jerome-openclassroom](https://github.com/Jerome-openclassroom)

---

<a name="english-version"></a>

# 🌍 English Version

## Description

**NorBERT** is a fine-tuned version of `camembert-base` for sentiment analysis in **Chti**, a regional language from Northern France.
The task is **sequence classification** with three labels:
- `negatif`
- `neutre`
- `positif`

### 🔧 Experimental protocol

- **Dataset**: synthetic dataset created for this project (Chti sentences annotated with sentiment).
  - Size: 167 examples per class (501 total).
  - Balanced train/validation split (CSV files).
  - Columns: `classe` (label), `phrase_chtimi` (text).

- **Preprocessing**:
  - Label normalization (`positif`, `neutre`, `negatif`).
  - Tokenization with `camembert-base` (max_length=256, truncation).
  - Optional class-weighted loss (not needed here, since the dataset is balanced).

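As an illustration, the label-normalization step described above can be sketched with the standard library alone; the `normalize_label` helper, its accent handling, and the example strings are assumptions for this sketch, not the project's actual code:

```python
import unicodedata

# Canonical labels used by the model, mapped to integer ids.
LABELS = ["negatif", "neutre", "positif"]
LABEL2ID = {lab: i for i, lab in enumerate(LABELS)}

def normalize_label(raw: str) -> str:
    """Lowercase, strip whitespace and accents, so 'Négatif ' -> 'negatif'."""
    s = unicodedata.normalize("NFD", raw.strip().lower())
    return "".join(c for c in s if unicodedata.category(c) != "Mn")

print([LABEL2ID[normalize_label(x)] for x in ["Négatif", "neutre ", "POSITIF"]])  # [0, 1, 2]
```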
- **Training**:
  - Backbone: `camembert-base`
  - Full fine-tuning (no LoRA) on Google Colab Pro (GPU).
  - Optimizer: AdamW, learning_rate = 2e-5
  - Batch size: 16 (train) / 32 (eval)
  - Epochs: 5 (early stopping, patience = 3)
  - Loss: weighted CrossEntropyLoss (robust to class imbalance).
  - Metrics: accuracy, macro F1, macro precision/recall.

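Class weights for a weighted CrossEntropyLoss are typically derived from inverse label frequencies; a minimal stdlib sketch (the `class_weights` helper and the toy label list are hypothetical, and on a balanced dataset such as this one all weights come out equal):

```python
from collections import Counter

def class_weights(labels, classes):
    """Inverse-frequency weights: total / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (len(classes) * counts[c]) for c in classes]

# Balanced toy labels (the real dataset has 167 per class): every weight is 1.0.
labels = ["negatif"] * 4 + ["neutre"] * 4 + ["positif"] * 4
print(class_weights(labels, ["negatif", "neutre", "positif"]))  # [1.0, 1.0, 1.0]
```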
- **Evaluation (validation set)**:
  - Accuracy: **1.0**
  - F1-macro: **1.0**
  - Perfect confusion matrix

⚠️ These scores are likely overestimated due to the similarity between train and validation → the model still needs to be tested on unseen sentences to confirm generalization.

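For reference, the reported accuracy and macro F1 can be recomputed from a confusion matrix; a self-contained sketch (the `macro_metrics` helper and the example matrix are illustrative, not the project's evaluation code):

```python
def macro_metrics(cm):
    """cm[i][j] = count of true class i predicted as class j. Returns (accuracy, macro F1)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    f1s = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # column sum minus diagonal
        fn = sum(cm[i]) - tp                        # row sum minus diagonal
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return accuracy, sum(f1s) / n

# A perfect (purely diagonal) confusion matrix gives accuracy = macro F1 = 1.0.
print(macro_metrics([[10, 0, 0], [0, 10, 0], [0, 0, 10]]))  # (1.0, 1.0)
```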
**Explanation**: the datasets are synthetic, generated by combining words and sentences from predefined lists, so the train and validation sets are very similar.
This model is a poetic and cultural tribute, as well as a demonstration of technical skills for a portfolio, and of course not a product intended for commercial use.
Other models much closer to professional applications can be found on my GitHub or on my Hugging Face profile (e.g., irrigation, real estate).

- **Publication**: model and tokenizer pushed to Hugging Face using `trainer.push_to_hub()` and `tokenizer.push_to_hub()`.

---

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jeromex1/NorBERT_Chti"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForSequenceClassification.from_pretrained(repo)

txt = "In est fier de le marché du dimanche"
enc = tok(txt, return_tensors="pt")
with torch.no_grad():
    probs = mdl(**enc).logits.softmax(-1).squeeze()

print({mdl.config.id2label[i]: float(probs[i]) for i in range(len(probs))})
```

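To reduce the probability dict printed by the usage snippet to a single predicted label, a tiny stdlib helper is enough (the helper name and the example probabilities below are hypothetical, not actual model output):

```python
def top_label(probs: dict) -> str:
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)

probs = {"negatif": 0.03, "neutre": 0.12, "positif": 0.85}  # hypothetical output
print(top_label(probs))  # positif
```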
---

📘 Discover my **40 AI and STEM science projects** here:
👉 [github.com/Jerome-openclassroom](https://github.com/Jerome-openclassroom)

---