🤖 XLM-RoBERTa — Multilingual Intent Classification (Customer Support)
📋 Description
This model is XLM-RoBERTa base fine-tuned on a multilingual dataset (English, French, Arabic) of customer support intents. It recognizes 27 intents expressed in any of the three languages and serves as the core classification engine of an AI voice assistant for after-sales customer service (SAV, from the French "service après-vente").
🏆 Performance
| Metric | Epoch 1 | Epoch 2 | Epoch 3 |
|---|---|---|---|
| Accuracy | 99.73% | 99.83% | 99.88% |
| F1 Score (macro) | 0.9973 | 0.9983 | 0.9988 |
| Precision (macro) | 0.9958 | 0.9975 | 0.9979 |
| Recall (macro) | 0.9958 | 0.9974 | 0.9979 |
| Validation Loss | 0.0290 | 0.0142 | 0.0119 |
The model reaches 99.88% accuracy on the validation split after 3 epochs. Validation loss decreases at every epoch (from 0.0290 to 0.0119), so there is no sign of overfitting within this run.
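The macro-averaged scores in the table give each of the 27 classes equal weight, regardless of how many examples it has. Here is a minimal sketch of that averaging in plain Python. The function name `macro_scores` and the toy labels are illustrative and not taken from the training code, and macro F1 is assumed to be the mean of per-class F1 scores.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over classes with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with 3 classes (not real model output)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
precision, recall, f1 = macro_scores(y_true, y_pred)
```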
🗂️ Dataset
| Parameter | Value |
|---|---|
| Source | Bitext Customer Support Dataset (27K) |
| Total size | ~80,616 examples (after multilingual extension) |
| Languages | English 🇬🇧 · French 🇫🇷 · Arabic 🇸🇦 |
| Classes (intents) | 27 |
| Training split | 70% — 56,431 examples |
| Validation split | 15% — 12,092 examples |
| Test split | 15% — 12,093 examples |
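The split sizes in the table follow from a 70/15/15 split of 80,616 rows, with the test split taking the remainder. The sketch below reproduces that arithmetic and shows one way to draw the splits. The helper names, the shuffling, and the seed are assumptions; the card does not say whether the original split was seeded or stratified.

```python
import random

def split_sizes(n, train=0.70, val=0.15):
    """Sizes for train/val/test; the test split absorbs the rounding remainder."""
    n_train = int(n * train)
    n_val = int(n * val)
    return n_train, n_val, n - n_train - n_val

def split_indices(n, seed=42):
    """Shuffle row indices reproducibly and cut them into three disjoint splits."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train, n_val, _ = split_sizes(n)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )

sizes = split_sizes(80_616)
train_idx, val_idx, test_idx = split_indices(80_616)
```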
🌍 Multilingual Extension
The original source dataset is the Bitext Customer Support Dataset in English (~26,872 examples). It was extended to French and Arabic using a custom-built automated translation pipeline:
- Translation: `deep_translator` library (`GoogleTranslator`) with rate limiting (`time.sleep(0.05)` between rows) and a fallback mechanism (`try/except`) to preserve data integrity if a translation fails.
- Generated files:
  - `cleaned_bitext_dataset.csv`: cleaned English dataset (base)
  - `dataset_fr_complete.csv`: French version (~26,872 examples)
  - `dataset_ar_complete.csv`: Arabic version (~26,872 examples)
  - `dataset_multilingual_complete.csv`: final unified corpus (~80,616 examples)
- Quality control: volume check before merging (fewer than 100 rows of difference allowed between the source and each translation).
- `language` column added automatically (`en`, `fr`, `ar`) for traceability.
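The translation step can be sketched as a loop with a pause between rows and a `try/except` fallback. This is an illustration, not the author's pipeline code. The function name `translate_rows` is hypothetical, and keeping the source text when a translation fails is an assumption about what the fallback does.

```python
import time

def translate_rows(rows, translate, delay=0.05):
    """Translate each text in order; on failure keep the source text so no row is lost."""
    translated = []
    for text in rows:
        try:
            translated.append(translate(text))
        except Exception:
            # Assumed fallback: keep the untranslated text to preserve row count
            translated.append(text)
        time.sleep(delay)  # simple rate limiting between requests
    return translated

# Stand-in translator for demonstration; it fails on one input on purpose
def fake_translate(text):
    if text == "boom":
        raise RuntimeError("simulated API failure")
    return text.upper()

result = translate_rows(["cancel my order", "boom"], fake_translate, delay=0)
```

With `deep_translator` installed, the `translate` argument could be `GoogleTranslator(source="en", target="fr").translate`.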
The final corpus of ~80,616 examples (3 languages × ~26,872) was split as described above and used to fine-tune the model, so every intent is seen in all three languages during training.
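The volume check and the `language` column can be sketched together as a merge step. The function name `merge_with_language` and the row fields `text` and `intent` are hypothetical; only the tolerance of fewer than 100 rows and the `en`/`fr`/`ar` codes come from the description above.

```python
def merge_with_language(datasets, reference="en", tolerance=100):
    """Merge per-language row lists, tagging each row with its language code.

    Raises ValueError if a translation differs from the reference dataset
    by `tolerance` rows or more.
    """
    ref_len = len(datasets[reference])
    merged = []
    for lang, rows in datasets.items():
        gap = abs(len(rows) - ref_len)
        if gap >= tolerance:
            raise ValueError(f"{lang}: {gap} rows of difference with {reference}")
        # Copy each row and add the language column for traceability
        merged.extend({**row, "language": lang} for row in rows)
    return merged

# Tiny illustrative datasets (field names are assumptions)
datasets = {
    "en": [{"text": "I want to cancel my order", "intent": "cancel_order"}],
    "fr": [{"text": "Je veux annuler ma commande", "intent": "cancel_order"}],
    "ar": [{"text": "أريد إلغاء طلبي", "intent": "cancel_order"}],
}
corpus = merge_with_language(datasets)
```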
🎯 Supported Intents (27 classes)
| Category | Intents |
|---|---|
| ACCOUNT | create_account, delete_account, edit_account, switch_account, recover_password, registration_problems |
| ORDER | place_order, cancel_order, change_order, track_order |
| SHIPPING | delivery_options, delivery_period, change_shipping_address, set_up_shipping_address |
| REFUND | get_refund, track_refund, check_refund_policy, check_cancellation_fee |
| INVOICE | get_invoice, check_invoice |
| PAYMENT | check_payment_methods, payment_issue |
| CONTACT | contact_customer_service, contact_human_agent |
| OTHER | complaint, review, newsletter_subscription |
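For downstream routing, the table above can be turned into a lookup from predicted intent to category. This is a small sketch; the names `CATEGORY_INTENTS` and `category_of` are illustrative and not part of the model repository.

```python
# Category -> intents, copied from the table above
CATEGORY_INTENTS = {
    "ACCOUNT": ["create_account", "delete_account", "edit_account",
                "switch_account", "recover_password", "registration_problems"],
    "ORDER": ["place_order", "cancel_order", "change_order", "track_order"],
    "SHIPPING": ["delivery_options", "delivery_period",
                 "change_shipping_address", "set_up_shipping_address"],
    "REFUND": ["get_refund", "track_refund", "check_refund_policy",
               "check_cancellation_fee"],
    "INVOICE": ["get_invoice", "check_invoice"],
    "PAYMENT": ["check_payment_methods", "payment_issue"],
    "CONTACT": ["contact_customer_service", "contact_human_agent"],
    "OTHER": ["complaint", "review", "newsletter_subscription"],
}

# Inverted index: intent -> category
INTENT_TO_CATEGORY = {
    intent: category
    for category, intents in CATEGORY_INTENTS.items()
    for intent in intents
}

def category_of(intent):
    """Return the category of a predicted intent, or None if it is unknown."""
    return INTENT_TO_CATEGORY.get(intent)
```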
⚙️ Training Parameters
| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Learning rate | 2e-5 |
| Batch size | 8 per GPU |
| Epochs | 3 |
| Max length | 128 tokens |
| Weight decay | 0.01 |
| Precision | fp16 (half-precision) |
| Environment | Kaggle — GPU T4 x2 |
| Training duration | ~88 minutes |
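To give a sense of scale, the sketch below derives the number of optimizer steps implied by these settings. It assumes both T4 GPUs were used in data-parallel mode (effective batch of 16) and that there was no gradient accumulation; neither is stated in the card. The dictionary keys are named after `transformers.TrainingArguments` parameters, but the dictionary itself is illustrative, not the author's configuration.

```python
import math

# Values from the table above
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "fp16": True,
}
MAX_LENGTH = 128        # tokenizer truncation length, not a training argument
N_GPUS = 2              # assumption: both T4 GPUs take part in training
N_TRAIN = 56_431        # size of the training split

effective_batch = training_kwargs["per_device_train_batch_size"] * N_GPUS
steps_per_epoch = math.ceil(N_TRAIN / effective_batch)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
```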
🚀 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle
from huggingface_hub import hf_hub_download

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("aablaess/SAV-xlm-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("aablaess/SAV-xlm-Roberta")
model.eval()  # inference mode (dropout disabled)

# Load label encoder (maps class index <-> intent name).
# It exposes inverse_transform, so it is presumably a scikit-learn
# LabelEncoder; in that case scikit-learn must be installed to unpickle it.
label_encoder_path = hf_hub_download(
    repo_id="aablaess/SAV-xlm-Roberta",
    filename="label_encoder.pkl"
)
# Unpickling can execute arbitrary code: only load files from a trusted repo
with open(label_encoder_path, "rb") as f:
    label_encoder = pickle.load(f)

# Predict intent
def predict_intent(text):
    """Return (intent name, softmax confidence) for a single text."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1).item()
    intent = label_encoder.inverse_transform([predicted_class])[0]
    confidence = probs.max().item()
    return intent, confidence

# Multilingual examples
texts = [
    "I want to cancel my order",    # English
    "Je veux annuler ma commande",  # French
    "أريد إلغاء طلبي"               # Arabic
]
for text in texts:
    intent, confidence = predict_intent(text)
    print(f"Text   : {text}")
    print(f"Intent : {intent} (confidence: {confidence:.2%})\n")
```
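The confidence returned by `predict_intent` is the highest softmax probability. In a voice assistant it can be used to hand low-confidence requests to a person. The sketch below shows that idea on raw logits in plain Python. The function `route`, the 0.80 threshold, and the choice of `contact_human_agent` as fallback are assumptions for illustration, not part of the published model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, labels, threshold=0.80, fallback="contact_human_agent"):
    """Return (intent, confidence); use the fallback intent below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs[best]
    return labels[best], probs[best]

# Toy logits over three intents (not real model output)
labels = ["cancel_order", "track_order", "get_refund"]
confident = route([4.0, 0.0, 0.0], labels)
uncertain = route([0.2, 0.1, 0.0], labels)
```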
📁 Model Files
| File | Description |
|---|---|
| model.safetensors | Fine-tuned model weights (1.11 GB) |
| config.json | Model architecture (27 labels) |
| tokenizer.json | Vocabulary and tokenization rules |
| tokenizer_config.json | Tokenizer configuration |
| label_encoder.pkl | Index ↔ intent name mapping |
📄 License
This model is distributed under the MIT license.