🤖 XLM-RoBERTa — Multilingual Intent Classification (Customer Support)

📋 Description

This model is a fine-tuned version of xlm-roberta-base, trained on a multilingual dataset (English, French, Arabic) of customer-support intents. It recognizes 27 intents expressed in any of the three languages and serves as the core classification engine of an AI voice assistant for after-sales customer service (SAV, from the French "service après-vente").


🏆 Performance

| Metric | Epoch 1 | Epoch 2 | Epoch 3 |
| --- | --- | --- | --- |
| Accuracy | 99.73% | 99.83% | 99.88% |
| F1 Score (macro) | 0.9973 | 0.9983 | 0.9988 |
| Precision (macro) | 0.9958 | 0.9975 | 0.9979 |
| Recall (macro) | 0.9958 | 0.9974 | 0.9979 |
| Validation Loss | 0.0290 | 0.0142 | 0.0119 |

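The model reaches 99.88% accuracy on the validation split after 3 epochs, and validation loss decreases at every epoch, so there is no sign of overfitting within this run. The reported F1 values match accuracy and exceed both macro precision and macro recall at every epoch, which a macro-averaged F1 cannot do, so the F1 row was probably computed with a different averaging (for example, weighted).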
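
For readers who want to recompute the table, here is a minimal pure-Python sketch of macro-averaged precision, recall, and F1. The toy labels are made up for illustration and are not taken from the validation split, and the function name `macro_scores` is hypothetical.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over the labels in y_true."""
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example (made-up predictions, for illustration only)
y_true = ["cancel_order", "cancel_order", "track_order",
          "get_refund", "get_refund", "get_refund"]
y_pred = ["cancel_order", "track_order", "track_order",
          "get_refund", "get_refund", "cancel_order"]

precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because each per-class F1 is a harmonic mean, the macro F1 computed this way never exceeds the larger of macro precision and macro recall.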


🗂️ Dataset

| Parameter | Value |
| --- | --- |
| Source | Bitext Customer Support Dataset (27K) |
| Total size | ~80,616 examples (after multilingual extension) |
| Languages | English 🇬🇧 · French 🇫🇷 · Arabic 🇸🇦 |
| Classes (intents) | 27 |
| Training split | 70% (56,431 examples) |
| Validation split | 15% (12,092 examples) |
| Test split | 15% (12,093 examples) |
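
The split sizes follow from integer arithmetic on the 80,616 total. The sketch below reproduces them with a shuffled index split. The seed and the function name `split_indices` are hypothetical, and whether the original split was stratified by intent or language is not stated, so this version is a plain random split.

```python
import random

def split_indices(n_examples, train_pct=70, val_pct=15, seed=42):
    """Shuffle example indices and cut them into train/validation/test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption

    n_train = n_examples * train_pct // 100
    n_val = n_examples * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test

train_idx, val_idx, test_idx = split_indices(80_616)
print(len(train_idx), len(val_idx), len(test_idx))  # 56431 12092 12093
```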

🌍 Multilingual Extension

The original source dataset is the Bitext Customer Support Dataset in English (~26,872 examples). It was extended to French and Arabic using a custom-built automated translation pipeline:

  • Translation: the deep_translator library (GoogleTranslator), with rate-limit handling (time.sleep(0.05) between rows) and a try/except fallback so that a failed translation does not corrupt the dataset.
  • Generated files:
    • cleaned_bitext_dataset.csv: cleaned English dataset (base)
    • dataset_fr_complete.csv: French version (~26,872 examples)
    • dataset_ar_complete.csv: Arabic version (~26,872 examples)
    • dataset_multilingual_complete.csv: final unified corpus (~80,616 examples)
  • Quality control: before merging, the row count of each translated file is checked against the source and must differ by fewer than 100 rows.
  • A language column (en, fr, ar) is added automatically for traceability.

The final corpus of ~80,616 examples (3 languages × ~26,872) is the basis for the training, validation, and test splits, so the model sees every intent in all three languages during fine-tuning.
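
The translation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: the column names `instruction` and `intent`, the helper names, and the choice to keep the English text when a translation fails are all assumptions. The translator is passed in as a function so the sketch runs offline. With deep_translator it could be `GoogleTranslator(source="en", target="fr").translate`.

```python
import time

def translate_rows(rows, translate_fn, lang_code, delay=0.05):
    """Translate each row's text, pausing between calls to respect rate limits."""
    translated = []
    for row in rows:
        try:
            text = translate_fn(row["instruction"]) or row["instruction"]
        except Exception:
            # Assumed fallback: keep the source text so the row is not lost.
            text = row["instruction"]
        translated.append(
            {"instruction": text, "intent": row["intent"], "language": lang_code}
        )
        time.sleep(delay)
    return translated

def volume_ok(source_rows, translated_rows, tolerance=100):
    """Quality gate: row counts must differ by fewer than `tolerance` rows."""
    return abs(len(source_rows) - len(translated_rows)) < tolerance

def fake_translate(text):
    """Stand-in translator for the demo; fails on purpose for one input."""
    if "refund" in text:
        raise RuntimeError("simulated API failure")
    return "[fr] " + text

source = [
    {"instruction": "I want to cancel my order", "intent": "cancel_order"},
    {"instruction": "where is my refund", "intent": "track_refund"},
]
english = [dict(row, language="en") for row in source]
french = translate_rows(source, fake_translate, "fr", delay=0)

# Merge only when the volume check passes
corpus = english + french if volume_ok(source, french) else english
for row in corpus:
    print(row["language"], row["intent"], row["instruction"])
```

Keeping the intent label untouched while only the text is translated is what lets the three language versions share the same 27 classes.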


🎯 Supported Intents (27 classes)

| Category | Intents |
| --- | --- |
| ACCOUNT | create_account, delete_account, edit_account, switch_account, recover_password, registration_problems |
| ORDER | place_order, cancel_order, change_order, track_order |
| SHIPPING | delivery_options, delivery_period, change_shipping_address, set_up_shipping_address |
| REFUND | get_refund, track_refund, check_refund_policy, check_cancellation_fee |
| INVOICE | get_invoice, check_invoice |
| PAYMENT | check_payment_methods, payment_issue |
| CONTACT | contact_customer_service, contact_human_agent |
| OTHER | complaint, review, newsletter_subscription |
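
If downstream code needs to route a predicted intent by category, the table can be expressed as a lookup. This is a small sketch. The dictionary names are hypothetical and the category grouping is the one shown in the table, which is not necessarily stored with the model.

```python
INTENTS_BY_CATEGORY = {
    "ACCOUNT": ["create_account", "delete_account", "edit_account",
                "switch_account", "recover_password", "registration_problems"],
    "ORDER": ["place_order", "cancel_order", "change_order", "track_order"],
    "SHIPPING": ["delivery_options", "delivery_period",
                 "change_shipping_address", "set_up_shipping_address"],
    "REFUND": ["get_refund", "track_refund", "check_refund_policy",
               "check_cancellation_fee"],
    "INVOICE": ["get_invoice", "check_invoice"],
    "PAYMENT": ["check_payment_methods", "payment_issue"],
    "CONTACT": ["contact_customer_service", "contact_human_agent"],
    "OTHER": ["complaint", "review", "newsletter_subscription"],
}

# Reverse lookup: intent name -> category
INTENT_TO_CATEGORY = {
    intent: category
    for category, intents in INTENTS_BY_CATEGORY.items()
    for intent in intents
}

print(len(INTENT_TO_CATEGORY))             # 27
print(INTENT_TO_CATEGORY["track_refund"])  # REFUND
```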

⚙️ Training Parameters

| Parameter | Value |
| --- | --- |
| Base model | xlm-roberta-base |
| Learning rate | 2e-5 |
| Batch size | 8 per GPU |
| Epochs | 3 |
| Max length | 128 tokens |
| Weight decay | 0.01 |
| Precision | fp16 (half-precision) |
| Environment | Kaggle, 2× T4 GPUs |
| Training duration | ~88 minutes |
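
As a rough sketch, the table maps onto keyword arguments that the Hugging Face `TrainingArguments` class accepts, and the step count follows from the training-split size. The original training script is not shown, so this assumes no gradient accumulation and that both GPUs each process a batch of 8 per step.

```python
import math

# Keyword names match transformers.TrainingArguments fields;
# values come from the table above.
training_config = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "fp16": True,
}
MAX_LENGTH = 128   # applied when tokenizing, not a TrainingArguments field
N_GPUS = 2         # Kaggle T4 x2
N_TRAIN = 56_431   # training-split size from the Dataset section

# Assumption: no gradient accumulation, so one optimizer step sees 8 x 2 examples.
effective_batch = training_config["per_device_train_batch_size"] * N_GPUS
steps_per_epoch = math.ceil(N_TRAIN / effective_batch)
total_steps = steps_per_epoch * training_config["num_train_epochs"]

print(effective_batch, steps_per_epoch, total_steps)  # 16 3527 10581
```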

🚀 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle
from huggingface_hub import hf_hub_download

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("aablaess/SAV-xlm-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("aablaess/SAV-xlm-Roberta")

# Load label encoder
label_encoder_path = hf_hub_download(
    repo_id="aablaess/SAV-xlm-Roberta",
    filename="label_encoder.pkl"
)
with open(label_encoder_path, "rb") as f:
    label_encoder = pickle.load(f)

# Predict intent
def predict_intent(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128
    )
    with torch.no_grad():
        outputs = model(**inputs)

    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    intent = label_encoder.inverse_transform([predicted_class])[0]
    confidence = torch.softmax(outputs.logits, dim=1).max().item()

    return intent, confidence

# Multilingual examples
texts = [
    "I want to cancel my order",        # English
    "Je veux annuler ma commande",       # French
    "أريد إلغاء طلبي"                   # Arabic
]

for text in texts:
    intent, confidence = predict_intent(text)
    print(f"Text      : {text}")
    print(f"Intent    : {intent} (confidence: {confidence:.2%})\n")
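
The confidence returned by predict_intent is the largest softmax probability over the 27 logits. The pure-Python sketch below shows that computation on made-up logits and also returns the runners-up, which can help when the top score is low. The `top_k` helper and the three-label example are hypothetical and do not call the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits for three intents (the real model outputs 27 values)
labels = ["cancel_order", "change_order", "track_order"]
logits = [4.0, 1.0, 0.5]

ranked = top_k(logits, labels, k=2)
for label, prob in ranked:
    print(f"{label}: {prob:.2%}")
```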

📁 Model Files

| File | Description |
| --- | --- |
| model.safetensors | Fine-tuned model weights (1.11 GB) |
| config.json | Model architecture (27 labels) |
| tokenizer.json | Vocabulary and tokenization rules |
| tokenizer_config.json | Tokenizer configuration |
| label_encoder.pkl | Index ↔ intent name mapping |

📄 License

This model is distributed under the MIT license.
