🛍️ XLM-RoBERTa E-Commerce Product Classifier

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for multi-class product classification in e-commerce applications. It classifies product descriptions into 32 distinct categories and reaches 90.1% accuracy on the test set.

The model was trained on ~32,000 synthetic English product descriptions covering major e-commerce categories, featuring realistic seller variations (professional retailers, individual sellers, resellers, and minimal listings).

Key Features

  • 32 product categories covering major e-commerce segments
  • 90.1% test accuracy (88.6% macro F1); per-category F1 ranges from 70.5% to 99.7%
  • Robust to real-world variations: handles typos, abbreviations, casual language
  • Fast inference: ~50-100 samples/second on CPU, 200+ on GPU
  • Production-ready: trained with best practices, comprehensive evaluation

Intended Use

Primary Use Cases

  • E-commerce platforms: Automatic product categorization for listings
  • Marketplaces: Category suggestion for sellers
  • Search & recommendation: Improve product discovery and filtering
  • Content moderation: Detect miscategorized products
  • Data quality: Clean and standardize product catalogs
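The miscategorization use case can be sketched as a comparison between the seller's chosen category and the model's top prediction. This is a minimal illustration, not part of the released model: `flag_miscategorized`, the `min_score` threshold of 0.9, and the stand-in `fake_predict` function are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def flag_miscategorized(
    listings: List[Dict],
    predict: Callable[[str], Dict],
    min_score: float = 0.9,  # assumed threshold, tune on real data
) -> List[Dict]:
    """Return listings whose stated category disagrees with a confident prediction."""
    flagged = []
    for item in listings:
        top = predict(item["text"])  # expected shape: {"label": str, "score": float}
        if top["label"] != item["category"] and top["score"] >= min_score:
            flagged.append({**item, "suggested": top["label"], "score": top["score"]})
    return flagged


def fake_predict(text: str) -> Dict:
    """Stand-in for the real classifier so the sketch runs without downloading a model."""
    if "headphones" in text.lower():
        return {"label": "electronics", "score": 0.97}
    return {"label": "books_media", "score": 0.55}


listings = [
    {"id": 1, "text": "Sony wireless headphones", "category": "toys_games"},
    {"id": 2, "text": "Paperback mystery novel", "category": "grocery_food"},
]
flagged = flag_miscategorized(listings, fake_predict)
print(flagged)
```

With the real pipeline, `predict` could be something like `lambda t: classifier(t)[0]`. Low-confidence disagreements (listing 2 here) are deliberately left alone, since the weaker categories in this card are often confused with their neighbours.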

Out-of-Scope Use

  • ❌ Non-English product descriptions (model trained on English only)
  • ❌ Fine-grained product attributes (color, size, brand) - use attribute extraction models
  • ❌ Product images - use vision models instead
  • ❌ Categories outside the 32 predefined classes

Performance

Test Set Results

| Metric | Score |
|---|---|
| Accuracy | 90.14% |
| F1 Score (Weighted) | 90.00% |
| F1 Score (Macro) | 88.55% |
| Precision (Weighted) | 90.34% |
| Recall (Weighted) | 90.14% |

Training Dynamics

Training Curves:

The model demonstrates excellent convergence with:

  • Training loss: Smooth decrease from 3.5 to 0.3
  • Validation loss: Stable at ~0.36 (no overfitting)
  • F1 Score: Steady improvement from 0.75 to 0.90+ over 3 epochs

Per-Category Performance

Top Performing Categories (six highest F1 scores, all above 0.97):

| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| pet_supplies | 100.0% | 99.3% | 99.7% | 151 |
| bedding_bath | 98.7% | 98.7% | 98.7% | 151 |
| baby_maternity | 96.8% | 99.3% | 98.0% | 151 |
| home_decor_lighting | 96.8% | 99.3% | 98.0% | 151 |
| books_media | 97.4% | 98.0% | 97.7% | 150 |
| grocery_food | 98.0% | 96.7% | 97.3% | 150 |

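Categories Needing Improvement (F1 < 0.80):

| Category | Precision | Recall | F1-Score | Issue |
|---|---|---|---|---|
| fashion_accessories | 71.4% | 69.5% | 70.5% | Overlaps with fashion_clothing |
| electronics | 79.0% | 75.7% | 77.3% | Confused with computers_networking |
| small_appliances | 85.5% | 70.2% | 77.1% | Confused with large_appliances, kitchen_dining |
| shoes_footwear | 67.5% | 89.7% | 77.1% | High recall, low precision |

In the Categories section, sports_outdoors (71.0%) and tools_hardware (78.9%) also fall below 0.80, but no precision/recall breakdown is given for them.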
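The F1 column is the harmonic mean of precision and recall, so a category with lopsided precision and recall (such as shoes_footwear) is pulled toward the weaker of the two. A small sketch that recomputes F1 from the precision and recall values in the table; the function name is illustrative, and small differences from the published F1 figures are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above, as fractions.
rows = {
    "fashion_accessories": (0.714, 0.695),
    "electronics": (0.790, 0.757),
    "small_appliances": (0.855, 0.702),
    "shoes_footwear": (0.675, 0.897),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```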

Confusion Matrix Analysis

Key Observations:

  1. Strong diagonal: Most categories classified correctly (dark blue diagonal)
  2. Minimal confusion: Very few off-diagonal cells (light blue)
  3. Related categories show expected overlap:
    • electronics ↔ computers_networking (related domains)
    • fashion_accessories ↔ fashion_clothing (semantic overlap)
    • small_appliances ↔ kitchen_dining (contextual similarity)

Confusion patterns make semantic sense - errors occur between genuinely similar categories.

Training Details

Training Data

Training Procedure

Hyperparameters:

{
  "model": "FacebookAI/xlm-roberta-base",
  "num_labels": 32,
  "max_length": 256,
  "batch_size": 16,
  "learning_rate": 2e-5,
  "num_epochs": 3,
  "warmup_ratio": 0.1,
  "weight_decay": 0.01,
  "optimizer": "AdamW",
  "lr_scheduler": "linear with warmup"
}

Training Environment:

  • Hardware: Google Colab T4 GPU (16GB VRAM)
  • Training time: ~45 minutes
  • Mixed precision: FP16 for faster training
  • Total steps: ~4,200
  • Evaluation frequency: Every 500 steps
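To make the "linear with warmup" schedule concrete, the sketch below computes the learning rate at a given step from the hyperparameters above. It is an illustration of the schedule's shape, not the training script used for this model: the function name is hypothetical, and the total of 4,200 steps is the approximate figure quoted above, taken here as exact.

```python
def linear_warmup_lr(
    step: int,
    total_steps: int = 4200,    # approximate value from the list above, assumed exact
    warmup_ratio: float = 0.1,
    peak_lr: float = 2e-5,
) -> float:
    """Learning rate that rises linearly to peak_lr, then decays linearly to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: 0 -> peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: peak_lr -> 0
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 210, 420, 2310, 4200):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the warmup lasts 420 steps, so the peak rate of 2e-5 is reached before the first evaluation at step 500.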

Regularization:

  • Weight decay: 0.01
  • Dropout: 0.1 (XLM-RoBERTa default)
  • Early stopping: the checkpoint with the best F1 score is kept as the final model

How to Use

Quick Start

from transformers import pipeline

# Load classifier
classifier = pipeline(
    "text-classification",
    model="Lezh1n/xlm-roberta-ecommerce-classifier"
)

# Classify products
result = classifier("Sony WH-1000XM5 wireless headphones noise cancelling")
print(result)
# Output: [{'label': 'electronics', 'score': 0.9876}]

Batch Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Lezh1n/xlm-roberta-ecommerce-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare texts
texts = [
    "iPhone 15 Pro 256GB unlocked",
    "Men's running shoes Nike Air Max",
    "Samsung 4K Smart TV 55 inch"
]

# Tokenize (max_length=256 matches the training configuration)
inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_classes = torch.argmax(predictions, dim=-1)

# Decode
for text, pred_id in zip(texts, predicted_classes):
    label = model.config.id2label[pred_id.item()]
    print(f"{text[:50]}... → {label}")

Top-K Predictions

from transformers import pipeline

# Get top 3 predictions with confidence scores
classifier = pipeline("text-classification", model="Lezh1n/xlm-roberta-ecommerce-classifier")

result = classifier(
    "Sony wireless headphones",
    top_k=3
)

for pred in result:
    print(f"{pred['label']}: {pred['score']:.2%}")
# Output:
# electronics: 95.23%
# computers_networking: 3.12%
# mobile_phones_tablets: 1.15%

API Deployment Example

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
classifier = pipeline("text-classification", model="Lezh1n/xlm-roberta-ecommerce-classifier")

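# `text` and `top_k` are read from the query string.
# A plain `def` lets FastAPI run the blocking pipeline call in a worker thread.
@app.post("/classify")
def classify_product(text: str, top_k: int = 1):
    results = classifier(text, top_k=top_k)
    return {"predictions": results}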

# Run: uvicorn app:app --host 0.0.0.0 --port 8000
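Because `text` and `top_k` are declared as plain function parameters, FastAPI treats them as query parameters rather than a JSON body. The sketch below builds a request URL for the endpoint using only the standard library; the helper name and the `localhost:8000` address are assumptions taken from the run command above.

```python
from urllib.parse import urlencode


def build_classify_url(base_url: str, text: str, top_k: int = 1) -> str:
    """Build the URL for a POST to /classify with query-string parameters."""
    query = urlencode({"text": text, "top_k": top_k})
    return f"{base_url.rstrip('/')}/classify?{query}"


url = build_classify_url("http://localhost:8000", "Sony wireless headphones", top_k=3)
print(url)
# → http://localhost:8000/classify?text=Sony+wireless+headphones&top_k=3
```

The resulting URL can then be sent as a POST request with any HTTP client, for example `curl -X POST "<url>"`.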

Categories

The model classifies products into the following 32 categories:

| Category | F1-Score | Category | F1-Score |
|---|---|---|---|
| arts_crafts | 94.98% | jewelry | 89.80% |
| automotive_motorcycle | 96.39% | kitchen_dining | 86.10% |
| baby_maternity | 98.04% | large_appliances | 87.26% |
| bags_luggage | 92.31% | mobile_phones_tablets | 95.42% |
| beauty_personal_care | 95.36% | musical_instruments | 95.65% |
| bedding_bath | 98.68% | pet_supplies | 99.67% |
| books_media | 97.67% | shoes_footwear | 77.06% |
| computers_networking | 92.01% | small_appliances | 77.09% |
| electronics | 77.30% | software_digital_goods | 93.73% |
| fashion_accessories | 70.47% | sports_outdoors | 71.04% |
| fashion_clothing | 80.40% | stationery_office_supplies | 94.16% |
| garden_outdoor_living | 96.71% | tools_hardware | 78.85% |
| grocery_food | 97.32% | toys_games | 92.67% |
| health_wellness | 81.32% | video_games_gaming | 93.47% |
| home_decor_lighting | 98.04% | watches | 94.16% |
| home_furniture | 94.08% | industrial_commercial | 94.92% |
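For downstream use it can help to have these scores in machine-readable form, for instance to decide which predictions deserve manual review. The sketch below copies the 32 F1 scores from the table, computes their unweighted mean, and lists the categories below 0.80. The variable names and the 0.80 cut-off are choices made for this illustration.

```python
# F1 scores (in percent) copied from the category table above.
F1_BY_CATEGORY = {
    "arts_crafts": 94.98, "automotive_motorcycle": 96.39, "baby_maternity": 98.04,
    "bags_luggage": 92.31, "beauty_personal_care": 95.36, "bedding_bath": 98.68,
    "books_media": 97.67, "computers_networking": 92.01, "electronics": 77.30,
    "fashion_accessories": 70.47, "fashion_clothing": 80.40,
    "garden_outdoor_living": 96.71, "grocery_food": 97.32, "health_wellness": 81.32,
    "home_decor_lighting": 98.04, "home_furniture": 94.08, "jewelry": 89.80,
    "kitchen_dining": 86.10, "large_appliances": 87.26, "mobile_phones_tablets": 95.42,
    "musical_instruments": 95.65, "pet_supplies": 99.67, "shoes_footwear": 77.06,
    "small_appliances": 77.09, "software_digital_goods": 93.73,
    "sports_outdoors": 71.04, "stationery_office_supplies": 94.16,
    "tools_hardware": 78.85, "toys_games": 92.67, "video_games_gaming": 93.47,
    "watches": 94.16, "industrial_commercial": 94.92,
}

# Unweighted mean over the listed categories.
mean_f1 = sum(F1_BY_CATEGORY.values()) / len(F1_BY_CATEGORY)

# Categories below the 0.80 threshold, weakest first.
below_80 = sorted(
    (name for name, f1 in F1_BY_CATEGORY.items() if f1 < 80.0),
    key=F1_BY_CATEGORY.get,
)

print(f"{len(F1_BY_CATEGORY)} categories, mean F1 = {mean_f1:.2f}%")
print("Below 80%:", below_80)
```

On these figures the unweighted mean comes to roughly 90.1%, above the reported macro F1 of 88.55%. One possible reading, which the card does not confirm, is that the reported macro average also includes a low-scoring class outside this table, such as the "none" category listed under Known Issues.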

Limitations

Known Issues

  1. Fashion Categories: Lower F1 scores for fashion_accessories (70.5%) and shoes_footwear (77.1%) due to semantic overlap with fashion_clothing

  2. Electronics vs Computers: Some confusion between electronics and computers_networking - both are technology products

  3. Appliance Categories: small_appliances and large_appliances show overlap with kitchen_dining

  4. Sports & Health: sports_outdoors (71.0%) and health_wellness (81.3%) show confusion due to overlapping products (e.g., fitness equipment)

  5. Noise Category: The "none" category, which is not among the 32 categories listed above, has only 8 test samples (0.2% of dataset) and shows 25% recall - insufficient training data

Recommendations for Improvement

  • Collect more data for underperforming categories
  • Hierarchical classification for fashion (parent: fashion → children: clothing, accessories, shoes)
  • Multi-label classification for products that fit multiple categories
  • Add product attributes (brand, price range) as additional features
  • Balanced sampling to ensure equal representation
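As a lightweight approximation of the hierarchical idea, top-k scores from the existing flat classifier can be summed per parent category after prediction. This is a post-hoc rollup, not a trained hierarchical model. The parent mapping follows the fashion example above; the function name and the example scores are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Parent mapping taken from the fashion example above; other labels map to themselves.
PARENT = {
    "fashion_clothing": "fashion",
    "fashion_accessories": "fashion",
    "shoes_footwear": "fashion",
}


def rollup_to_parents(
    predictions: List[Dict], parent_map: Dict[str, str] = PARENT
) -> List[Tuple[str, float]]:
    """Sum top-k scores per parent category and rank parents by total score."""
    totals = defaultdict(float)
    for pred in predictions:
        parent = parent_map.get(pred["label"], pred["label"])
        totals[parent] += pred["score"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative top-3 output in the format returned by the pipeline (scores invented).
example = [
    {"label": "fashion_accessories", "score": 0.41},
    {"label": "fashion_clothing", "score": 0.38},
    {"label": "bags_luggage", "score": 0.15},
]
ranked = rollup_to_parents(example)
print(ranked)
```

In this example neither fashion label is confident on its own, but the combined "fashion" parent clearly outranks bags_luggage, which is the situation the hierarchical recommendation targets.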

Bias and Fairness

Dataset Bias

  • Synthetic data: Generated descriptions may not capture all real-world variations
  • Seller persona bias: Distribution (40% individual, 30% reseller, 15% professional, 15% minimal) reflects common marketplace patterns but may not represent all platforms
  • Language: English-only - not suitable for multilingual e-commerce

Mitigation Strategies

  • Stratified sampling ensures balanced category representation
  • Multiple seller personas provide variation in writing styles
  • Regular evaluation on real-world data recommended
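Stratified sampling, as mentioned above, splits each category separately so that every class keeps the same share in the train and test sets. The sketch below shows the idea in plain Python. It is not the author's data pipeline, and the 20% test fraction, seed, and toy samples are assumptions chosen for the example.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (text, label)


def stratified_split(
    samples: List[Sample], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so each label contributes the same fraction to the test set."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):  # sorted for reproducibility
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy data: 20 samples for each of two categories.
samples = [(f"item {i}", label) for label in ("pet_supplies", "watches") for i in range(20)]
train, test = stratified_split(samples)
print(len(train), len(test))
```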

Environmental Impact

  • Hardware: Google Colab T4 GPU
  • Training time: 45 minutes
  • Estimated CO₂ emissions: ~0.02 kg CO₂eq (using ML CO₂ Impact calculator)
  • Considerations: Using pre-trained XLM-RoBERTa reduces environmental cost vs training from scratch

Citation

If you use this model, please cite:

@misc{xlm-roberta-ecommerce-2025,
  author = {Your Name},
  title = {XLM-RoBERTa E-Commerce Product Classifier},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lezh1n/xlm-roberta-ecommerce-classifier}}
}

Acknowledgments

  • Base model: FacebookAI/xlm-roberta-base
  • Dataset: Custom synthetic e-commerce product descriptions
  • Framework: Hugging Face Transformers
  • Training: Google Colab

License

This model is released under the MIT License.

The base model (XLM-RoBERTa) is licensed under MIT. See the original model card for details.
