🛍️ XLM-RoBERTa E-Commerce Product Classifier

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for multi-class product classification in e-commerce applications. It classifies product descriptions into 32 distinct categories with 90.1% accuracy.

The model was trained on ~32,000 synthetic English product descriptions covering major e-commerce categories, featuring realistic seller variations (professional retailers, individual sellers, resellers, and minimal listings).

Key Features

✅ 32 product categories covering major e-commerce segments
✅ 90.1% test accuracy with balanced performance across categories
✅ Robust to real-world variations: handles typos, abbreviations, casual language
✅ Fast inference: ~50-100 samples/second on CPU, 200+ on GPU
✅ Production-ready: trained with best practices, comprehensive evaluation

Intended Use

Primary Use Cases

E-commerce platforms: Automatic product categorization for listings
Marketplaces: Category suggestion for sellers
Search & recommendation: Improve product discovery and filtering
Content moderation: Detect miscategorized products
Data quality: Clean and standardize product catalogs

Out-of-Scope Use

❌ Non-English product descriptions (model trained on English only)
❌ Fine-grained product attributes (color, size, brand) - use attribute extraction models
❌ Product images - use vision models instead
❌ Categories outside the 32 predefined classes

Performance

Test Set Results

Metric	Score
Accuracy	90.14%
F1 Score (Weighted)	90.00%
F1 Score (Macro)	88.55%
Precision (Weighted)	90.34%
Recall (Weighted)	90.14%

Training Dynamics

Training Curves:

The model demonstrates excellent convergence with:

Training loss: Smooth decrease from 3.5 to 0.3
Validation loss: Stable at ~0.36 (no overfitting)
F1 Score: Steady improvement from 0.75 to 0.90+ over 3 epochs

Per-Category Performance

Top Performing Categories (F1 > 0.95):

Category	Precision	Recall	F1-Score	Support
pet_supplies	100.0%	99.3%	99.7%	151
bedding_bath	98.7%	98.7%	98.7%	151
baby_maternity	96.8%	99.3%	98.0%	151
home_decor_lighting	96.8%	99.3%	98.0%	151
books_media	97.4%	98.0%	97.7%	150
grocery_food	98.0%	96.7%	97.3%	150

Categories Needing Improvement (F1 < 0.80):

Category	Precision	Recall	F1-Score	Issue
fashion_accessories	71.4%	69.5%	70.5%	Overlaps with fashion_clothing
electronics	79.0%	75.7%	77.3%	Confused with computers_networking
small_appliances	85.5%	70.2%	77.1%	Confused with large_appliances, kitchen_dining
shoes_footwear	67.5%	89.7%	77.1%	High recall, low precision

Confusion Matrix Analysis

Key Observations:

Strong diagonal: Most categories classified correctly (dark blue diagonal)
Minimal confusion: Very few off-diagonal cells (light blue)
Related categories show expected overlap:
- electronics ↔ computers_networking (related domains)
- fashion_accessories ↔ fashion_clothing (semantic overlap)
- small_appliances ↔ kitchen_dining (contextual similarity)

Confusion patterns make semantic sense - errors occur between genuinely similar categories.

Training Details

Training Data

Dataset: Lezh1n/ecommerce-product-classification-by-categories
Total samples: 31,851
Train split: 22,283 samples (70%)
Validation split: 4,760 samples (15%)
Test split: 4,808 samples (15%)
Distribution: Stratified sampling, ~1,000 samples per category

Training Procedure

Hyperparameters:

{
  "model": "FacebookAI/xlm-roberta-base",
  "num_labels": 32,
  "max_length": 256,
  "batch_size": 16,
  "learning_rate": 2e-5,
  "num_epochs": 3,
  "warmup_ratio": 0.1,
  "weight_decay": 0.01,
  "optimizer": "AdamW",
  "lr_scheduler": "linear with warmup"
}

Training Environment:

Hardware: Google Colab T4 GPU (16GB VRAM)
Training time: ~45 minutes
Mixed precision: FP16 for faster training
Total steps: ~4,200
Evaluation frequency: Every 500 steps

Regularization:

Weight decay: 0.01
Dropout: 0.1 (XLM-RoBERTa default)
Early stopping: Best model based on F1 score

How to Use

Quick Start

from transformers import pipeline

# Load classifier
classifier = pipeline(
    "text-classification",
    model="Lezh1n/xlm-roberta-ecommerce-classifier"
)

# Classify products
result = classifier("Sony WH-1000XM5 wireless headphones noise cancelling")
print(result)
# Output: [{'label': 'electronics', 'score': 0.9876}]

Batch Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Lezh1n/xlm-roberta-ecommerce-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare texts
texts = [
    "iPhone 15 Pro 256GB unlocked",
    "Men's running shoes Nike Air Max",
    "Samsung 4K Smart TV 55 inch"
]

# Tokenize
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_classes = torch.argmax(predictions, dim=-1)

# Decode
for text, pred_id in zip(texts, predicted_classes):
    label = model.config.id2label[pred_id.item()]
    print(f"{text[:50]}... → {label}")

Top-K Predictions

# Get top 3 predictions with confidence scores
classifier = pipeline("text-classification", model="Lezh1n/xlm-roberta-ecommerce-classifier")

result = classifier(
    "Sony wireless headphones",
    top_k=3
)

for pred in result:
    print(f"{pred['label']}: {pred['score']:.2%}")
# Output:
# electronics: 95.23%
# computers_networking: 3.12%
# mobile_phones_tablets: 1.15%

API Deployment Example

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
classifier = pipeline("text-classification", model="Lezh1n/ecommerce-product-classification-by-categories")

@app.post("/classify")
async def classify_product(text: str, top_k: int = 1):
    results = classifier(text, top_k=top_k)
    return {"predictions": results}

# Run: uvicorn app:app --host 0.0.0.0 --port 8000

Category	F1-Score	Category	F1-Score
arts_crafts	94.98%	jewelry	89.80%
automotive_motorcycle	96.39%	kitchen_dining	86.10%
baby_maternity	98.04%	large_appliances	87.26%
bags_luggage	92.31%	mobile_phones_tablets	95.42%
beauty_personal_care	95.36%	musical_instruments	95.65%
bedding_bath	98.68%	pet_supplies	99.67%
books_media	97.67%	shoes_footwear	77.06%
computers_networking	92.01%	small_appliances	77.09%
electronics	77.30%	software_digital_goods	93.73%
fashion_accessories	70.47%	sports_outdoors	71.04%
fashion_clothing	80.40%	stationery_office_supplies	94.16%
garden_outdoor_living	96.71%	tools_hardware	78.85%
grocery_food	97.32%	toys_games	92.67%
health_wellness	81.32%	video_games_gaming	93.47%
home_decor_lighting	98.04%	watches	94.16%
home_furniture	94.08%	industrial_commercial	94.92%

Limitations

Known Issues

Fashion Categories: Lower F1 scores for fashion_accessories (70.5%) and shoes_footwear (77.1%) due to semantic overlap with fashion_clothing
Electronics vs Computers: Some confusion between electronics and computers_networking - both are technology products
Appliance Categories: small_appliances and large_appliances show overlap with kitchen_dining
Sports & Health: sports_outdoors (71.0%) and health_wellness (81.3%) show confusion due to overlapping products (e.g., fitness equipment)
Noise Category: The "none" category has only 8 test samples (0.2% of dataset) and shows 25% recall - insufficient training data

Recommendations for Improvement

Collect more data for underperforming categories
Hierarchical classification for fashion (parent: fashion → children: clothing, accessories, shoes)
Multi-label classification for products that fit multiple categories
Add product attributes (brand, price range) as additional features
Balanced sampling to ensure equal representation

Bias and Fairness

Dataset Bias

Synthetic data: Generated descriptions may not capture all real-world variations
Seller persona bias: Distribution (40% individual, 30% reseller, 15% professional, 15% minimal) reflects common marketplace patterns but may not represent all platforms
Language: English-only - not suitable for multilingual e-commerce

Mitigation Strategies

Stratified sampling ensures balanced category representation
Multiple seller personas provide variation in writing styles
Regular evaluation on real-world data recommended

Environmental Impact

Hardware: Google Colab T4 GPU
Training time: 45 minutes
Estimated CO₂ emissions: ~0.02 kg CO₂eq (using ML CO₂ Impact calculator)
Considerations: Using pre-trained XLM-RoBERTa reduces environmental cost vs training from scratch

Citation

If you use this model, please cite:

@misc{xlm-roberta-ecommerce-2025,
  author = {Your Name},
  title = {XLM-RoBERTa E-Commerce Product Classifier},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lezh1n/xlm-roberta-ecommerce-classifier}}
}

Acknowledgments

Base model: FacebookAI/xlm-roberta-base
Dataset: Custom synthetic e-commerce product descriptions
Framework: Hugging Face Transformers
Training: Google Colab

License

This model is released under the MIT License.

The base model (XLM-RoBERTa) is licensed under MIT. See the original model card for details.

Downloads last month: 1

Safetensors

Model size

0.3B params

Tensor type

F32

Model tree for Lezh1n/xlm-roberta-ecommerce-classifier

Base model

FacebookAI/xlm-roberta-base

Finetuned

(3980)

this model

Evaluation results

Accuracy on E-Commerce Product Classification
self-reported

0.901
F1 Score (Weighted) on E-Commerce Product Classification
self-reported

0.900
Precision (Weighted) on E-Commerce Product Classification
self-reported

0.903
Recall (Weighted) on E-Commerce Product Classification
self-reported

0.901