---
license: mit
language:
- uk
- en
base_model:
- intfloat/multilingual-e5-base
---

# Model Card for Retail Product Title Classifier (E5 fine-tuned)

## Model Details

### Model Description

A fine-tuned version of `intfloat/multilingual-e5-base`, adapted for classifying retail product titles in Ukrainian and English. The model is optimized for noisy, real-world data (e.g., typos, abbreviations) typically encountered in e-commerce catalogues.

- **Developed by:** Viacheslav Trachov
- **Model type:** Transformer Encoder (E5)
- **Language(s):** Ukrainian, English
- **License:** MIT
- **Finetuned from model:** intfloat/multilingual-e5-base

## Uses

### Direct Use

- Classifying short, noisy product titles into predefined retail categories.
- Designed for retail inventory management, e-commerce catalogues, and internal search optimization.

### Out-of-Scope Use

- Free-text generation or long-form document classification.
- Tasks requiring high performance on languages other than Ukrainian or English.

## Bias, Risks, and Limitations

- Performance may degrade on titles that mix multiple languages or are heavily abbreviated beyond retail-specific conventions.
- Categories must match the domain and fine-tuning setup (i.e., Ukrainian e-commerce retail).

### Recommendations

- Use confidence thresholds to route low-confidence predictions to manual review in critical applications.
- Test on domain-specific datasets before adapting the model to new industries.

## Training Details

### Training Data

- ~60,000 real-world Ukrainian product titles from an e-commerce aggregator.
- Titles were preprocessed minimally (lowercasing, whitespace normalization).
- Additional synthetic examples were generated for underrepresented categories using GPT-4.

### Training Procedure

- Fine-tuned for multi-class classification with cross-entropy loss.
- Max sequence length: 48 tokens
- Learning rate: 5e-5
- Batch size: 64
- Epochs: 15

### Hardware

- NVIDIA V100 GPU

## Evaluation

Macro F1-score is used due to class imbalance.
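Macro-F1 averages the per-class F1 scores with equal weight, so rare categories count as much as frequent ones. A minimal sketch of the computation, with hypothetical category labels:

```python
def macro_f1(y_true, y_pred):
    """Average per-class F1 scores with equal weight per class."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical category labels, for illustration only
y_true = ["phones", "laptops", "phones", "accessories", "phones"]
y_pred = ["phones", "laptops", "laptops", "accessories", "phones"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.822
```

In practice this is equivalent to `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.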
### Results

| Metric                | Score |
|-----------------------|-------|
| Macro-F1 (clean data) | 0.830 |
| Macro-F1 (noisy data) | 0.777 |

The model remained robust under simulated typographical noise, with only a ~6.3% relative macro-F1 degradation.
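The recommendation above — routing low-confidence predictions to manual review — can be sketched as a plain softmax-plus-threshold gate. The category labels and the 0.6 threshold here are hypothetical; tune the threshold on a held-out validation set:

```python
import math

def route_prediction(logits, id2label, threshold=0.6):
    """Softmax the classifier logits; send low-confidence titles to manual review.

    `id2label` and `threshold` are illustrative, not part of the released model.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = probs.index(max(probs))
    confidence = probs[pred]
    if confidence < threshold:
        return {"label": None, "confidence": confidence, "action": "manual_review"}
    return {"label": id2label[pred], "confidence": confidence, "action": "auto_accept"}

labels = {0: "phones", 1: "laptops", 2: "tv"}  # hypothetical categories
print(route_prediction([4.0, 1.0, 0.5], labels)["action"])  # confident → auto_accept
print(route_prediction([1.0, 0.9, 0.8], labels)["action"])  # ambiguous → manual_review
```

The same gate applies unchanged to logits produced by the fine-tuned model at inference time.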