---
license: mit
language:
- uk
- en
base_model:
- intfloat/multilingual-e5-base
---
# Model Card for Retail Product Title Classifier (E5 fine-tuned)

## Model Details

### Model Description

A fine-tuned version of `intfloat/multilingual-e5-base`, adapted to classify retail product titles in Ukrainian and English.
The model is optimized for noisy, real-world data (e.g., typos, abbreviations) typically encountered in e-commerce catalogues.

- **Developed by:** Viacheslav Trachov
- **Model type:** Transformer Encoder (E5)
- **Language(s):** Ukrainian, English
- **License:** MIT
- **Finetuned from model:** intfloat/multilingual-e5-base

## Uses

### Direct Use

- Classifying short, noisy product titles into predefined retail categories.
- Supporting retail inventory management, e-commerce catalogues, and internal search optimization.

### Out-of-Scope Use

- Free-text generation or long-form document classification.
- Tasks requiring high performance on languages other than Ukrainian and English.

## Bias, Risks, and Limitations

- Performance may degrade on titles that mix multiple languages or are heavily abbreviated beyond retail-specific contexts.
- Target categories must match the domain and label set used in fine-tuning (Ukrainian e-commerce retail).

### Recommendations

- For critical applications, use a confidence threshold to route low-confidence predictions to manual review.
- Test on domain-specific datasets before adapting the model to new industries.

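The confidence-threshold recommendation can be sketched as follows. This is a minimal illustration, not the author's pipeline: it assumes the classifier returns one raw logit per category, and the function names, the category labels, and the `0.7` threshold are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_prediction(logits, labels, threshold=0.7):
    """Return (label, confidence, needs_review) for one product title.

    `threshold` is a hypothetical value; tune it on held-out data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return labels[best], confidence, confidence < threshold

labels = ["dairy", "beverages", "household"]  # hypothetical categories

# One clear winner: confident, accepted automatically.
print(route_prediction([4.0, 0.5, 0.2], labels))
# Nearly flat logits: low confidence, flagged for manual review.
print(route_prediction([1.0, 0.9, 0.8], labels))
```

Softmax confidence is only a rough proxy for correctness, so a threshold chosen on a labelled validation set is likely to work better than a fixed default.
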
## Training Details

### Training Data

- ~60,000 real-world Ukrainian product titles from an e-commerce aggregator.
- Preprocessing was minimal: lowercasing and whitespace normalization.
- Additional synthetic examples for underrepresented categories were generated with ChatGPT-4.

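The two preprocessing steps named above (lowercasing and space normalization) can be sketched in a few lines. The function name `normalize_title` and the sample title are hypothetical, and the exact rules used during training may differ.

```python
def normalize_title(title: str) -> str:
    """Lowercase a title and collapse whitespace runs into single spaces."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # newlines) and drops empty pieces, so joining trims and collapses at once.
    return " ".join(title.lower().split())

print(normalize_title("  Молоко   ЯГОТИНСЬКЕ 2,6%\t900г "))
# -> молоко яготинське 2,6% 900г
```

Applying the same normalization at inference time keeps inputs consistent with what the model saw during fine-tuning.
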
### Training Procedure

- Fine-tuned for multi-class classification using cross-entropy loss.
- Max sequence length: 48 tokens
- Learning rate: 5e-5
- Batch size: 64
- Epochs: 15

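For readers unfamiliar with the objective, the cross-entropy loss for a single example is the negative log-probability that the softmax assigns to the true class. The sketch below is a plain-Python illustration with hypothetical logits, not the training code used for this model.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy loss for one example: -log softmax(logits)[target]."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits over three categories.
logits = [2.0, 0.5, 0.1]
print(cross_entropy(logits, target=0))  # smaller loss: true class has the top logit
print(cross_entropy(logits, target=2))  # larger loss: true class has the lowest logit
```
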
### Hardware

- NVIDIA V100 GPU

## Evaluation

Macro-F1 is the reported metric because the classes are imbalanced.

Results:

- Macro-F1 (clean data): 0.830
- Macro-F1 (noisy data): 0.777

The model remains robust under simulated typographical noise, with macro-F1 degrading by about 6.3%.
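
To illustrate why macro-F1 suits imbalanced classes, the sketch below computes it by hand on toy labels (hypothetical, not the evaluation data): a classifier that always predicts the majority class reaches 80% accuracy but a much lower macro-F1, because the rare class counts equally. It also works out the relative drop from the two rounded scores above, which comes to roughly 6.4%, close to the reported ~6.3% (the small gap is presumably due to rounding).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes seen in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): class "a" dominates, class "b" is rare.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]  # always predicts the majority class
print(macro_f1(y_true, y_pred))  # about 0.44, despite 80% accuracy

# Relative degradation computed from the rounded scores in this card.
clean, noisy = 0.830, 0.777
relative_drop = (clean - noisy) / clean
print(f"{relative_drop:.1%}")  # -> 6.4%
```
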