---
license: mit
language:
- uk
- en
base_model:
- intfloat/multilingual-e5-base
---
# Model Card for Retail Product Title Classifier (E5 fine-tuned)
## Model Details
### Model Description
A fine-tuned version of `intfloat/multilingual-e5-base`, adapted for the classification of retail product titles in Ukrainian and English.
The model is optimized for noisy, real-world data (e.g., typos, abbreviations) typically encountered in e-commerce catalogues.
- **Developed by:** Viacheslav Trachov
- **Model type:** Transformer Encoder (E5)
- **Language(s):** Ukrainian, English
- **License:** MIT
- **Finetuned from model:** intfloat/multilingual-e5-base
## Uses
### Direct Use
- Classifying short, noisy product titles into predefined retail categories (see the inference sketch below this list).
- Designed for retail inventory management, e-commerce catalogues, and internal search optimization.
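Below is a minimal inference sketch, assuming the fine-tune was exported as a standard Hugging Face sequence-classification checkpoint; the repo ID, the example title, and the label mapping are placeholders rather than values stated in this card.

```python
# Minimal inference sketch; repo ID and example title are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "SciTensor/retail-title-classifier"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

title = "смартфон samsung galaxy a54 128gb"  # noisy Ukrainian product title
inputs = tokenizer(title, truncation=True, max_length=48, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted = model.config.id2label[logits.argmax(dim=-1).item()]
print(predicted)
```

Note that E5 models are conventionally trained with a `query: ` prefix on their inputs; this card does not state whether the fine-tune expects one, so verify against the training setup before deploying.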
### Out-of-Scope Use
- Free-text generation or long-form document classification.
- Tasks requiring high performance on languages other than Ukrainian/English.
## Bias, Risks, and Limitations
- Performance may degrade on titles that mix multiple languages or are heavily abbreviated beyond retail-specific contexts.
- Categories must match the domain and fine-tuning setup (i.e., Ukrainian e-commerce retail).
### Recommendations
- Use confidence thresholds to route low-confidence predictions for manual review when misclassification is costly (see the sketch after this list).
- Test on domain-specific datasets if adapting to new industries.
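A minimal sketch of such routing, reusing the `logits` from the inference sketch above; the 0.7 threshold and the `route_to_manual_review` handler are hypothetical, and the threshold should be calibrated on a held-out validation set.

```python
import torch.nn.functional as F

probs = F.softmax(logits, dim=-1)
confidence, pred_id = probs.max(dim=-1)
if confidence.item() < 0.7:  # hypothetical threshold; calibrate on validation data
    route_to_manual_review(title)  # hypothetical downstream handler
else:
    label = model.config.id2label[pred_id.item()]
```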
## Training Details
### Training Data
- ~60,000 real-world Ukrainian product titles from an e-commerce aggregator.
- Titles were preprocessed minimally (lowercasing, whitespace normalization); a sketch follows this list.
- Additional synthetic examples were generated for underrepresented categories using GPT-4.
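A minimal sketch of that preprocessing, assuming nothing beyond the two steps named above; the actual training pipeline is not published with this card.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and collapse repeated whitespace, as described above."""
    title = title.lower()
    return re.sub(r"\s+", " ", title).strip()

normalize_title("  Смартфон   SAMSUNG Galaxy  A54 ")  # -> "смартфон samsung galaxy a54"
```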
### Training Procedure
- Fine-tuned for multi-class classification using cross-entropy loss (a training sketch follows this list).
- Max sequence length: 48 tokens
- Learning rate: 5e-5
- Batch size: 64
- Epochs: 15
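The following sketch reproduces this setup with the Hugging Face `Trainer`, which applies cross-entropy loss by default for integer labels. The dataset file, column names, and label count are placeholders; the actual training script is not published with this card.

```python
# Fine-tuning sketch using the hyperparameters listed above.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "intfloat/multilingual-e5-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=30)  # label count is a placeholder

# Hypothetical CSV with "title" and "label" (integer) columns.
ds = load_dataset("csv", data_files={"train": "titles_train.csv"})

def tokenize(batch):
    return tokenizer(batch["title"], truncation=True, max_length=48)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="e5-retail-classifier",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    num_train_epochs=15,
)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  tokenizer=tokenizer)  # tokenizer enables dynamic padding
trainer.train()
```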
### Hardware
- NVIDIA V100 GPU
## Evaluation
Macro F1-score is reported because the categories are imbalanced; it weights every class equally regardless of frequency.

| Data condition | Macro-F1 |
|---|---|
| Clean data | 0.830 |
| Noisy data | 0.777 |

The model remains robust under simulated typographical noise, losing only ~6.3% relative macro-F1 (0.830 → 0.777).
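For reference, a minimal sketch of the reported metric, with placeholder labels; macro averaging computes F1 per class and then takes the unweighted mean.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]  # placeholder gold category IDs
y_pred = [0, 1, 2, 1, 1]  # placeholder predictions
print(f1_score(y_true, y_pred, average="macro"))
```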