# mpnet-use-markov-pt
This model is a fine-tuned version of paraphrase-multilingual-mpnet-base-v2, trained on the Ukrainian text corpus UberText 2.0 with Markov-based data augmentation and pool targets enabled. It is part of the Ukrainian Sentence Embeddings collection, which explores the effect of different training strategies on sentence embedding quality for Ukrainian.
## Model Description

The model was fine-tuned with a contrastive objective on UberText 2.0, with Markov-based augmentation generating additional training examples for underrepresented polysemous words. Compared to mpnet-use-ubertext-no-pt and mpnet-use-combined-no-pt, this variant additionally enables pool targets, which provide an extra supervision signal during contrastive training; it represents the most complete training configuration in the collection.
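The model card does not specify the exact contrastive loss, so the following is only an illustrative sketch: a common choice for sentence-embedding fine-tuning is the in-batch-negatives (InfoNCE) objective, shown here with NumPy on random vectors standing in for real anchor/positive embeddings.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch-negatives contrastive loss: each anchor's positive is the
    matching row of `positives`; all other rows act as negatives."""
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (matching pairs) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Random stand-ins for embeddings; positives are slight perturbations of anchors.
rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 768))
loss = info_nce_loss(batch, batch + 0.01 * rng.standard_normal((4, 768)))
print(float(loss))
```

Pool targets would add further supervision on top of a pairwise objective like this; their exact formulation is not described in this card.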
## Collection Overview
| Model | Description |
|---|---|
| mpnet-use-ubertext-no-pt | Raw UberText 2.0, no augmentation, no pool targets |
| mpnet-use-combined-no-pt | Combined augmentation strategies, no pool targets |
| mpnet-use-markov-pt (this model) | Markov-based augmentation with pool targets |
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("victormuryn/mpnet-use-markov-pt")

# Both sentences use the polysemous word "мати" ("mother" / "to have").
sentences = [
    "Проводжає сина мати захищати рідний край",  # "A mother sees her son off to defend his native land"
    "Хоч би малесеньку хатину він мріяв мати над Дніпром",  # "He dreamed of having even a tiny house above the Dnipro"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
```
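The resulting embeddings can be compared with cosine similarity. A minimal sketch with NumPy, using random vectors in place of real `model.encode` output (768 is the mpnet hidden size):

```python
import numpy as np

# Stand-in for model.encode(sentences): two random 768-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 768)).astype(np.float32)

# Cosine similarity = dot product of L2-normalized vectors.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
similarity = normalized @ normalized.T  # (2, 2) matrix; diagonal is ~1.0
print(similarity.shape)
```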
## Training Details
- Base model: paraphrase-multilingual-mpnet-base-v2
- Training corpus: UberText 2.0
- Augmentation: Markov-based
- Pool targets: Yes
## Citation
To be added
## License
Apache 2.0