Kabyle Sentence Transformer (MPNet)

A sentence embedding model fine-tuned for cross-lingual semantic similarity between Kabyle (Taqbaylit) and English.

Model Details

| Attribute | Value |
|---|---|
| Base model | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |

Training Data

| Source | Pairs | Description |
|---|---|---|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |

Performance

Compared to the base paraphrase-multilingual-mpnet-base-v2 (before fine-tuning on Kabyle):

| Metric | Base | This Model | Gain |
|---|---|---|---|
| Avg. cosine similarity (EN<->KAB) | 0.278 | 0.857 | +0.579 |
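
The average cosine similarity reported here can be read as the mean of per-pair cosine scores over aligned EN/KAB sentence pairs. The following is a minimal sketch of that computation in plain Python, not the evaluation script used for this model. The function names and the toy 2-D vectors are hypothetical; real inputs would be the 768-dimensional outputs of model.encode.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pair_cosine(en_vectors, kab_vectors):
    """Average cosine similarity over aligned (EN, KAB) embedding pairs."""
    if len(en_vectors) != len(kab_vectors) or not en_vectors:
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine(u, v) for u, v in zip(en_vectors, kab_vectors)]
    return sum(scores) / len(scores)

# Toy 2-D vectors standing in for real embeddings (hypothetical data).
en_toy = [[1.0, 0.0], [0.0, 2.0]]
kab_toy = [[1.0, 0.0], [3.0, 0.0]]
toy_score = mean_pair_cosine(en_toy, kab_toy)  # pair scores are 1.0 and 0.0

# Gain implied by the reported figures: fine-tuned minus base.
gain = 0.857 - 0.278
print(toy_score, round(gain, 3))
```

Because every pair in such an evaluation is a true translation, this metric only shows that translations moved closer together; it does not show that non-translations stayed apart.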

Usage

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")

# Embed an English sentence and its Kabyle counterpart
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)  # array of shape (2, 768)

# Cross-lingual cosine similarity between the two embeddings
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim[0][0])  # scalar similarity score

Limitations

  • Trained primarily on parallel data; monolingual Kabyle similarity not explicitly optimized
  • Best for EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
  • Religious text overrepresented in NLLB portion; may underperform on highly technical/modern domains
  • All evaluation pairs were positive, so the evaluator received constant labels (all 1.0); with zero label variance, correlation metrics (Pearson/Spearman) are undefined and were not reported
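
The last limitation follows from the definition of Pearson correlation: its denominator contains the standard deviation of the labels, which is zero when every label is 1.0. The sketch below illustrates this with a hand-written pearson helper and hypothetical similarity scores. It is not the evaluator used during training.

```python
import math

def pearson(xs, ys):
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # division by zero: the correlation is undefined
    return cov / math.sqrt(var_x * var_y)

# Hypothetical model similarities for four translation pairs.
predicted = [0.91, 0.84, 0.88, 0.79]
# Every pair is a positive example, so every gold label is 1.0.
constant_labels = [1.0] * len(predicted)

undefined = pearson(predicted, constant_labels)
# The mean similarity is still well defined for the same scores.
mean_similarity = sum(predicted) / len(predicted)
print(undefined, mean_similarity)
```

A correlation-based evaluation would need graded or negative pairs so that the labels vary.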

Future Work

  • Train v2 with Davlan/afro-xlmr-large backbone for African-specific pretraining
  • Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
  • Switch the evaluator to an average-cosine metric (AvgCosineEvaluator) instead of correlation-based metrics
  • Evaluate against LASER on a proper benchmark
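
One possible shape for such a benchmark, offered as an assumption and not a committed plan, is bitext retrieval: for each English sentence, check whether its nearest Kabyle embedding is the true translation. Unlike an average over positive pairs, this implicitly uses every other sentence as a negative. The function name and toy vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_retrieval_accuracy(query_vectors, candidate_vectors):
    """Fraction of queries whose most similar candidate shares their index."""
    hits = 0
    for i, query in enumerate(query_vectors):
        scores = [cosine(query, cand) for cand in candidate_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(query_vectors)

# Toy data: queries[i] is meant to match candidates[i] (hypothetical vectors).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
candidates = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]

# The third query is closer to candidates[0] than to its own translation.
accuracy = top1_retrieval_accuracy(queries, candidates)
print(accuracy)
```

Running the same function on embeddings from this model and from LASER over one shared test set would give directly comparable numbers.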

Citation

If you use this model, please cite:

@misc{kabyle-st-mpnet,
  title={Kabyle Sentence Transformer},
  author={boffire},
  year={2026},
  howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}

Acknowledgments

  • Imsidag-community for the cleaned parallel corpora
  • Tatoeba contributors for community translations
  • Meta AI for LASER and the NLLB datasets
  • boffire community for Kabyle NLP tooling