---
language:
- ta
- en
license: apache-2.0
base_model: intfloat/multilingual-e5-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- tamil
- embedding
- sentence-transformers
- matryoshka
- dravidian
- cross-lingual
model-index:
- name: Tamil-Embed-Base
  results:
  - task:
      type: STS
    dataset:
      name: IndicCrosslingualSTS (en-ta)
      type: mteb/IndicCrosslingualSTS
    metrics:
    - type: spearman
      value: 0.489
      name: Spearman (en-ta)
---

# Tamil-Embed-Base

A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.

**Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*

## Model Details

| Property | Value |
|----------|-------|
| Base model | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| Parameters | 278M |
| Embedding dimensions | 768 (supports Matryoshka truncation to 512, 256, 128, 64) |
| Training data | Tamil NLI entailment pairs + Samanantar English-Tamil parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss wrapping MultipleNegativesRankingLoss |

## Training

Two-stage training pipeline:

1. **Stage 1 (NLI warm-up):** fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss.
2. **Stage 2 (Retrieval):** fine-tune on the Samanantar English-Tamil parallel corpus with hard negatives.

## MTEB Results

IndicCrosslingualSTS benchmark (Spearman correlation):

| Language pair | Score |
|---------------|-------|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| **en-ta (Tamil)** | **0.489** |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |

## Usage

As with the base E5 model, prefix queries with `query: ` and documents with `passage: `:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tamil-ai/tamil-embed-base")

sentences = [
    "query: தமிழ் மொழியின் வரலாறு என்ன?",
    "passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",
    "passage: Python is a popular programming language.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute similarity
from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)  # the Tamil passage should score higher
```

### Matryoshka (variable dimensions)

```python
# Use smaller dimensions for faster search with minimal quality loss.
embeddings = model.encode(sentences)
embeddings_256 = embeddings[:, :256]
embeddings_128 = embeddings[:, :128]
# Re-normalize truncated embeddings before dot-product search;
# cos_sim normalizes internally, so it needs no extra step.
```

## Intended Use

- Tamil semantic search and retrieval
- Cross-lingual English-Tamil similarity
- Tamil document clustering
- RAG (Retrieval-Augmented Generation) over Tamil documents

## Citation

```bibtex
@misc{tamilai2026embed,
  title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
  author={Tamil-AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Tamil-ai/tamil-embed-base}
}
```
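
## Appendix: Matryoshka Truncation Sketch

Matryoshka truncation amounts to slicing off the leading dimensions and re-normalizing each row to unit length, after which a plain dot product gives cosine similarity. A minimal NumPy sketch of that pattern, using random vectors as a stand-in for real model outputs (the function name is illustrative, not part of any library API):

```python
import numpy as np

# Stand-in for model.encode(...) output: random vectors for illustration only;
# real embeddings would come from the model shown in the Usage section.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768)).astype(np.float32)

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and rescale rows to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

emb_256 = truncate_and_normalize(full, 256)
# With unit-norm rows, a plain matrix product is pairwise cosine similarity.
scores = emb_256 @ emb_256.T
print(scores.shape)  # (3, 3)
```

The same slicing works for any of the supported dimensions (512, 256, 128, 64); smaller vectors trade a little accuracy for faster search and a smaller index.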