---
language:
- ta
- en
license: apache-2.0
base_model: intfloat/multilingual-e5-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- tamil
- embedding
- sentence-transformers
- matryoshka
- dravidian
- cross-lingual
model-index:
- name: Tamil-Embed-Base
  results:
  - task:
      type: STS
    dataset:
      name: IndicCrosslingualSTS (en-ta)
      type: mteb/IndicCrosslingualSTS
    metrics:
    - type: spearman
      value: 0.489
      name: Spearman (en-ta)
---

# Tamil-Embed-Base

A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.

**Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*

## Model Details

| Property | Value |
|----------|-------|
| Base model | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| Parameters | 278M |
| Embedding dimensions | 768 (Matryoshka: 768, 512, 256, 128, 64) |
| Training data | Tamil NLI entailment pairs + Samanantar parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss wrapping MultipleNegativesRankingLoss |

## Training

Two-stage training pipeline:

1. **Stage 1 (NLI warm-up):** fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss.
2. **Stage 2 (retrieval):** fine-tune on the Samanantar English-Tamil parallel corpus with hard negatives.
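The combined objective can be sketched in plain NumPy (an illustrative re-implementation for intuition, not the actual training code; the function names `mnrl_loss` and `matryoshka_loss` are ours): MultipleNegativesRankingLoss is a cross-entropy over in-batch negatives, and MatryoshkaLoss applies that inner loss at each truncated embedding prefix.

```python
import numpy as np

def mnrl_loss(q, p, scale=20.0):
    """In-batch negatives cross-entropy: each query's positive
    passage is the matching row of p; all other rows are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                    # (batch, batch) cosine scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # positives lie on the diagonal

def matryoshka_loss(q, p, dims=(768, 512, 256, 128, 64)):
    """Sum the inner loss over each truncated embedding prefix."""
    return sum(mnrl_loss(q[:, :d], p[:, :d]) for d in dims)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 768))             # toy query embeddings
p = q + 0.1 * rng.normal(size=(8, 768))   # matching passages (near-duplicates)
print(matryoshka_loss(q, p))              # small loss: diagonal pairs dominate
```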
|
## MTEB Results

IndicCrosslingualSTS benchmark (Spearman correlation):

| Language Pair | Score |
|---------------|-------|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| **en-ta (Tamil)** | **0.489** |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |
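For context, the Spearman score reported above is the rank correlation between the model's predicted sentence-pair similarities and human-annotated gold scores. A minimal NumPy sketch with toy numbers (not benchmark data; ties are ignored for simplicity):

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

predicted = np.array([0.82, 0.15, 0.55, 0.91, 0.33])  # model cosine similarities
gold = np.array([4.5, 0.5, 2.8, 5.0, 1.2])            # human scores (0-5 STS scale)
print(spearman(predicted, gold))  # 1.0: the two rankings agree perfectly
```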
|
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tamil-ai/tamil-embed-base")

sentences = [
    "query: தமிழ் மொழியின் வரலாறு என்ன?",  # "What is the history of the Tamil language?"
    "passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",  # "Tamil is a classical language with over 2000 years of history."
    "passage: Python is a popular programming language.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute similarity
from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)  # the Tamil passage should score higher
```

### Matryoshka (variable dimensions)

```python
# Use a smaller dimension for faster search with minimal quality loss
# (truncate_dim requires sentence-transformers >= 2.7)
model_256 = SentenceTransformer("Tamil-ai/tamil-embed-base", truncate_dim=256)
embeddings_256 = model_256.encode(sentences)
print(embeddings_256.shape)  # (3, 256)
```

If you instead slice full embeddings yourself (e.g. `embeddings[:, :128]`), re-normalize the truncated vectors before dot-product search; `cos_sim` normalizes internally, so cosine similarity is unaffected.

## Intended Use

- Tamil semantic search and retrieval
- Cross-lingual English-Tamil similarity
- Tamil document clustering
- Retrieval-augmented generation (RAG) for Tamil
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @misc{tamilai2026embed, |
| | title={A Thousand Language Problem: Morphological Understanding in Linguistic AI}, |
| | author={Tamil-AI}, |
| | year={2026}, |
| | publisher={HuggingFace}, |
| | url={https://huggingface.co/Tamil-ai/tamil-embed-base} |
| | } |
| | ``` |
| |
|