---
language:
- ta
- en
license: apache-2.0
base_model: intfloat/multilingual-e5-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- tamil
- embedding
- sentence-transformers
- matryoshka
- dravidian
- cross-lingual
model-index:
- name: Tamil-Embed-Base
results:
- task:
type: STS
dataset:
name: IndicCrosslingualSTS (en-ta)
type: mteb/IndicCrosslingualSTS
metrics:
- type: spearman
value: 0.489
name: Spearman (en-ta)
---
# Tamil-Embed-Base
A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.
**Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*
## Model Details
| Property | Value |
|----------|-------|
| Base model | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| Parameters | 278M |
| Embedding dimensions | 768 (supports Matryoshka: 768, 512, 256, 128, 64) |
| Training data | NLI entailment pairs (ta) + Samanantar parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss + MultipleNegativesRankingLoss |
## Training
Two-stage training pipeline:
1. **Stage 1 (NLI Warm-up):** Fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss
2. **Stage 2 (Retrieval):** Fine-tune on Samanantar English-Tamil parallel corpus with hard negatives
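Both stages optimize an in-batch contrastive objective. As a rough numeric sketch of the math (not the actual training code; the batch, noise scale, and function names are made up), MultipleNegativesRankingLoss is a cross-entropy where each anchor must rank its own positive above every other positive in the batch, and MatryoshkaLoss sums that loss over truncated prefixes of the embedding:

```python
import numpy as np

def mnrl(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch negatives: each anchor should score
    its own positive (the diagonal) above all other positives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # diagonal entries = correct pairs

def matryoshka_mnrl(anchors, positives, dims=(768, 512, 256, 128, 64)):
    """MatryoshkaLoss: sum the base loss over truncated embedding prefixes,
    so every prefix length learns a usable representation."""
    return sum(mnrl(anchors[:, :d], positives[:, :d]) for d in dims)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))  # toy "entailment" pairs
print(matryoshka_mnrl(anchors, positives))  # near zero: positives match anchors
```

In practice this is handled by `sentence_transformers.losses.MatryoshkaLoss` wrapping `MultipleNegativesRankingLoss`, as listed in the table above.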
## MTEB Results
IndicCrosslingualSTS benchmark (Spearman correlation):
| Language Pair | Score |
|---------------|-------|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| **en-ta (Tamil)** | **0.489** |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |
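The scores above are Spearman rank correlations between the model's cosine similarities and human similarity judgments. A minimal sketch of the scoring step, with made-up similarity and gold-score arrays standing in for real benchmark data:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical values: model cosine similarities vs. human gold scores (0-5)
model_sims = np.array([0.91, 0.45, 0.78, 0.12, 0.66])
gold_scores = np.array([4.8, 2.0, 3.2, 0.5, 4.1])

# Spearman correlates the *rankings*, so it is insensitive to the
# different scales of the two arrays
rho, p_value = spearmanr(model_sims, gold_scores)
print(f"Spearman: {rho:.3f}")  # Spearman: 0.900
```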
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Tamil-ai/tamil-embed-base")
# E5-style prefixes: "query: " for queries, "passage: " for documents
sentences = [
"query: தமிழ் மொழியின் வரலாறு என்ன?",
"passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",
"passage: Python is a popular programming language.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 768)
# Compute similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities) # Tamil passage should score higher
```
### Matryoshka (variable dimensions)
```python
# Encode once, then truncate to smaller dimensions for faster search
# with minimal quality loss (re-normalize if you use raw dot products)
embeddings = model.encode(sentences)
embeddings_256 = embeddings[:, :256]
embeddings_128 = embeddings[:, :128]
```
## Intended Use
- Tamil semantic search and retrieval
- Cross-lingual English-Tamil similarity
- Tamil document clustering
- RAG (Retrieval Augmented Generation) for Tamil
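For the search and RAG use cases, passages are typically embedded once offline and queries are matched by cosine similarity at query time. A minimal retrieval sketch (random vectors stand in for real `model.encode` output; `top_k` is an illustrative helper, not part of the library):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Rank documents by cosine similarity to the query, return top-k."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    idx = np.argsort(-scores)[:k]       # indices of the k best matches
    return idx, scores[idx]

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(100, 768))              # stand-in for encoded passages
query = doc_embs[42] + 0.05 * rng.normal(size=768)  # query close to document 42
idx, scores = top_k(query, doc_embs)
print(idx[0])  # document 42 should rank first
```

The same loop works with truncated Matryoshka embeddings: slice both query and document matrices to the chosen prefix length before scoring.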
## Citation
```bibtex
@misc{tamilai2026embed,
title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
author={Tamil-AI},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/Tamil-ai/tamil-embed-base}
}
```