triBne-e5-small — Banglish / Bangla / English

A sentence-embedding model fine-tuned from intfloat/multilingual-e5-small for robust semantic retrieval across Bangla (Bengali script), English, and Banglish (romanized Bengali). It is built to tolerate the heavy spelling variation that romanized Bengali exhibits (e.g. bhalobashi ↔ valobashi).

The base model collapses spelling variants and cross-script pairs to near-uniform similarity (it cannot tell a true positive from a hard negative). This fine-tune restores a large, statistically significant positive/negative margin while keeping the model tiny (~118M params, 384-dim embeddings, runs on CPU).

Base model: intfloat/multilingual-e5-small (XLM-RoBERTa backbone, 384-dim)
Method: LoRA contrastive fine-tuning, adapter merged into the backbone
Embedding dim: 384 · Max sequence length: 128 · Similarity: cosine
Pooling: mean pooling + L2 normalization
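
As a rough illustration of what "adapter merged into the backbone" means, the sketch below folds a LoRA update into a dense weight matrix using the standard LoRA formulation W' = W + (alpha / r) * B A. The toy matrices and the helper names (matmul, merge_lora) are illustrative assumptions, not the release's actual merge code. The toy adapter uses rank 1 to keep the numbers readable, with alpha / r = 2, the same ratio as the alpha=64, r=32 setting listed under Training.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (hypothetical values).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.5]]    # r x in_features
lora_b = [[1.0], [0.0]]  # out_features x r
merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)
```

Once the update is folded in like this, the checkpoint contains ordinary dense weights, which is consistent with the model loading through plain SentenceTransformer or AutoModel calls with no adapter library involved.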

Usage

The model was fine-tuned on raw text pairs without the e5 "query: " / "passage: " prefixes, so you do not need to add them. Just encode raw strings.

sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("istiaqfuad/triBne-e5-small")

sentences = [
    "ami tomake bhalobashi",      # Banglish
    "ami tomake valobashi",       # Banglish spelling variant
    "আমি তোমাকে ভালোবাসি",         # Bangla
    "I love you",                 # English
]
emb = model.encode(sentences, normalize_embeddings=True)
print(model.similarity(emb, emb))

transformers (manual mean pooling)

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("istiaqfuad/triBne-e5-small")
model = AutoModel.from_pretrained("istiaqfuad/triBne-e5-small")

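def encode(texts):
    # Tokenize with the same 128-token limit used in training.
    batch = tok(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Zero out padding positions, then average over the real tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)   # mean pooling
    return F.normalize(emb, p=2, dim=1)   # unit length, so dot product = cosine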

emb = encode(["ami tomake bhalobashi", "আমি তোমাকে ভালোবাসি"])
print((emb[0] @ emb[1]).item())
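
For retrieval, the same dot product can be used to rank a corpus against a query. The following is a minimal sketch in plain Python, assuming the vectors are already L2-normalized as in both snippets above. The function name rank_by_cosine and the two-dimensional toy vectors are illustrative stand-ins, not part of the model's API.

```python
def rank_by_cosine(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k highest dot products.

    Assumes every vector is already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in corpus_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy unit vectors standing in for real 384-dim model output.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]
top = rank_by_cosine(query, corpus, k=2)
print(top)
```

In practice the inputs would be rows of the array returned by model.encode(..., normalize_embeddings=True) or by the encode helper above, converted with .tolist() if plain lists are wanted.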

Evaluation

Evaluated on a held-out test set across three retrieval tasks: banglish_spelling (romanized spelling-variant matching), cross_script (Bangla ↔ Banglish), and en_bn (English ↔ Bangla). Compared against the base model and four strong multilingual baselines (BGE-M3, LaBSE, EmbeddingGemma-300m, and Qwen3-Embedding-0.6B). This model wins on both Banglish-specific tasks (banglish_spelling, cross_script) and on overall MRR@10. On the EN-BN control task it is competitive: LaBSE is marginally ahead on MRR@10 (0.922 vs 0.907), a gap that is not statistically significant (Δ = −0.015, p = 0.996).

Overall retrieval (mean over the three tasks):

Model	MRR@10	NDCG@10	Recall@10
This model (e5-small, fine-tuned)	0.926	0.937	0.970
Multilingual E5 (base)	0.644	0.673	0.768
BGE-M3	0.661	0.683	0.753
LaBSE	0.626	0.648	0.720
Qwen3-Embedding-0.6B	0.613	0.644	0.743
EmbeddingGemma-300m	0.593	0.619	0.702
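
For readers unfamiliar with the three metrics in the table, here is a minimal per-query sketch using their usual binary-relevance definitions. It assumes each query comes with a ranked list of unique candidate ids and a non-empty set of relevant ids. The exact evaluation protocol behind the table is not described in this card, so treat this as an illustration rather than the evaluation script.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant id in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant ids that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Hypothetical query whose only relevant document is ranked second.
ranked = ["d7", "d2", "d9"]
relevant = {"d2"}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

Each reported number would then be the average of the per-query value over a task's queries.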

Per-task MRR@10 — this model vs. base:

Task	This model	Base e5
banglish_spelling	0.890	0.616
cross_script (bn↔banglish)	0.981	0.498
en_bn	0.907	0.818
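
The overall MRR@10 figures reported earlier are the unweighted mean of these three per-task values, which can be checked directly. The dictionary layout and the name macro_mean are illustrative only.

```python
# Per-task MRR@10 values copied from the table above.
per_task_mrr = {
    "banglish_spelling": {"this_model": 0.890, "base_e5": 0.616},
    "cross_script": {"this_model": 0.981, "base_e5": 0.498},
    "en_bn": {"this_model": 0.907, "base_e5": 0.818},
}


def macro_mean(model_key):
    """Unweighted mean over tasks, rounded to three decimals."""
    values = [scores[model_key] for scores in per_task_mrr.values()]
    return round(sum(values) / len(values), 3)


print(macro_mean("this_model"))  # 0.926
print(macro_mean("base_e5"))     # 0.644
```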

Positive/negative cosine margin (higher = better separation). On banglish spelling variants the base model has a negative margin (−0.011, i.e. it cannot separate positives from hard negatives); this model reaches +0.310 (p ≈ 0, Wilcoxon). Cross-script margin: +0.353 vs base −0.047.
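
One plausible way to compute such a margin is sketched below. It assumes the margin is the mean, over (anchor, positive, hard negative) triplets, of cos(anchor, positive) minus cos(anchor, negative). The card does not spell out the exact definition, so this reading, the helper names, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_margin(triplets):
    """Mean of cos(anchor, positive) - cos(anchor, negative) over triplets."""
    margins = [cosine(a, p) - cosine(a, n) for a, p, n in triplets]
    return sum(margins) / len(margins)


# Toy triplets: (anchor, positive, hard negative).
toy_triplets = [
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # well separated: margin 1
    ([0.0, 1.0], [1.0, 0.0], [1.0, 0.0]),  # not separated: margin 0
]
print(mean_margin(toy_triplets))
```

A margin near zero or below means positives and hard negatives score about the same, which is the failure mode described for the base model. The per-triplet margins are also the kind of paired differences that a Wilcoxon signed-rank test (for example scipy.stats.wilcoxon) takes as input.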

Training

Objective: MultipleNegativesRankingLoss (in-batch negatives), scale 20.0
PEFT: LoRA — r=32, alpha=64, dropout=0.1, targets query,key,value,dense, task_type=FEATURE_EXTRACTION (adapter merged into the backbone for this release)
Epochs: 3 · LR: 3e-5 · Warmup: 500 steps · Optimizer: adamw_8bit
Effective batch: 128 (64/GPU × 2× T4, fp16, gradient checkpointing)
Max sequence length: 128 · final train loss ≈ 0.40
Data: istiaqfuad/bangla-english-banglish-pairs — ~2.4M contrastive pairs (LLM-generated Banglish spelling variants + cross-script pairs, plus OPUS-100 English↔Bangla), interleaved 80% Banglish / 20% English–Bangla.
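
To make the objective concrete, here is a small pure-Python sketch of a multiple-negatives ranking loss with in-batch negatives: each anchor is scored against every positive in the batch by scaled cosine similarity, and cross-entropy pushes its own positive to the top. It follows the usual description of this loss with scale 20.0. The function names are illustrative, and this is not the training code used for the release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, pos) for pos in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # each anchor matches its own positive
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # each anchor matches the other positive
print(aligned, swapped)
```

The aligned batch gives a loss close to zero, while the swapped batch gives a loss close to the scale value, which shows how the scale sharpens the penalty for ranking an in-batch negative above the true positive.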

Intended use & limitations

Use for: semantic search / retrieval, clustering, and similarity over mixed Bangla / English / romanized-Bengali text — especially noisy, user-generated romanized Bengali with inconsistent spelling.
Limitations: trained primarily on short text (≤128 tokens); longer inputs are truncated. Banglish training pairs are partly LLM-generated and may carry their biases. Not built for classification or generation.
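
Because inputs beyond 128 tokens are truncated, one common workaround is to split a long document into overlapping windows and embed each window separately. The sketch below only does the splitting. It assumes token ids produced without special tokens and reserves two positions per window for the start and end tokens the tokenizer adds. The function name chunk_token_ids and the overlap of 16 are illustrative choices, and the card reports no evaluation of chunked long-document retrieval.

```python
def chunk_token_ids(token_ids, max_length=128, num_special=2, overlap=16):
    """Split token ids into overlapping windows that fit the length limit.

    Each window holds at most max_length - num_special ids, leaving room
    for the special tokens added when the window is encoded.
    """
    window = max_length - num_special
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


# Hypothetical 300-token document represented by dummy ids.
chunks = chunk_token_ids(list(range(300)))
print([len(c) for c in chunks])
```

How the per-window embeddings are combined afterwards (for example, averaging them or keeping the best-scoring window per query) is a design choice left to the application.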

Citation

If you use this model, please cite the base model (Wang et al., Multilingual E5) and this fine-tune:

@misc{tribne-e5-small,
  title  = {triBne-e5-small: multilingual-e5-small fine-tuned for Banglish/Bangla/English retrieval},
  author = {Istiaqur Rahman Fuad},
  year   = {2026},
  url    = {https://huggingface.co/istiaqfuad/triBne-e5-small}
}

This model is released under the MIT license, following the base model.

Paper: Multilingual E5 Text Embeddings: A Technical Report (arXiv:2402.05672, published Feb 8, 2024)