# Naturecode Dhivehi Embeddings
High-quality text embeddings for Dhivehi language.
## Model Description
Naturecode Dhivehi Embeddings is a fine-tuned embedding model optimized for Dhivehi text. Built on Qwen3-Embedding-0.6B, this model produces dense vector representations that capture semantic meaning in Dhivehi, enabling powerful semantic search and text similarity applications.
## Key Features
- Dhivehi-Optimized: Trained specifically for Dhivehi semantic understanding
- Semantic Search: Find similar Dhivehi documents by meaning, not just keywords
- 1024-dim Embeddings: Rich representations for downstream tasks
- Efficient: 0.6B parameters balanced for quality and speed
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, "hilarl/naturecode-dhivehi-embeddings")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only, so padding cannot dilute batched inputs
    mask = inputs["attention_mask"].unsqueeze(-1)
    embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    # L2-normalize so cosine similarity reduces to a dot product
    return embedding / torch.norm(embedding, dim=-1, keepdim=True)

# Example: compare Dhivehi sentences
text1 = "މާލެ އަކީ ދިވެހިރާއްޖޭގެ ވެރިރަށް"  # Male is the capital of the Maldives
text2 = "ދިވެހިރާއްޖެ އަކީ ރީތި ޤައުމެއް"  # The Maldives is a beautiful country

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
similarity = torch.cosine_similarity(emb1, emb2).item()
print(f"Similarity: {similarity:.4f}")  # ~0.94 for related Dhivehi sentences
```
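Because `get_embedding` returns L2-normalized vectors, pairwise cosine similarities for a whole batch reduce to a single matrix product. A minimal sketch on random stand-in vectors (no model download needed; only the 1024-dim shape mirrors the model's output):

```python
import torch

# Three random stand-ins for get_embedding() outputs (dim 1024, like the model)
torch.manual_seed(0)
emb = torch.randn(3, 1024)
emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize, as get_embedding does

# All pairwise cosine similarities at once via a matrix product
sim = emb @ emb.T
print(sim.diag())  # self-similarity is 1.0 for each row
```

This is the usual pattern for scoring one query against many cached document embeddings without calling the model again.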
## Performance
| Text Pair | Cosine Similarity |
|---|---|
| Dhivehi sentence 1 vs Dhivehi sentence 2 | 0.9375 |
| Dhivehi sentence vs English sentence | 0.2559 |
The model correctly identifies semantic similarity between related Dhivehi texts.
## Training Details
- Base Model: Qwen/Qwen3-Embedding-0.6B
- Training Method: LoRA fine-tuning (5 phases)
- Embedding Dimension: 1024
## Intended Use
- Semantic search in Dhivehi document collections
- Dhivehi text clustering and classification
- RAG (Retrieval-Augmented Generation) for Dhivehi
- Duplicate detection in Dhivehi content
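The semantic-search use case can be sketched as a top-k ranking over precomputed embeddings. `search` is a hypothetical helper (not part of the model), and random unit vectors stand in for real `get_embedding` outputs:

```python
import torch

def search(query_emb, doc_embs, top_k=2):
    """Rank documents by cosine similarity to a query.
    Assumes all embeddings are L2-normalized, as get_embedding returns them."""
    scores = doc_embs @ query_emb.squeeze(0)  # dot product == cosine similarity
    top = torch.topk(scores, k=top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Random unit vectors stand in for real document embeddings
torch.manual_seed(0)
docs = torch.nn.functional.normalize(torch.randn(5, 1024), dim=-1)

# A query very close to document 2, lightly perturbed
query = torch.nn.functional.normalize(docs[2:3] + 0.001 * torch.randn(1, 1024), dim=-1)

print(search(query, docs))  # document 2 should rank first, with similarity near 1.0
```

For large collections, the same dot-product scoring is what vector indexes (e.g. FAISS) accelerate.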
## Limitations
- Optimized for Dhivehi; cross-lingual performance may vary
- Best for sentence/paragraph level embeddings
- Maximum sequence length inherited from base model
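For documents longer than the base model's maximum sequence length, a common workaround is to embed overlapping chunks and pool the results. A minimal sketch with a hypothetical word-based chunker (`chunk_words` is illustrative; in practice, count tokens with the model's tokenizer rather than words):

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (hypothetical sizes --
    tune max_words so each chunk fits the model's sequence limit)."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words) - overlap, step)]

# Each chunk is embedded separately; a document vector can then be the
# (re-normalized) mean of its chunk embeddings, e.g.:
#   chunks = chunk_words(long_document)
#   doc_emb = torch.stack([get_embedding(c) for c in chunks]).mean(dim=0)
```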
## License
Apache 2.0
## Citation

```bibtex
@misc{naturecode-dhivehi-embeddings,
  author    = {Naturecode},
  title     = {Naturecode Dhivehi Embeddings},
  year      = {2025},
  publisher = {HuggingFace},
}
```