# Naturecode Dhivehi Embeddings
High-quality text embeddings for Dhivehi language.
## Model Description
Naturecode Dhivehi Embeddings is a fine-tuned embedding model optimized for Dhivehi text. Built on Qwen3-Embedding-0.6B, this model produces dense vector representations that capture semantic meaning in Dhivehi, enabling powerful semantic search and text similarity applications.
## Key Features
- Dhivehi-Optimized: Trained specifically for Dhivehi semantic understanding
- Semantic Search: Find similar Dhivehi documents by meaning, not just keywords
- 1024-dim Embeddings: Rich representations for downstream tasks
- Efficient: 0.6B parameters balanced for quality and speed
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, "hilarl/naturecode-dhivehi-embeddings")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only, so padding cannot dilute batched inputs
    mask = inputs["attention_mask"].unsqueeze(-1)
    embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    # L2-normalize so cosine similarity reduces to a dot product
    return embedding / torch.norm(embedding, dim=-1, keepdim=True)

# Example: compare Dhivehi sentences
text1 = "މާލެ އަކީ ދިވެހިރާއްޖޭގެ ވެރިރަށް"  # Male is the capital of the Maldives
text2 = "ދިވެހިރާއްޖެ އަކީ ރީތި ޤައުމެއް"  # The Maldives is a beautiful country

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
similarity = torch.cosine_similarity(emb1, emb2).item()
print(f"Similarity: {similarity:.4f}")  # ~0.94 for related Dhivehi sentences
```
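Because `get_embedding` returns L2-normalized vectors, pairwise cosine similarities for a whole batch reduce to a single matrix product. A minimal sketch on random stand-in vectors (no model download needed; only the 1024-dim shape mirrors the model's output):

```python
import torch

# Three random stand-ins for get_embedding() outputs (dim 1024, like the model)
torch.manual_seed(0)
emb = torch.randn(3, 1024)
emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize, as get_embedding does

# All pairwise cosine similarities at once via a matrix product
sim = emb @ emb.T
print(sim.diag())  # self-similarity is 1.0 for each row
```

This is the usual pattern for scoring one query against many cached document embeddings without calling the model again.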
## Performance
| Text Pair | Cosine Similarity |
|---|---|
| Dhivehi sentence 1 vs Dhivehi sentence 2 | 0.9375 |
| Dhivehi sentence vs English sentence | 0.2559 |
The model correctly identifies semantic similarity between related Dhivehi texts.
## Training Details
- Base Model: Qwen/Qwen3-Embedding-0.6B
- Training Method: LoRA fine-tuning (5 phases)
- Embedding Dimension: 1024
## Intended Use
- Semantic search in Dhivehi document collections
- Dhivehi text clustering and classification
- RAG (Retrieval-Augmented Generation) for Dhivehi
- Duplicate detection in Dhivehi content
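The semantic-search use case can be sketched as a top-k ranking over precomputed embeddings. `search` is a hypothetical helper (not part of the model), and random unit vectors stand in for real `get_embedding` outputs:

```python
import torch

def search(query_emb, doc_embs, top_k=2):
    """Rank documents by cosine similarity to a query.
    Assumes all embeddings are L2-normalized, as get_embedding returns them."""
    scores = doc_embs @ query_emb.squeeze(0)  # dot product == cosine similarity
    top = torch.topk(scores, k=top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Random unit vectors stand in for real document embeddings
torch.manual_seed(0)
docs = torch.nn.functional.normalize(torch.randn(5, 1024), dim=-1)

# A query very close to document 2, lightly perturbed
query = torch.nn.functional.normalize(docs[2:3] + 0.001 * torch.randn(1, 1024), dim=-1)

print(search(query, docs))  # document 2 should rank first, with similarity near 1.0
```

For large collections, the same dot-product scoring is what vector indexes (e.g. FAISS) accelerate.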
## Limitations
- Optimized for Dhivehi; cross-lingual performance may vary
- Best for sentence/paragraph level embeddings
- Maximum sequence length inherited from base model
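For documents longer than the base model's maximum sequence length, a common workaround is to embed overlapping chunks and pool the results. A minimal sketch with a hypothetical word-based chunker (`chunk_words` is illustrative; in practice, count tokens with the model's tokenizer rather than words):

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (hypothetical sizes --
    tune max_words so each chunk fits the model's sequence limit)."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words) - overlap, step)]

# Each chunk is embedded separately; a document vector can then be the
# (re-normalized) mean of its chunk embeddings, e.g.:
#   chunks = chunk_words(long_document)
#   doc_emb = torch.stack([get_embedding(c) for c in chunks]).mean(dim=0)
```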
## License
Apache 2.0
## Citation

```bibtex
@misc{naturecode-dhivehi-embeddings,
  author    = {Naturecode},
  title     = {Naturecode Dhivehi Embeddings},
  year      = {2025},
  publisher = {HuggingFace},
}
```