
Naturecode Dhivehi Embeddings

High-quality text embeddings for the Dhivehi language.

Model Description

Naturecode Dhivehi Embeddings is a fine-tuned embedding model optimized for Dhivehi text. Built on Qwen3-Embedding-0.6B, this model produces dense vector representations that capture semantic meaning in Dhivehi, enabling powerful semantic search and text similarity applications.

Key Features

  • Dhivehi-Optimized: Trained specifically for Dhivehi semantic understanding
  • Semantic Search: Find similar Dhivehi documents by meaning, not just keywords
  • 1024-dim Embeddings: Rich representations for downstream tasks
  • Efficient: 0.6B parameters balanced for quality and speed

Usage

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, "hilarl/naturecode-dhivehi-embeddings")
model.eval()  # inference only
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool token representations into one vector, then L2-normalize
        embedding = outputs.last_hidden_state.mean(dim=1)
        embedding = embedding / torch.norm(embedding, dim=-1, keepdim=True)
    return embedding

# Example: Compare Dhivehi sentences
text1 = "މާލެ އަކީ ދިވެހިރާއްޖޭގެ ވެރިރަށް"  # Male is the capital of Maldives
text2 = "ދިވެހިރާއްޖެ އަކީ ރީތި ޤައުމެއް"      # Maldives is a beautiful country

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
similarity = torch.cosine_similarity(emb1, emb2).item()
print(f"Similarity: {similarity:.4f}")  # ~0.94 for related Dhivehi sentences
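
Because `get_embedding` L2-normalizes its output, cosine similarity reduces to a dot product, and semantic search over a document collection is a single matrix multiply followed by a sort. A minimal sketch in plain `torch`, using small synthetic unit vectors in place of real 1024-dim embeddings (the function itself works unchanged on `get_embedding` outputs):

```python
import torch

def rank_by_similarity(query_emb, doc_embs):
    """Rank documents by cosine similarity to a query.

    Assumes all embeddings are already L2-normalized (as returned by
    get_embedding above), so cosine similarity is just a dot product.
    query_emb: (1, dim) tensor; doc_embs: (num_docs, dim) tensor.
    """
    scores = doc_embs @ query_emb.squeeze(0)        # (num_docs,)
    order = torch.argsort(scores, descending=True)  # best match first
    return order.tolist(), scores[order].tolist()

# Synthetic 4-dim unit vectors standing in for real embeddings
query = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
docs = torch.tensor([
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0, 0.0],  # identical to the query
])
order, scores = rank_by_similarity(query, docs)
print(order)  # [1, 0]
```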

Performance

| Text Pair                                 | Cosine Similarity |
|-------------------------------------------|-------------------|
| Dhivehi sentence 1 vs Dhivehi sentence 2  | 0.9375            |
| Dhivehi sentence vs English sentence      | 0.2559            |

The model correctly identifies semantic similarity between related Dhivehi texts.

Training Details

  • Base Model: Qwen/Qwen3-Embedding-0.6B
  • Training Method: LoRA fine-tuning (5 phases)
  • Embedding Dimension: 1024
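
The card publishes only the points above, not the full recipe. For orientation, a LoRA setup over this base model with `peft` might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values used in training:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

# Hypothetical LoRA configuration -- hyperparameter values and target
# modules are assumptions for illustration, not the published recipe.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```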

Intended Use

  • Semantic search in Dhivehi document collections
  • Dhivehi text clustering and classification
  • RAG (Retrieval Augmented Generation) for Dhivehi
  • Duplicate detection in Dhivehi content
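
For the duplicate-detection use case, one common approach (an assumption, not something this card prescribes) is to compute pairwise cosine similarities and flag pairs above a threshold. A small self-contained sketch with toy 2-dim unit vectors standing in for real embeddings:

```python
import torch

def find_duplicates(embs, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds
    `threshold`. Assumes rows of `embs` are L2-normalized, so the
    pairwise similarity matrix is embs @ embs.T."""
    sims = embs @ embs.T
    n = embs.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] > threshold]

# Toy vectors: rows 0 and 2 nearly identical, row 1 orthogonal to both
embs = torch.nn.functional.normalize(torch.tensor([
    [1.00, 0.00],
    [0.00, 1.00],
    [0.99, 0.01],
]), dim=-1)
print(find_duplicates(embs))  # [(0, 2)]
```

The 0.95 threshold is a placeholder; in practice it should be tuned against labeled duplicate pairs for the corpus at hand.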

Limitations

  • Optimized for Dhivehi; cross-lingual performance may vary
  • Best for sentence/paragraph level embeddings
  • Maximum sequence length inherited from base model

License

Apache 2.0

Citation

@misc{naturecode-dhivehi-embeddings,
  author = {Naturecode},
  title = {Naturecode Dhivehi Embeddings},
  year = {2025},
  publisher = {HuggingFace},
}