---
license: apache-2.0
language:
  - dv
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
  - dhivehi
  - maldives
  - embeddings
  - sentence-similarity
  - semantic-search
  - naturecode
library_name: transformers
pipeline_tag: feature-extraction
---

# Naturecode Dhivehi Embeddings

High-quality text embeddings for the Dhivehi language.

## Model Description

Naturecode Dhivehi Embeddings is a fine-tuned embedding model optimized for Dhivehi text. Built on Qwen3-Embedding-0.6B, this model produces dense vector representations that capture semantic meaning in Dhivehi, enabling powerful semantic search and text similarity applications.

## Key Features

- Dhivehi-Optimized: Trained specifically for Dhivehi semantic understanding
- Semantic Search: Find similar Dhivehi documents by meaning, not just keywords
- 1024-dim Embeddings: Rich representations for downstream tasks
- Efficient: 0.6B parameters balanced for quality and speed

## Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Load the base encoder, then apply the LoRA adapter weights on top
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, "hilarl/naturecode-dhivehi-embeddings")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool over real tokens only, so padding does not skew the average
        mask = inputs["attention_mask"].unsqueeze(-1)
        embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        # L2-normalize so cosine similarity reduces to a dot product
        embedding = embedding / torch.norm(embedding, dim=-1, keepdim=True)
    return embedding

# Example: compare Dhivehi sentences
text1 = "މާލެ އަކީ ދިވެހިރާއްޖޭގެ ވެރިރަށް"  # Male is the capital of the Maldives
text2 = "ދިވެހިރާއްޖެ އަކީ ރީތި ޤައުމެއް"      # The Maldives is a beautiful country

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
similarity = torch.cosine_similarity(emb1, emb2).item()
print(f"Similarity: {similarity:.4f}")  # ~0.94 for related Dhivehi sentences
```

## Performance

| Text Pair | Cosine Similarity |
|---|---|
| Dhivehi sentence 1 vs Dhivehi sentence 2 | 0.9375 |
| Dhivehi sentence vs English sentence | 0.2559 |

The model assigns high similarity to related Dhivehi sentence pairs while scoring an unrelated cross-lingual pair much lower.

## Training Details

- Base Model: Qwen/Qwen3-Embedding-0.6B
- Training Method: LoRA fine-tuning (5 phases)
- Embedding Dimension: 1024
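LoRA fine-tuning freezes the base weights and learns a low-rank additive update. The card does not publish the adapter's rank or scaling, so the shapes below are purely illustrative, a sketch of the mechanism in plain torch:

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 16  # illustrative sizes, not the model's actual config

W = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.01   # trainable low-rank factor
B = torch.zeros(d_out, r)         # zero-initialized, so training starts from the base model

def lora_forward(x):
    # Base projection plus the scaled low-rank update (alpha / r) * B @ A
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(1, d_in)
# With B at zero, the adapted layer reproduces the frozen base exactly
assert torch.allclose(lora_forward(x), x @ W.T)
```

At inference time, `PeftModel.from_pretrained` applies exactly this kind of update on top of the base model's attention and projection weights.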

## Intended Use

- Semantic search in Dhivehi document collections
- Dhivehi text clustering and classification
- RAG (Retrieval Augmented Generation) for Dhivehi
- Duplicate detection in Dhivehi content
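For the duplicate-detection use case, a common pattern is to flag pairs whose cosine similarity exceeds a threshold. A sketch with toy unit vectors standing in for `get_embedding` outputs; the 0.95 threshold is an assumption to be tuned on real Dhivehi data:

```python
import torch

def find_duplicates(embs, threshold=0.95):
    # Pairwise cosine similarities (embeddings are already unit-normalized)
    sims = embs @ embs.T
    pairs = []
    n = embs.shape[0]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, skip self-pairs
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Toy unit vectors in place of real get_embedding() outputs
a = torch.nn.functional.normalize(torch.tensor([1.0, 0.0, 0.0]), dim=0)
b = torch.nn.functional.normalize(torch.tensor([0.99, 0.01, 0.0]), dim=0)  # near-duplicate of a
c = torch.nn.functional.normalize(torch.tensor([0.0, 1.0, 0.0]), dim=0)    # unrelated
embs = torch.stack([a, b, c])
print(find_duplicates(embs))  # [(0, 1)]
```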

## Limitations

- Optimized for Dhivehi; cross-lingual performance may vary
- Best for sentence/paragraph level embeddings
- Maximum sequence length inherited from base model

## License

Apache 2.0

## Citation

```bibtex
@misc{naturecode-dhivehi-embeddings,
  author = {Naturecode},
  title = {Naturecode Dhivehi Embeddings},
  year = {2025},
  publisher = {HuggingFace},
}
```