
DhiGemma 12B

DhiGemma 12B is the first large-scale Dhivehi (Maldivian) language model, built by fine-tuning Google's Gemma 3 12B through an 8-phase progressive training curriculum. The model understands and generates fluent Dhivehi text written in Thaana script, and handles conversational Q&A, cultural knowledge, creative writing, and more.

Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-3-12b-it (12.5B parameters) |
| Architecture | Gemma 3, 48 layers, 262K vocabulary |
| Modality | Multimodal (vision tower preserved but untrained) |
| Languages | Dhivehi (dv), English (en) |
| License | Gemma License |
| Precision | bfloat16 |
| Context Length | 4096 tokens |
| Developer | Naturecode |

Training

Training Infrastructure

  • Hardware: 8x NVIDIA H100 80GB GPUs (Google Cloud a3-highgpu-8g)
  • Framework: Axolotl 0.13.2 with LoRA adaptation
  • Parallelism: DDP (Distributed Data Parallel) via Accelerate, DeepSpeed ZeRO-2 for large-scale phases
  • Software: Transformers 4.57.6, PyTorch 2.9.1, Accelerate 1.12.0, PEFT

Training Methodology

DhiGemma was trained using a progressive 8-phase curriculum that builds Dhivehi language capability incrementally:

Phase 1-3: Knowledge Foundation (Continued Pre-Training)

  • Phase 1: Thaana script fundamentals -- alphabet, grammar rules, writing system (~2,200 examples)
  • Phase 2: Dictionary knowledge -- word definitions, proverbs, cultural expressions (~5,800 examples)
  • Phase 3: Formal Dhivehi text -- encyclopedic articles, news, legal documents, government publications (~64,000 examples)

Phase 4: Large-Scale Continued Pre-Training

  • Web-crawled Dhivehi sentences and informal text (~7M examples, 610MB)
  • Completion format, 1 epoch, sequence length 2048 (packing sketched below)
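
The Phase 4 preprocessing code is not published. As an illustration, completion-format packing at sequence length 2048 amounts to concatenating raw text with EOS separators and chunking; the texts here are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
SEQ_LEN = 2048

texts = ["...", "..."]  # placeholder for web-crawled Dhivehi sentences
ids = []
for text in texts:
    ids += tokenizer(text, add_special_tokens=False)["input_ids"]
    ids.append(tokenizer.eos_token_id)  # document separator

# Drop the trailing partial chunk; each chunk is one training example.
chunks = [ids[i : i + SEQ_LEN] for i in range(0, len(ids) - SEQ_LEN + 1, SEQ_LEN)]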

Phase 5: Supervised Fine-Tuning

  • Phase 5a: Domain-specific Q&A, chain-of-thought reasoning, cultural knowledge (~74,000 examples)
  • Phase 5b: High-quality synthetic conversational data across 20+ topics (~100,000 examples)
  • Chat template format (Gemma's native user/model roles), trained on assistant turns only (see the masking sketch below)
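
The masking itself is handled by Axolotl; conceptually, supervising only the model's turns means setting the labels of all prompt tokens to -100, the index that PyTorch's cross-entropy loss ignores. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naturecodeproject/DhiGemma-12B")

messages = [
    {"role": "user", "content": "..."},       # placeholder user turn
    {"role": "assistant", "content": "..."},  # placeholder model turn
]

# Token ids for the full conversation and for the prompt-only prefix.
full = tokenizer.apply_chat_template(messages)
prompt = tokenizer.apply_chat_template(messages[:-1], add_generation_prompt=True)

# Loss is computed only on the assistant's tokens.
labels = [-100] * len(prompt) + full[len(prompt):]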

Phase 6: Direct Preference Optimization (DPO)

  • Chosen vs. rejected response pairs for alignment (~10,000 preference pairs)
  • Beta=0.1, trained on the Phase 5b merged checkpoint (see the sketch below)
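
The actual run used Axolotl; an equivalent minimal sketch with recent TRL (the dataset path is a placeholder, and the JSONL needs prompt/chosen/rejected columns) might look like:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("naturecodeproject/DhiGemma-12B")
tokenizer = AutoTokenizer.from_pretrained("naturecodeproject/DhiGemma-12B")
pairs = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

args = DPOConfig(output_dir="phase6-dpo", beta=0.1, bf16=True)
trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer)
trainer.train()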

Phase 7: Concise Response SFT

  • Short, direct answers to teach focused response generation (~3,600 examples)
  • Addresses verbosity from earlier SFT phases

Phase 8: SimPO (Simple Preference Optimization)

  • Final preference alignment for response quality (~260 preference pairs)
  • Custom training script run via Accelerate (objective sketched below)
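
The custom script is not published; for reference, the SimPO objective it optimizes is reference-model-free and length-normalized. A sketch of the loss (the beta and gamma values here are illustrative, not necessarily the ones used):

import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens,
               beta=2.0, gamma=1.0):
    """logps are summed token log-probs per response; lens are token counts."""
    # Length-normalized implicit rewards; no reference model needed.
    chosen_rewards = beta * chosen_logps / chosen_lens
    rejected_rewards = beta * rejected_logps / rejected_lens
    # Bradley-Terry loss with a target reward margin gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()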

LoRA Configuration (All Phases)

adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true  # All linear layers
  • Trainable parameters: 296M / 12.5B total (2.38%); see the PEFT sketch after this list
  • Precision: bfloat16 with TF32
  • Gradient checkpointing: Enabled
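
For readers working with PEFT directly, the same adapter settings translate roughly as follows (lora_target_linear: true in Axolotl corresponds to PEFT's "all-linear" target):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",  # all linear layers, as in the Axolotl config
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # should roughly match the ~2.38% reported above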

Training Data Summary

| Data Type | Examples | Description |
|---|---|---|
| Thaana script curriculum | ~2,200 | Alphabet, grammar, writing system fundamentals |
| Dictionary entries & proverbs | ~5,800 | Word definitions, cultural expressions |
| Encyclopedic articles | ~63,000 | Formal Dhivehi text from various domains |
| Web-crawled sentences | ~7,059,000 | Large-scale informal Dhivehi text |
| Informal blog content | ~11,600 | Cultural and everyday writing |
| Domain-specific Q&A | ~17,000 | Structured Q&A across 20+ topics |
| Chain-of-thought reasoning | ~43,400 | Multi-step reasoning examples |
| Cultural knowledge | ~2,400 | History, geography, food, crafts, traditions |
| Synthetic SFT conversations | ~100,000 | High-quality multi-turn conversations |
| Preference pairs (DPO) | ~10,000 | Chosen vs. rejected responses |
| Concise response SFT | ~3,600 | Short, focused answer examples |
| Preference pairs (SimPO) | ~260 | Final alignment pairs |
| Total | ~7,318,000 | ~930 MB |

Key Technical Decisions

  1. No flash attention: the Gemma 3 architecture requires SDPA (scaled dot-product attention)
  2. DDP find_unused_parameters: Required because the vision tower doesn't participate in text-only forward passes (see the sketch after this list)
  3. Sample packing: Enabled for all SFT phases, disabled for DPO/SimPO
  4. LoRA preserves multimodality: Base weights frozen, vision tower untouched
  5. Progressive curriculum: Knowledge is built incrementally from script basics to conversational fluency
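
A minimal sketch of decisions 1 and 2, assuming Transformers and Accelerate:

import torch
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs
from transformers import AutoModelForCausalLM

# Load with SDPA instead of flash attention (decision 1).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

# Let DDP tolerate the vision tower's unused parameters (decision 2).
accelerator = Accelerator(
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)]
)
model = accelerator.prepare(model)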

Usage

With Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "naturecodeproject/DhiGemma-12B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# "What is the capital of the Maldives?"
messages = [
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.5,
    top_p=0.85,
    repetition_penalty=1.15,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

With SGLang (Recommended for Production)

python3 -m sglang.launch_server \
    --model-path naturecodeproject/DhiGemma-12B \
    --tp 8 \
    --dtype bfloat16 \
    --mem-fraction-static 0.85 \
    --context-length 4096

With vLLM

vllm serve naturecodeproject/DhiGemma-12B \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --max-model-len 4096
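
Both servers expose an OpenAI-compatible API. A minimal client sketch, assuming SGLang's default port 30000 (vLLM defaults to 8000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
response = client.chat.completions.create(
    model="naturecodeproject/DhiGemma-12B",
    messages=[{"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"}],
    temperature=0.5,
    top_p=0.85,
    max_tokens=512,
)
print(response.choices[0].message.content)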

Inference Parameters

Recommended generation parameters:

| Parameter | Value |
|---|---|
| temperature | 0.5 |
| top_p | 0.85 |
| repetition_penalty | 1.15 |
| max_tokens | 512 (normal), 4096 (long-form) |

Chat Template

DhiGemma uses Gemma 3's native chat template with user and model roles. There is no system role.

Recommended approach: Use a multi-turn pattern where system instructions are provided as a separate user turn, followed by a short assistant acknowledgment, then the actual user query. This prevents the model from echoing or repeating system instructions in its response.

<start_of_turn>user
[System instructions here]<end_of_turn>
<start_of_turn>model
ލައްބަ.<end_of_turn>
<start_of_turn>user
[Actual user question]<end_of_turn>
<start_of_turn>model

Example (OpenAI-compatible API):

{
  "messages": [
    {"role": "user", "content": "ތިބާ އަކީ ދިވެހި ބަހުގެ އެހީތެރިއެކެވެ. ސުވާލަށް ސީދާ، ކުރު، ފުރިހަމަ ޖަވާބެއް ދޭށެވެ."},
    {"role": "assistant", "content": "ލައްބަ."},
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"}
  ]
}

Note: Concatenating system instructions directly into the user message (e.g., "[instructions]\n\n[question]") can cause the model to echo the instructions or restate the question before answering. The multi-turn separation avoids this.
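
A small hypothetical helper that applies this pattern (the acknowledgment "ލައްބަ." roughly means "Understood."):

def with_system(system: str, question: str) -> list[dict]:
    """Wrap system instructions in the recommended multi-turn pattern."""
    return [
        {"role": "user", "content": system},
        {"role": "assistant", "content": "ލައްބަ."},  # short acknowledgment
        {"role": "user", "content": question},
    ]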

Capabilities

  • Dhivehi conversational Q&A
  • Thaana script reading and writing
  • Maldivian cultural knowledge (history, geography, food, crafts, traditions)
  • Dhivehi grammar and language explanations
  • Creative writing in Dhivehi (essays, letters, stories)
  • Basic reasoning and chain-of-thought in Dhivehi
  • Code-switching between Dhivehi and English

Limitations

  • No formal benchmark: Dhivehi language benchmarks do not currently exist. Quality was evaluated qualitatively.
  • Repetition: Like other Gemma 3 models, DhiGemma may occasionally produce repetitive text, especially in long-form generation. Use repetition_penalty=1.15 to mitigate this.
  • Knowledge cutoff: Training data primarily covers content available up to early 2025.
  • Vision: While the vision tower is preserved in the weights, it was not fine-tuned for Dhivehi visual tasks.
  • Context: Optimized for a 4096-token context length.

Hardware Requirements

| Configuration | Memory | Notes |
|---|---|---|
| BF16 (full precision) | ~24 GB | Requires 1x A100 80GB or 2x RTX 4090 |
| 8-bit quantization | ~12 GB | Single RTX 4090 |
| 4-bit quantization | ~6 GB | Consumer GPUs (RTX 3060+) |
| Multi-GPU (TP=8) | ~3 GB/GPU | Production deployment on H100s |

Citation

If you use DhiGemma in your research or applications, please cite:

@misc{dhigemma2025,
  title={DhiGemma 12B: A Dhivehi Language Model},
  author={Naturecode},
  year={2025},
  url={https://huggingface.co/naturecodeproject/DhiGemma-12B}
}

Acknowledgments

Built on Google's Gemma 3 architecture. Training infrastructure provided by Google Cloud.
