# DhiGemma 12B
DhiGemma 12B is the first large-scale Dhivehi (Maldivian) language model, built by fine-tuning Google's Gemma 3 12B through an 8-phase progressive training curriculum. The model understands and generates fluent Dhivehi text written in Thaana script, and handles conversational Q&A, cultural knowledge, creative writing, and more.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-3-12b-it (12.5B parameters) |
| Architecture | Gemma 3, 48 layers, 262K vocabulary |
| Modality | Multimodal (vision tower preserved but untrained) |
| Languages | Dhivehi (dv), English (en) |
| License | Gemma License |
| Precision | bfloat16 |
| Context Length | 4096 tokens |
| Developer | Naturecode |
## Training

### Training Infrastructure
- Hardware: 8x NVIDIA H100 80GB GPUs (Google Cloud a3-highgpu-8g)
- Framework: Axolotl 0.13.2 with LoRA adaptation
- Parallelism: DDP (Distributed Data Parallel) via Accelerate, DeepSpeed ZeRO-2 for large-scale phases
- Software: Transformers 4.57.6, PyTorch 2.9.1, Accelerate 1.12.0, PEFT
### Training Methodology
DhiGemma was trained using a progressive 8-phase curriculum that builds Dhivehi language capability incrementally:
#### Phase 1-3: Knowledge Foundation (Continued Pre-Training)
- Phase 1: Thaana script fundamentals -- alphabet, grammar rules, writing system (~2,200 examples)
- Phase 2: Dictionary knowledge -- word definitions, proverbs, cultural expressions (~5,800 examples)
- Phase 3: Formal Dhivehi text -- encyclopedic articles, news, legal documents, government publications (~64,000 examples)
#### Phase 4: Large-Scale Continued Pre-Training
- Web-crawled Dhivehi sentences and informal text (~7M examples, 610MB)
- Completion format, 1 epoch, sequence length 2048
#### Phase 5: Supervised Fine-Tuning
- Phase 5a: Domain-specific Q&A, chain-of-thought reasoning, cultural knowledge (~74,000 examples)
- Phase 5b: High-quality synthetic conversational data across 20+ topics (~100,000 examples)
- Chat template format (Gemma-native `user`/`model` roles), trained on assistant turns only
#### Phase 6: Direct Preference Optimization (DPO)
- Chosen vs. rejected response pairs for alignment (~10,000 preference pairs)
- Beta=0.1, trained on Phase 5b merged checkpoint
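The per-pair DPO objective can be sketched as follows. This is a minimal illustration from sequence-level log-probabilities, not the actual training code; only the beta=0.1 value comes from the training setup above.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities under the policy
    and the frozen reference model (beta=0.1, as in Phase 6)."""
    # Log-ratio of policy vs. reference for chosen minus rejected
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, and grows when the preference is inverted.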
#### Phase 7: Concise Response SFT
- Short, direct answers to teach focused response generation (~3,600 examples)
- Addresses verbosity from earlier SFT phases
#### Phase 8: SimPO (Simple Preference Optimization)
- Final preference alignment for response quality (~260 preference pairs)
- Custom training script via Accelerate
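Unlike DPO, SimPO drops the reference model and uses a length-normalized implicit reward with a target margin. A minimal per-pair sketch (the `beta` and `gamma` defaults here are illustrative, not the Phase 8 hyperparameters):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss. beta and gamma are illustrative defaults only."""
    # Length-normalized average log-probability acts as the implicit reward
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # gamma is the target reward margin between chosen and rejected
    margin = reward_chosen - reward_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The length normalization is what counteracts verbosity bias: a longer response no longer wins simply by accumulating more log-probability mass.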
### LoRA Configuration (All Phases)

```yaml
adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true  # All linear layers
```
- Trainable parameters: 296M / 12.5B total (2.38%)
- Precision: bfloat16 with TF32
- Gradient checkpointing: Enabled
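The ~2.4% trainable fraction follows from the low-rank factorization: a targeted linear layer of shape `d_out x d_in` gains only `r * (d_in + d_out)` trainable parameters. A back-of-envelope sketch (the layer dimensions below are illustrative, not Gemma 3's actual shapes):

```python
def lora_param_count(d_in, d_out, r=64):
    # LoRA factors: A is (r x d_in), B is (d_out x r); only these train,
    # while the original d_out x d_in weight stays frozen
    return r * (d_in + d_out)

# Illustrative: a square 4096x4096 projection at r=64 adds ~524K
# trainable parameters against ~16.8M frozen ones (~3%)
```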
### Training Data Summary
| Data Type | Examples | Description |
|---|---|---|
| Thaana script curriculum | ~2,200 | Alphabet, grammar, writing system fundamentals |
| Dictionary entries & proverbs | ~5,800 | Word definitions, cultural expressions |
| Encyclopedic articles | ~63,000 | Formal Dhivehi text from various domains |
| Web-crawled sentences | ~7,059,000 | Large-scale informal Dhivehi text |
| Informal blog content | ~11,600 | Cultural and everyday writing |
| Domain-specific Q&A | ~17,000 | Structured Q&A across 20+ topics |
| Chain-of-thought reasoning | ~43,400 | Multi-step reasoning examples |
| Cultural knowledge | ~2,400 | History, geography, food, crafts, traditions |
| Synthetic SFT conversations | ~100,000 | High-quality multi-turn conversations |
| Preference pairs (DPO) | ~10,000 | Chosen vs. rejected responses |
| Concise response SFT | ~3,600 | Short, focused answer examples |
| Preference pairs (SimPO) | ~260 | Final alignment pairs |
| Total | ~7,318,000 | ~930 MB |
### Key Technical Decisions
- No flash attention: the Gemma 3 architecture requires SDPA (scaled dot-product attention)
- DDP find_unused_parameters: Required because the vision tower doesn't participate in text-only forward passes
- Sample packing: Enabled for all SFT phases, disabled for DPO/SimPO
- LoRA preserves multimodality: Base weights frozen, vision tower untouched
- Progressive curriculum: Knowledge is built incrementally from script basics to conversational fluency
## Usage

### With Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "naturecodeproject/DhiGemma-12B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "What is the capital of the Maldives?"
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.5,
    top_p=0.85,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### With SGLang (Recommended for Production)

```bash
python3 -m sglang.launch_server \
  --model-path naturecodeproject/DhiGemma-12B \
  --tp 8 \
  --dtype bfloat16 \
  --mem-fraction-static 0.85 \
  --context-length 4096
```
### With vLLM

```bash
vllm serve naturecodeproject/DhiGemma-12B \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-model-len 4096
```
### Inference Parameters

Recommended generation parameters:

| Parameter | Value |
|---|---|
| `temperature` | 0.5 |
| `top_p` | 0.85 |
| `repetition_penalty` | 1.15 |
| `max_tokens` | 512 (normal), 4096 (long-form) |
## Chat Template
DhiGemma uses Gemma 3's native chat template with user and model roles. There is no system role.
Recommended approach: Use a multi-turn pattern where system instructions are provided as a separate user turn, followed by a short assistant acknowledgment, then the actual user query. This prevents the model from echoing or repeating system instructions in its response.
```
<start_of_turn>user
[System instructions here]<end_of_turn>
<start_of_turn>model
ލައްބަ.<end_of_turn>
<start_of_turn>user
[Actual user question]<end_of_turn>
<start_of_turn>model
```
Example (OpenAI-compatible API):
```json
{
  "messages": [
    {"role": "user", "content": "ތިބާ އަކީ ދިވެހި ބަހުގެ އެހީތެރިއެކެވެ. ސުވާލަށް ސީދާ، ކުރު، ފުރިހަމަ ޖަވާބެއް ދޭށެވެ."},
    {"role": "assistant", "content": "ލައްބަ."},
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"}
  ]
}
```
Note: Concatenating system instructions directly into the user message (e.g., `"[instructions]\n\n[question]"`) can cause the model to echo the instructions or restate the question before answering. The multi-turn separation avoids this.
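For clients that assemble prompts by hand rather than through `apply_chat_template`, the turn format can be rendered like this. This is a sketch; `to_gemma_prompt` is a hypothetical helper, not part of the model repository.

```python
def to_gemma_prompt(messages):
    """Render OpenAI-style messages into Gemma 3's turn format.
    Maps the 'assistant' role onto Gemma's 'model' role; no system role exists."""
    role_map = {"user": "user", "assistant": "model"}
    out = []
    for msg in messages:
        out.append(f"<start_of_turn>{role_map[msg['role']]}\n"
                   f"{msg['content']}<end_of_turn>\n")
    out.append("<start_of_turn>model\n")  # generation prompt for the reply
    return "".join(out)
```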
## Capabilities
- Dhivehi conversational Q&A
- Thaana script reading and writing
- Maldivian cultural knowledge (history, geography, food, crafts, traditions)
- Dhivehi grammar and language explanations
- Creative writing in Dhivehi (essays, letters, stories)
- Basic reasoning and chain-of-thought in Dhivehi
- Code-switching between Dhivehi and English
## Limitations
- No formal benchmark: Dhivehi language benchmarks do not currently exist. Quality was evaluated qualitatively.
- Repetition: Like other Gemma 3 models, may occasionally produce repetitive text, especially in long-form generation. Use `repetition_penalty=1.15` to mitigate.
- Knowledge cutoff: Training data primarily covers content available up to early 2025.
- Vision: While the vision tower is preserved in the weights, it was not fine-tuned for Dhivehi visual tasks.
- Context: Optimized for 4096 token context length.
## Hardware Requirements
| Configuration | Memory | Notes |
|---|---|---|
| BF16 (full precision) | ~24 GB | Requires 1x A100 80GB or 2x RTX 4090 |
| 8-bit quantization | ~12 GB | Single RTX 4090 |
| 4-bit quantization | ~6 GB | Consumer GPUs (RTX 3060+) |
| Multi-GPU (TP=8) | ~3 GB/GPU | Production deployment on H100s |
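The memory figures above are roughly weight storage alone: parameter count times bytes per parameter, ignoring activations, KV cache, and framework overhead. A quick sanity-check sketch:

```python
def weight_memory_gib(n_params=12.5e9, bits_per_param=16):
    # Weights only: bf16 = 16 bits; 8-bit and 4-bit quantization
    # shrink the footprint proportionally
    return n_params * bits_per_param / 8 / 2**30
```

At 12.5B parameters this gives roughly 23 GiB for bf16, ~12 GiB at 8-bit, and ~6 GiB at 4-bit, matching the table; leave headroom for the KV cache and activations on top of these figures.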
## Citation
If you use DhiGemma in your research or applications, please cite:
```bibtex
@misc{dhigemma2025,
  title={DhiGemma 12B: A Dhivehi Language Model},
  author={Naturecode},
  year={2025},
  url={https://huggingface.co/naturecodeproject/DhiGemma-12B}
}
```
## Acknowledgments
Built on Google's Gemma 3 architecture. Training infrastructure provided by Google Cloud.