# DhiGemma 12B
DhiGemma 12B is the first large-scale Dhivehi (Maldivian) language model, built by fine-tuning Google's Gemma 3 12B through an 8-phase progressive training curriculum. The model understands and generates fluent Dhivehi text written in Thaana script, and handles conversational Q&A, cultural knowledge, creative writing, and more.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-3-12b-it (12.5B parameters) |
| Architecture | Gemma 3, 48 layers, 262K vocabulary |
| Modality | Multimodal (vision tower preserved but untrained) |
| Languages | Dhivehi (dv), English (en) |
| License | Gemma License |
| Precision | bfloat16 |
| Context Length | 4096 tokens |
| Developer | Naturecode |
## Training

### Training Infrastructure
- Hardware: 8x NVIDIA H100 80GB GPUs (Google Cloud a3-highgpu-8g)
- Framework: Axolotl 0.13.2 with LoRA adaptation
- Parallelism: DDP (Distributed Data Parallel) via Accelerate, DeepSpeed ZeRO-2 for large-scale phases
- Software: Transformers 4.57.6, PyTorch 2.9.1, Accelerate 1.12.0, PEFT
### Training Methodology
DhiGemma was trained using a progressive 8-phase curriculum that builds Dhivehi language capability incrementally:
#### Phase 1-3: Knowledge Foundation (Continued Pre-Training)
- Phase 1: Thaana script fundamentals -- alphabet, grammar rules, writing system (~2,200 examples)
- Phase 2: Dictionary knowledge -- word definitions, proverbs, cultural expressions (~5,800 examples)
- Phase 3: Formal Dhivehi text -- encyclopedic articles, news, legal documents, government publications (~64,000 examples)
#### Phase 4: Large-Scale Continued Pre-Training
- Web-crawled Dhivehi sentences and informal text (~7M examples, 610MB)
- Completion format, 1 epoch, sequence length 2048
#### Phase 5: Supervised Fine-Tuning
- Phase 5a: Domain-specific Q&A, chain-of-thought reasoning, cultural knowledge (~74,000 examples)
- Phase 5b: High-quality synthetic conversational data across 20+ topics (~100,000 examples)
- Chat template format (Gemma-native `user`/`model` roles), trained on assistant turns only
#### Phase 6: Direct Preference Optimization (DPO)
- Chosen vs. rejected response pairs for alignment (~10,000 preference pairs)
- Beta=0.1, trained on Phase 5b merged checkpoint
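The per-pair DPO objective can be sketched as follows. This is a minimal illustration from sequence-level log-probabilities, not the actual training code; only the beta=0.1 value comes from the training setup above.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities under the policy
    and the frozen reference model (beta=0.1, as in Phase 6)."""
    # Log-ratio of policy vs. reference for chosen minus rejected
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, and grows when the preference is inverted.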
#### Phase 7: Concise Response SFT
- Short, direct answers to teach focused response generation (~3,600 examples)
- Addresses verbosity from earlier SFT phases
#### Phase 8: SimPO (Simple Preference Optimization)
- Final preference alignment for response quality (~260 preference pairs)
- Custom training script via Accelerate
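Unlike DPO, SimPO drops the reference model and uses a length-normalized implicit reward with a target margin. A minimal per-pair sketch (the `beta` and `gamma` defaults here are illustrative, not the Phase 8 hyperparameters):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss. beta and gamma are illustrative defaults only."""
    # Length-normalized average log-probability acts as the implicit reward
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # gamma is the target reward margin between chosen and rejected
    margin = reward_chosen - reward_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The length normalization is what counteracts verbosity bias: a longer response no longer wins simply by accumulating more log-probability mass.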
### LoRA Configuration (All Phases)

```yaml
adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true  # All linear layers
```
- Trainable parameters: 296M / 12.5B total (2.38%)
- Precision: bfloat16 with TF32
- Gradient checkpointing: Enabled
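The ~2.4% trainable fraction follows from the low-rank factorization: a targeted linear layer of shape `d_out x d_in` gains only `r * (d_in + d_out)` trainable parameters. A back-of-envelope sketch (the layer dimensions below are illustrative, not Gemma 3's actual shapes):

```python
def lora_param_count(d_in, d_out, r=64):
    # LoRA factors: A is (r x d_in), B is (d_out x r); only these train,
    # while the original d_out x d_in weight stays frozen
    return r * (d_in + d_out)

# Illustrative: a square 4096x4096 projection at r=64 adds ~524K
# trainable parameters against ~16.8M frozen ones (~3%)
```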
### Training Data Summary
| Data Type | Examples | Description |
|---|---|---|
| Thaana script curriculum | ~2,200 | Alphabet, grammar, writing system fundamentals |
| Dictionary entries & proverbs | ~5,800 | Word definitions, cultural expressions |
| Encyclopedic articles | ~63,000 | Formal Dhivehi text from various domains |
| Web-crawled sentences | ~7,059,000 | Large-scale informal Dhivehi text |
| Informal blog content | ~11,600 | Cultural and everyday writing |
| Domain-specific Q&A | ~17,000 | Structured Q&A across 20+ topics |
| Chain-of-thought reasoning | ~43,400 | Multi-step reasoning examples |
| Cultural knowledge | ~2,400 | History, geography, food, crafts, traditions |
| Synthetic SFT conversations | ~100,000 | High-quality multi-turn conversations |
| Preference pairs (DPO) | ~10,000 | Chosen vs. rejected responses |
| Concise response SFT | ~3,600 | Short, focused answer examples |
| Preference pairs (SimPO) | ~260 | Final alignment pairs |
| Total | ~7,318,000 | ~930 MB |
### Key Technical Decisions
- No flash attention: the Gemma 3 architecture requires SDPA (scaled dot-product attention)
- DDP find_unused_parameters: Required because the vision tower doesn't participate in text-only forward passes
- Sample packing: Enabled for all SFT phases, disabled for DPO/SimPO
- LoRA preserves multimodality: Base weights frozen, vision tower untouched
- Progressive curriculum: Knowledge is built incrementally from script basics to conversational fluency
## Usage

### With Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "naturecodeproject/DhiGemma-12B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "What is the capital of the Maldives?"
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.5,
    top_p=0.85,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### With SGLang (Recommended for Production)

```bash
python3 -m sglang.launch_server \
  --model-path naturecodeproject/DhiGemma-12B \
  --tp 8 \
  --dtype bfloat16 \
  --mem-fraction-static 0.85 \
  --context-length 4096
```
### With vLLM

```bash
vllm serve naturecodeproject/DhiGemma-12B \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-model-len 4096
```
### Inference Parameters

Recommended generation parameters:

| Parameter | Value |
|---|---|
| `temperature` | 0.5 |
| `top_p` | 0.85 |
| `repetition_penalty` | 1.15 |
| `max_tokens` | 512 (normal), 4096 (long-form) |
## Chat Template
DhiGemma uses Gemma 3's native chat template with user and model roles. There is no system role.
Recommended approach: Use a multi-turn pattern where system instructions are provided as a separate user turn, followed by a short assistant acknowledgment, then the actual user query. This prevents the model from echoing or repeating system instructions in its response.
```
<start_of_turn>user
[System instructions here]<end_of_turn>
<start_of_turn>model
ލައްބަ.<end_of_turn>
<start_of_turn>user
[Actual user question]<end_of_turn>
<start_of_turn>model
```
Example (OpenAI-compatible API):
```json
{
  "messages": [
    {"role": "user", "content": "ތިބާ އަކީ ދިވެހި ބަހުގެ އެހީތެރިއެކެވެ. ސުވާލަށް ސީދާ، ކުރު، ފުރިހަމަ ޖަވާބެއް ދޭށެވެ."},
    {"role": "assistant", "content": "ލައްބަ."},
    {"role": "user", "content": "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟"}
  ]
}
```
Note: Concatenating system instructions directly into the user message (e.g., `"[instructions]\n\n[question]"`) can cause the model to echo the instructions or restate the question before answering. The multi-turn separation avoids this.
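For clients that assemble prompts by hand rather than through `apply_chat_template`, the turn format can be rendered like this. This is a sketch; `to_gemma_prompt` is a hypothetical helper, not part of the model repository.

```python
def to_gemma_prompt(messages):
    """Render OpenAI-style messages into Gemma 3's turn format.
    Maps the 'assistant' role onto Gemma's 'model' role; no system role exists."""
    role_map = {"user": "user", "assistant": "model"}
    out = []
    for msg in messages:
        out.append(f"<start_of_turn>{role_map[msg['role']]}\n"
                   f"{msg['content']}<end_of_turn>\n")
    out.append("<start_of_turn>model\n")  # generation prompt for the reply
    return "".join(out)
```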
## Capabilities
- Dhivehi conversational Q&A
- Thaana script reading and writing
- Maldivian cultural knowledge (history, geography, food, crafts, traditions)
- Dhivehi grammar and language explanations
- Creative writing in Dhivehi (essays, letters, stories)
- Basic reasoning and chain-of-thought in Dhivehi
- Code-switching between Dhivehi and English
## Limitations
- No formal benchmark: Dhivehi language benchmarks do not currently exist. Quality was evaluated qualitatively.
- Repetition: Like other Gemma 3 models, may occasionally produce repetitive text, especially in long-form generation. Use `repetition_penalty=1.15` to mitigate.
- Knowledge cutoff: Training data primarily covers content available up to early 2025.
- Vision: While the vision tower is preserved in the weights, it was not fine-tuned for Dhivehi visual tasks.
- Context: Optimized for 4096 token context length.
## Hardware Requirements
| Configuration | Memory | Notes |
|---|---|---|
| BF16 (full precision) | ~24 GB | Requires 1x A100 80GB or 2x RTX 4090 |
| 8-bit quantization | ~12 GB | Single RTX 4090 |
| 4-bit quantization | ~6 GB | Consumer GPUs (RTX 3060+) |
| Multi-GPU (TP=8) | ~3 GB/GPU | Production deployment on H100s |
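The memory figures above are roughly weight storage alone: parameter count times bytes per parameter, ignoring activations, KV cache, and framework overhead. A quick sanity-check sketch:

```python
def weight_memory_gib(n_params=12.5e9, bits_per_param=16):
    # Weights only: bf16 = 16 bits; 8-bit and 4-bit quantization
    # shrink the footprint proportionally
    return n_params * bits_per_param / 8 / 2**30
```

At 12.5B parameters this gives roughly 23 GiB for bf16, ~12 GiB at 8-bit, and ~6 GiB at 4-bit, matching the table; leave headroom for the KV cache and activations on top of these figures.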
## Citation
If you use DhiGemma in your research or applications, please cite:
```bibtex
@misc{dhigemma2025,
  title={DhiGemma 12B: A Dhivehi Language Model},
  author={Naturecode},
  year={2025},
  url={https://huggingface.co/naturecodeproject/DhiGemma-12B}
}
```
## Acknowledgments
Built on Google's Gemma 3 architecture. Training infrastructure provided by Google Cloud.