# mgpt2-sft — Multilingual GPT-2 (Instruction-Tuned)
mgpt2 fine-tuned on 30,000 multilingual instruction–response pairs across 5 language variants:
English, Hindi (Devanagari), Hindi (Latin transliteration), Kannada (Kannada script), and Kannada
(Latin transliteration). Training data from ai4bharat/indic-align (Anudesh, Dolly-T, OpenAssistant-T).
Built on top of the pretrained mgpt2 base — same 124M architecture, same custom multilingual tokenizer.
Uses masked cross-entropy (loss computed over response tokens only).
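The masking can be illustrated with PyTorch's `ignore_index` mechanism (a minimal sketch, not the repo's actual training code; the shapes and the 4-token prompt are made up):

```python
import torch
import torch.nn.functional as F

vocab_size = 50304
seq = torch.randint(0, vocab_size, (1, 10))   # prompt + response token ids
logits = torch.randn(1, 10, vocab_size)       # dummy model outputs

# Mask the prompt: positions labelled -100 are skipped by cross_entropy.
labels = seq.clone()
labels[:, :4] = -100                          # first 4 tokens = prompt

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)     # mean CE over response tokens only
```

In real next-token training the labels are additionally shifted one position relative to the logits; the sketch skips that to isolate the masking.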
## Quick start

```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-sft")
sys.path.insert(0, local)
from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Prompt: plain text, no special template needed
prompts = [
    "What is the capital of Karnataka?",  # English
    "कर्नाटक की राजधानी क्या है?",              # Hindi (Devanagari)
    "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು?",          # Kannada script
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):
            logits, _ = model(x[:, -1024:])                    # crop to context length
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:                        # EOS token
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()
```
## Intended use
Good for:
- Multilingual Q&A and instruction following (en/hi/kn, native + romanised scripts)
- Downstream fine-tuning starting point for Indic NLP tasks
- Research: multilingual instruction tuning at small scale
Not for: Safety-critical applications. Native-script variants (Devanagari, Kannada) are more reliable than transliterated Latin variants, which are prone to mid-generation script drift (known limitation — see training notes).
## Model details
| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | SFT (instruction-tuned) |
| Git commit | d07224070033 |
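As a sanity check on the ~124M figure, the parameter count follows from the config above. This is a back-of-envelope sketch assuming the standard GPT-2 layout with tied embedding/LM-head weights and the padded 50,304-entry vocabulary:

```python
# Rough parameter count for a 12-layer / 12-head / 768-d GPT-2.
n_layer, d, vocab, ctx = 12, 768, 50304, 1024

tok_emb = vocab * d                # token embeddings (tied with LM head)
pos_emb = ctx * d                  # learned positional embeddings
per_block = (
    4 * d * d + 4 * d              # attention: qkv + output proj (+ biases)
    + 8 * d * d + 5 * d            # MLP: d -> 4d and 4d -> d (+ biases)
    + 4 * d                        # two LayerNorms (scale + shift each)
)
total = tok_emb + pos_emb + n_layer * per_block + 2 * d  # + final LayerNorm
print(f"{total / 1e6:.1f}M")
```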
## Training configuration

| Parameter | Value |
|---|---|
| seed | 1337 |
| batch_size | 64 |
| micro_batch_size | 8 |
| epochs | 3 |
| warmup_steps | 50 |
| max_lr | 0.0003 |
| min_lr_ratio | 0.1 |
| weight_decay | 0.1 |
| eval_interval | 50 |
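Note that batch_size 64 with micro_batch_size 8 implies 8 gradient-accumulation steps. The warmup and min_lr_ratio settings suggest linear warmup followed by decay to 0.1 × max_lr over the 1,262 training steps; the decay shape is not stated in this card, so the cosine form below is an assumption:

```python
import math

def lr_at(step, total_steps=1262, warmup=50, max_lr=3e-4, min_ratio=0.1):
    """Linear warmup, then (assumed) cosine decay to min_ratio * max_lr."""
    min_lr = min_ratio * max_lr
    if step < warmup:
        return max_lr * (step + 1) / warmup          # linear ramp up
    t = (step - warmup) / max(1, total_steps - warmup)  # 0 -> 1 over decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```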
## Evaluation
| Metric | Value | Notes |
|---|---|---|
| Val loss (masked CE) | 1.2404 | Response tokens only, held-out SFT set |
| Val PPL (SFT set) | 3.46 | Not comparable to pretrain LM PPL |
| Training steps | 1262 | 3 epochs over 30K examples |
SFT val PPL is measured on the SFT held-out set (narrower domain) and is not comparable to the pretrain LM eval PPL (12.4), which measures general language modelling ability.
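The PPL figure is simply the exponential of the mean masked CE loss:

```python
import math

val_loss = 1.2404        # masked CE on held-out SFT responses
ppl = math.exp(val_loss)  # perplexity = exp(mean cross-entropy)
print(round(ppl, 2))      # 3.46, matching the table
```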
## Training data

| Language | Count | Source |
|---|---|---|
| English (eng_Latn) | 16,500 | ai4bharat/indic-align Anudesh |
| Hindi Devanagari (hin_Deva) | 5,400 | indic-align Dolly-T + OpenAssistant-T |
| Kannada script (kan_Knda) | 3,900 | indic-align Dolly-T + OpenAssistant-T |
| Hindi Latin translit (hin_Latn) | 2,100 | indic-align Dolly-T + OpenAssistant-T |
| Kannada Latin translit (kan_Latn) | 2,100 | indic-align Dolly-T + OpenAssistant-T |
30,000 examples total. 90/10 train/val split. Masked CE — loss computed over response tokens only.
## Tokenizer
Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:
| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
Tokenizer published separately: ace-1/mgpt2-tokenizer
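tok/kB here means encoded tokens per kilobyte of UTF-8 text, so lower is better. The exact measurement protocol behind the table isn't specified; a minimal way to compute the metric for any tokenizer exposing an `encode` method would be:

```python
def tokens_per_kb(encode, text: str) -> float:
    """Tokens emitted per kilobyte of UTF-8 input (lower = better compression)."""
    n_bytes = len(text.encode("utf-8"))
    return len(encode(text)) / (n_bytes / 1024)

# Toy check with a byte-level "tokenizer": one token per byte -> 1024 tok/kB.
byte_encode = lambda s: list(s.encode("utf-8"))
print(tokens_per_kb(byte_encode, "ಕರ್ನಾಟಕ"))  # 1024.0 for any text under this encoder
```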
## Known limitations

- Transliterated Latin script drift. `hin_Latn` and `kan_Latn` outputs may switch scripts mid-generation. Cause: ASCII tokens are shared with English, so there is no Unicode anchor holding the model to one script. Mitigated but not eliminated at this data scale.
- 124M parameters. Factual accuracy and multi-step reasoning are limited.
- No safety alignment. The SFT model was trained on benign instruction data only; it may attempt to answer harmful prompts. Use the DPO variant for light safety alignment.
- Research checkpoint — not evaluated for production use.
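Given the script-drift limitation above, a cheap way to flag drifted outputs from the romanised variants is to scan for Indic codepoints. This is a hypothetical post-processing check, not part of the released code:

```python
# Unicode block ranges: Devanagari U+0900-U+097F, Kannada U+0C80-U+0CFF.
DEVANAGARI = range(0x0900, 0x0980)
KANNADA = range(0x0C80, 0x0D00)

def has_script_drift(text: str) -> bool:
    """True if a supposedly-Latin output contains Indic-script characters."""
    return any(ord(ch) in DEVANAGARI or ord(ch) in KANNADA for ch in text)

print(has_script_drift("namaste, aap kaise hain?"))  # False: pure Latin
print(has_script_drift("namaste, आप कैसे हैं?"))      # True: drifted to Devanagari
```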
## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-sft}
}
```