mgpt2-sft — Multilingual GPT-2 (Instruction-Tuned)

mgpt2 fine-tuned on 30,000 multilingual instruction–response pairs across 5 language variants: English, Hindi (Devanagari), Hindi (Latin transliteration), Kannada (Kannada script), and Kannada (Latin transliteration). Training data from ai4bharat/indic-align (Anudesh, Dolly-T, OpenAssistant-T).

Built on top of the pretrained mgpt2 base — same 124M architecture, same custom multilingual tokenizer. Uses masked cross-entropy (loss computed over response tokens only).
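"Masked cross-entropy" here means prompt tokens are excluded from the loss. The masking can be sketched with PyTorch's `ignore_index` on toy tensors (illustrative only; the actual training code may differ, e.g. it would also shift labels by one position for next-token prediction):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, T, prompt_len = 11, 6, 3                  # toy vocab / sequence / prompt sizes
logits = torch.randn(1, T, V)                # model output for one sequence
labels = torch.randint(0, V, (1, T))         # target token ids
masked = labels.clone()
masked[:, :prompt_len] = -100                # prompt positions contribute no loss

loss = F.cross_entropy(logits.view(-1, V), masked.view(-1), ignore_index=-100)
# Identical to plain CE computed over the response tokens alone
```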

Quick start

```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

# Download the checkpoint and make its bundled modules importable
local = snapshot_download("ace-1/mgpt2-sft")
sys.path.insert(0, local)
from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Prompts are plain text; no special chat template needed
prompts = [
    "What is the capital of Karnataka?",                   # English
    "कर्नाटक की राजधानी क्या है?",                          # Hindi (Devanagari)
    "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು?",                           # Kannada script
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):                               # max new tokens
            logits, _ = model(x[:, -1024:])                # crop to context length
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:                    # <|endoftext|>
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt  : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()
```
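The loop above uses plain temperature sampling. A top-p (nucleus) filter is a common drop-in alternative that suppresses low-probability degenerate tokens; this helper is not part of the card's reference code, just a sketch:

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    sorted_p, idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_p, dim=-1)
    sorted_p[cum - sorted_p > p] = 0.0       # zero out tokens outside the nucleus
    sorted_p /= sorted_p.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_p, num_samples=1)
    return idx.gather(-1, choice)            # map back to original vocab ids

# Tokens outside the nucleus are never picked: only ids 0 and 1 survive p=0.6 here
print(sample_top_p(torch.tensor([[0.5, 0.3, 0.1, 0.1]]), p=0.6))
```

To use it, replace `torch.multinomial(probs, num_samples=1)` in the loop with `sample_top_p(probs, p=0.9)`.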

Intended use

Good for:

  • Multilingual Q&A and instruction following (en/hi/kn, native + romanised scripts)
  • Downstream fine-tuning starting point for Indic NLP tasks
  • Research: multilingual instruction tuning at small scale

Not for: Safety-critical applications. Native-script variants (Devanagari, Kannada) are more reliable than transliterated Latin variants, which are prone to mid-generation script drift (known limitation — see training notes).

Model details

| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | SFT (instruction-tuned) |
| Git commit | d07224070033 |
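The ~124M figure follows directly from the config. A quick sanity check, assuming GPT-2-style weights (tied input/output embeddings, biases on all linear layers, padded vocab of 50,304):

```python
# GPT-2 small dimensions from the table above (padded vocab, 1,024-token context)
n_layer, d, V, T = 12, 768, 50304, 1024

emb = V * d + T * d                          # token + position embeddings (lm_head tied)
per_block = (
    d * 3 * d + 3 * d                        # fused QKV projection + bias
    + d * d + d                              # attention output projection + bias
    + d * 4 * d + 4 * d                      # MLP up-projection + bias
    + 4 * d * d + d                          # MLP down-projection + bias
    + 2 * 2 * d                              # two LayerNorms (scale + shift)
)
total = emb + n_layer * per_block + 2 * d    # plus the final LayerNorm
print(f"{total:,}")                          # 124,475,904 -> "~124M"
```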

Training configuration

| Parameter | Value |
|---|---|
| seed | 1337 |
| batch_size | 64 |
| micro_batch_size | 8 |
| epochs | 3 |
| warmup_steps | 50 |
| max_lr | 0.0003 |
| min_lr_ratio | 0.1 |
| weight_decay | 0.1 |
| eval_interval | 50 |
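The card lists warmup and minimum-LR values but not the decay shape. Assuming the common linear-warmup + cosine-decay schedule (an assumption, not stated by the card), the values above imply:

```python
import math

MAX_LR, MIN_LR_RATIO = 3e-4, 0.1             # from the table above
WARMUP, TOTAL_STEPS = 50, 1262               # warmup_steps; total steps from Evaluation

def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup, then cosine decay to min_lr."""
    min_lr = MAX_LR * MIN_LR_RATIO
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    t = (step - WARMUP) / (TOTAL_STEPS - WARMUP)   # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (MAX_LR - min_lr) * (1 + math.cos(math.pi * t))

# batch_size 64 with micro_batch_size 8 implies 8 gradient-accumulation
# steps per optimizer step (assuming a single device)
accum_steps = 64 // 8
```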

Evaluation

| Metric | Value | Notes |
|---|---|---|
| Val loss (masked CE) | 1.2404 | Response tokens only, held-out SFT set |
| Val PPL (SFT set) | 3.46 | Not comparable to pretrain LM PPL |
| Training steps | 1,262 | 3 epochs over 30K examples |

SFT val PPL is measured on the SFT held-out set (narrower domain) and is not comparable to the pretrain LM eval PPL (12.4), which measures general language modelling ability.
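Perplexity is just the exponential of the cross-entropy loss, so the two table rows are consistent:

```python
import math

val_loss = 1.2404                # masked CE from the table above
ppl = math.exp(val_loss)         # perplexity = e^loss
print(round(ppl, 2))             # 3.46
```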

Training data

| Language | Count | Source |
|---|---|---|
| English (eng_Latn) | 16,500 | ai4bharat/indic-align Anudesh |
| Hindi Devanagari (hin_Deva) | 5,400 | indic-align Dolly-T + OpenAssistant-T |
| Kannada script (kan_Knda) | 3,900 | indic-align Dolly-T + OpenAssistant-T |
| Hindi Latin translit (hin_Latn) | 2,100 | indic-align Dolly-T + OpenAssistant-T |
| Kannada Latin translit (kan_Latn) | 2,100 | indic-align Dolly-T + OpenAssistant-T |

30,000 examples total. 90/10 train/val split. Masked CE — loss computed over response tokens only.

Tokenizer

Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture. Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:

| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |

Tokenizer published separately: ace-1/mgpt2-tokenizer
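The tok/kB figures are a compression metric: tokens emitted per kilobyte of text. Assuming kilobytes of UTF-8 bytes (the card does not spell out the exact definition), the metric can be computed as:

```python
def tokens_per_kb(encode, text: str) -> float:
    """Tokens emitted per kilobyte of UTF-8 text (lower = better compression)."""
    kb = len(text.encode("utf-8")) / 1024
    return len(encode(text)) / kb

# Sanity check with a byte-level identity "tokenizer": 1 token per byte -> 1024 tok/kB
print(tokens_per_kb(lambda s: list(s.encode("utf-8")), "a" * 512))  # 1024.0
```

With the quick-start tokenizer this would be `tokens_per_kb(enc.encode, sample_text)` over a corpus in the relevant bucket.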

Known limitations

  • Transliterated Latin script drift. hin_Latn and kan_Latn may switch scripts mid-generation. Cause: ASCII tokens shared with English; no Unicode anchor. Mitigated but not eliminated at this data scale.
  • 124M parameters. Factual accuracy and multi-step reasoning are limited.
  • No safety alignment. The SFT model was trained on benign instruction data only; it may attempt to answer harmful prompts. Use the DPO variant for light safety alignment.
  • Research checkpoint — not evaluated for production use.
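Script drift in the transliterated variants is easy to detect after generation by inspecting the Unicode script of each character. A minimal checker (illustrative, not part of the repo; coarse name-prefix matching covers the three scripts this model targets):

```python
import unicodedata

SCRIPTS = ("DEVANAGARI", "KANNADA", "LATIN")

def scripts_in(text: str) -> set:
    """Coarse per-character script detection via Unicode character names."""
    found = set()
    for ch in text:
        if not ch.isalpha():                 # skip spaces, digits, punctuation, signs
            continue
        name = unicodedata.name(ch, "")
        for s in SCRIPTS:
            if name.startswith(s):
                found.add(s)
    return found

def has_script_drift(text: str, expected: str = "LATIN") -> bool:
    """True if any script other than the expected one appears in the output."""
    return bool(scripts_in(text) - {expected})
```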

Citation

```bibtex
@misc{mgpt2,
  title     = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year      = {2026},
  note      = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url       = {https://huggingface.co/ace-1/mgpt2-sft}
}
```