mgpt2-dpo — Multilingual GPT-2 (Preference-Aligned)

The recommended model from this project: mgpt2-sft further aligned with Direct Preference Optimization (DPO, β=0.1) on 13,500 toxic preference pairs from ai4bharat/indic-align (HHRLHF-T). Chosen responses are Llama2-70B-Chat safety refusals; rejected responses are raw pretrained-model continuations.

DPO increased the log-probability of the chosen (safety-refusal) responses by 6.1% and decreased the log-probability of the rejected responses by 16.8%, both relative to the SFT checkpoint. Generation comparisons show the SFT model attempts to comply with harmful prompts, while the DPO model redirects. See the project report for full analysis.
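The per-pair objective behind these numbers is the standard DPO loss. A minimal sketch in plain Python, operating on per-sequence summed log-probs (the actual training code works on token-level tensors and batches):

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected,
                  ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probs
    under the policy and the frozen SFT reference model."""
    # Implicit reward margin: how much the policy widened the
    # chosen-vs-rejected log-prob gap relative to the reference.
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

At initialization (policy == reference) the margin is 0 and the loss is log 2 ≈ 0.693; training pushes the margin positive, driving the loss toward 0, which matches the near-zero DPO validation loss reported below.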

Quick start

import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-dpo")
sys.path.insert(0, local)
from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

prompts = [
    "Explain what photosynthesis is.",             # English
    "प्रकाश संश्लेषण क्या है?",                    # Hindi (Devanagari)
    "ದ್ಯುತಿಸಂಶ್ಲೇಷಣೆ ಎಂದರೇನು?",                    # Kannada script
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):                                   # up to 120 new tokens
            logits, _ = model(x[:, -1024:])                    # crop to the 1,024-token context
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:                        # stop at <|endoftext|>
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt  : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()

Intended use

Good for:

  • Multilingual instruction following with light safety alignment (en/hi/kn)
  • Research: DPO alignment dynamics at 124M scale
  • Demo of end-to-end LLM pipeline: pretrain → SFT → DPO

Not for: Production safety-critical applications. Alignment is format-preference alignment (coherent refusals vs incoherent noise), not full safety alignment. At 124M parameters the pretrained model could not generate coherent harmful content, so the DPO preference signal is weaker than production RLHF setups.

Model details

Property        Value
Architecture    GPT-2 (12 layers / 12 heads / 768d)
Parameters      ~124M
Vocabulary      50,257 (mgpt2 BPE), padded to 50,304
Context length  1,024 tokens
Training stage  DPO (preference-aligned)
Git commit      e463752bb14b
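The ~124M figure can be sanity-checked from the dimensions above. A sketch assuming the standard GPT-2 layout (QKV + output projections with biases, 4× MLP, two LayerNorms per block, lm_head weight-tied to the token embedding; the padded 50,304 vocabulary would add slightly more):

```python
def gpt2_params(n_layer=12, d=768, vocab=50257, ctx=1024):
    emb = vocab * d + ctx * d                  # wte + wpe (lm_head tied to wte)
    attn = d * 3 * d + 3 * d + d * d + d       # QKV projection + output projection
    mlp = d * 4 * d + 4 * d + 4 * d * d + d    # 4x-expansion MLP (up + down)
    ln = 2 * 2 * d                             # two LayerNorms per block
    return emb + n_layer * (attn + mlp + ln) + 2 * d   # + final LayerNorm

print(gpt2_params())  # 124439808, i.e. ~124.4M
```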

Training configuration

Parameter         Value
seed              1337
batch_size        32
micro_batch_size  4
beta              0.1
max_lr            1e-06
min_lr_ratio      0.1
warmup_steps      20
epochs            1
weight_decay      0.1
eval_interval     50
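Reading the table: with batch_size 32 and micro_batch_size 4, each optimizer step accumulates gradients over 8 micro-batches. The schedule below sketches a linear-warmup-then-cosine decay consistent with max_lr, min_lr_ratio, and warmup_steps; the exact decay shape used in training is an assumption, as only those three values are recorded:

```python
import math

def lr_at(step, total_steps, max_lr=1e-6, min_lr_ratio=0.1, warmup_steps=20):
    """Hypothetical LR schedule: linear warmup, then cosine decay
    from max_lr down to max_lr * min_lr_ratio."""
    min_lr = max_lr * min_lr_ratio
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # linear warmup
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```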

Evaluation

Metric               Value   Notes
Preference win-rate  1.000   Held-out DPO pairs (n=1,496)
DPO val loss         ~0      Training converged fully
SFT loss regression  +1.2%   Within 5% threshold (regression_ok=True)
Chosen log-p Δ       +6.1%   vs SFT checkpoint on same pairs
Rejected log-p Δ     −16.8%  vs SFT checkpoint on same pairs
Preference margin Δ  +29.1%  chosen − rejected margin widened

100% win-rate reflects format-preference alignment (coherent refusals vs word-salad), not full safety alignment. See project report for full generation comparison.
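A "win" here can be read as the policy's implicit DPO reward margin being positive on a held-out pair, i.e. the DPO model widened the chosen-vs-rejected log-probability gap relative to the SFT reference. A sketch of that metric (the project's actual evaluation code may differ in details):

```python
def preference_win_rate(pairs):
    """pairs: iterable of (policy_chosen_logp, policy_rejected_logp,
    ref_chosen_logp, ref_rejected_logp) per held-out preference pair."""
    pairs = list(pairs)
    wins = sum(
        ((pc - rc) - (pr - rr)) > 0    # positive implicit reward margin
        for pc, pr, rc, rr in pairs
    )
    return wins / len(pairs)
```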

Training data

Language                     Count  Chosen source                    Rejected source
English (eng_Latn)           8,250  Llama2-70B-Chat safety refusals  Phase C pretrained mgpt2
Hindi Devanagari (hin_Deva)  2,700  IndicTrans2-translated refusals  Phase C pretrained mgpt2
Kannada script (kan_Knda)    1,950  IndicTrans2-translated refusals  Phase C pretrained mgpt2
Hindi Latin (hin_Latn)       1,050  IndicTrans2 romanisation         Phase C pretrained mgpt2
Kannada Latin (kan_Latn)     1,050  IndicTrans2 romanisation         Phase C pretrained mgpt2

13,500 train / 1,499 val pairs. Source: ai4bharat/indic-align HHRLHF-T config.

Tokenizer

Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture. Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:

Bucket      tiktoken-gpt2  mgpt2       Δ
Overall     480 tok/kB     223 tok/kB  −54%
Devanagari  592 tok/kB     215 tok/kB  −64%
Kannada     981 tok/kB     213 tok/kB  −78%
Latin       257 tok/kB     230 tok/kB  −10%

Tokenizer published separately: ace-1/mgpt2-tokenizer
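The tok/kB figures above can be reproduced with a simple compression metric, assuming it is defined as tokens emitted per 1,000 UTF-8 bytes of input text:

```python
def tokens_per_kb(encode, text):
    """Tokens per kilobyte of UTF-8 input, for any encode(text) -> list
    of token ids (e.g. enc.encode from the quick start above)."""
    n_bytes = len(text.encode("utf-8"))
    return len(encode(text)) / (n_bytes / 1000)
```

With the mgpt2 tokenizer loaded as in the quick start, `tokens_per_kb(enc.encode, devanagari_sample)` on a representative corpus slice should land near the table's per-script values; Indic scripts benefit most because each character costs 3 UTF-8 bytes.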

Known limitations

  • Format-preference alignment, not full safety alignment. At 124M parameters, the pretrained model generates incoherent text for toxic prompts, so the DPO preference signal trains format preference (coherent refusals vs noise) rather than genuine safety reasoning.
  • Transliterated Latin script drift (inherited from SFT checkpoint) — hin_Latn/kan_Latn may switch scripts mid-generation.
  • 124M parameters. Factual accuracy and multi-step reasoning are limited.
  • Research checkpoint — not evaluated for production use.

Citation

@misc{mgpt2,
  title     = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year      = {2026},
  note      = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url       = {https://huggingface.co/ace-1/mgpt2-dpo}
}