# mgpt2-dpo — Multilingual GPT-2 (Preference-Aligned)
Recommended model from this project. mgpt2-sft further aligned with
Direct Preference Optimization (DPO, β=0.1) on 13,500 toxic preference pairs
from ai4bharat/indic-align (HHRLHF-T). Chosen responses are Llama2-70B-Chat
safety refusals; rejected responses are raw pretrained-model continuations.
Relative to the SFT checkpoint, DPO raised the log-probability of chosen (safety-refusal) responses by 6.1% and lowered that of rejected responses by 16.8%. Generation comparisons show the SFT model attempting to comply with harmful prompts where the DPO model redirects. See the project report for the full analysis.
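The preference objective behind these numbers is the standard pairwise DPO loss over policy and reference log-probabilities, with β = 0.1 as listed below. The helper here is an illustrative sketch, not the project's exact implementation; the "win" definition (positive implicit-reward margin) matches the usual DPO convention:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss on summed per-sequence log-probs (Rafailov et al., 2023)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log π/π_ref, chosen
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log π/π_ref, rejected
    margins = beta * (chosen_ratio - rejected_ratio)             # implicit reward margin
    loss = -F.logsigmoid(margins).mean()
    # A pair counts as a "win" when the policy prefers chosen over rejected
    # more strongly than the reference model does.
    win_rate = (margins > 0).float().mean()
    return loss, win_rate
```

Driving `margins` up both shrinks the loss and widens the chosen − rejected preference margin reported in the evaluation table.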
## Quick start
```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-dpo")
sys.path.insert(0, local)

from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

prompts = [
    "Explain what photosynthesis is.",  # English
    "प्रकाश संश्लेषण क्या है?",  # Hindi (Devanagari): "What is photosynthesis?"
    "ದ್ಯುತಿಸಂಶ್ಲೇಷಣೆ ಎಂದರೇನು?",  # Kannada script: "What is photosynthesis?"
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):
            logits, _ = model(x[:, -1024:])  # crop to the 1,024-token context window
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:  # <|endoftext|>
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()
```
## Intended use
Good for:
- Multilingual instruction following with light safety alignment (en/hi/kn)
- Research: DPO alignment dynamics at 124M scale
- Demo of end-to-end LLM pipeline: pretrain → SFT → DPO
Not for: production or safety-critical applications. The alignment here is format-preference alignment (coherent refusals vs. incoherent noise), not full safety alignment. At 124M parameters the pretrained model could not generate coherent harmful content, so the DPO preference signal is weaker than in production RLHF setups.
## Model details
| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE) + padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | DPO (preference-aligned) |
| Git commit | e463752bb14b |
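The ~124M figure can be sanity-checked from the table above. The count below assumes a standard GPT-2 layout (linear layers with biases, tied input/output embeddings) and the padded 50,304-entry embedding table; with the unpadded 50,257 vocabulary the total is slightly smaller:

```python
def gpt2_param_count(vocab=50304, ctx=1024, d=768, layers=12):
    """Parameter count for a standard GPT-2 stack with tied embeddings."""
    emb = vocab * d + ctx * d                     # token + position embeddings
    attn = d * 3 * d + 3 * d + d * d + d          # fused qkv proj + output proj
    mlp = d * 4 * d + 4 * d + 4 * d * d + d       # two linear layers (4x expansion)
    lns = 2 * 2 * d                               # two LayerNorms (weight + bias) per block
    return emb + layers * (attn + mlp + lns) + 2 * d  # + final LayerNorm

print(gpt2_param_count())  # 124475904, i.e. ~124M
```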
## Training configuration

| Parameter | Value |
|---|---|
| seed | 1337 |
| batch_size | 32 |
| micro_batch_size | 4 |
| beta | 0.1 |
| max_lr | 1e-06 |
| min_lr_ratio | 0.1 |
| warmup_steps | 20 |
| epochs | 1 |
| weight_decay | 0.1 |
| eval_interval | 50 |
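With batch_size 32 and micro_batch_size 4, each optimizer step accumulates 32 / 4 = 8 micro-batches. The warmup and min-LR settings imply a schedule like the sketch below (linear warmup, then cosine decay to min_lr_ratio × max_lr); this is a common recipe, not the project's confirmed implementation, and `TOTAL_STEPS` is derived from 13,500 pairs / batch 32 / 1 epoch rather than documented:

```python
import math

MAX_LR, MIN_LR_RATIO, WARMUP = 1e-6, 0.1, 20
TOTAL_STEPS = 422  # ~= 13,500 / 32, derived, not documented

def lr_at(step):
    """Linear warmup to max_lr, then cosine decay to min_lr_ratio * max_lr."""
    min_lr = MIN_LR_RATIO * MAX_LR
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / max(1, TOTAL_STEPS - WARMUP)  # 0 -> 1 over decay
    return min_lr + 0.5 * (MAX_LR - min_lr) * (1 + math.cos(math.pi * progress))
```

The peak rate of 1e-6 is small by pretraining standards, which is typical for DPO: large updates quickly destroy the SFT policy that the reference model anchors against.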
## Evaluation
| Metric | Value | Notes |
|---|---|---|
| Preference win-rate | 1.000 | Held-out DPO pairs (n=1,496) |
| DPO val loss | ~0 | Training converged fully |
| SFT loss regression | +1.2% | Within 5% threshold (regression_ok=True) |
| Chosen log-p Δ | +6.1% | vs SFT checkpoint on same pairs |
| Rejected log-p Δ | −16.8% | vs SFT checkpoint on same pairs |
| Preference margin Δ | +29.1% | chosen − rejected margin widened |
The 100% win-rate reflects format-preference alignment (coherent refusals vs. word salad), not full safety alignment. See the project report for the full generation comparison.
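The chosen/rejected log-p deltas above are computed from summed log-probabilities over response tokens only. A minimal sketch, assuming the `GPT` class from the quick start (which returns a `(logits, loss)` tuple); the project's exact evaluation code may differ:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Summed log-probability the model assigns to response_ids given prompt_ids."""
    ids = torch.tensor(prompt_ids + response_ids).unsqueeze(0)
    with torch.no_grad():
        logits, _ = model(ids[:, :-1])   # logits at position t predict token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]                 # next-token targets, shifted by one
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = len(prompt_ids)
    return token_logps[0, n_prompt - 1:].sum().item()  # response tokens only
```

Running this for each held-out pair on both the SFT and DPO checkpoints gives the chosen/rejected deltas and the widened preference margin.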
## Training data

| Language | Count | Chosen source | Rejected source |
|---|---|---|---|
| English (eng_Latn) | 8,250 | Llama2-70B-Chat safety refusals | Phase C pretrained mgpt2 |
| Hindi Devanagari (hin_Deva) | 2,700 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
| Kannada script (kan_Knda) | 1,950 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
| Hindi Latin (hin_Latn) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |
| Kannada Latin (kan_Latn) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |
13,500 train / 1,499 val pairs. Source: ai4bharat/indic-align HHRLHF-T config.
## Tokenizer
Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:
| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
Tokenizer published separately: ace-1/mgpt2-tokenizer
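The tok/kB numbers above measure tokens emitted per kilobyte of UTF-8 text, so lower is better (fewer tokens for the same bytes). A minimal sketch of the metric; the project's exact bucketing by script may differ:

```python
def tokens_per_kb(token_ids, text):
    """Tokenizer efficiency: tokens emitted per kilobyte of UTF-8 input."""
    n_bytes = len(text.encode("utf-8"))
    return len(token_ids) / (n_bytes / 1024)
```

The gap is largest for Kannada because each Indic character is 3 bytes in UTF-8; a byte-level BPE without Indic-aware merges (like tiktoken-gpt2) fragments those characters into several tokens each, while mgpt2's merge priorities keep them whole.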
## Known limitations
- Format-preference alignment, not full safety alignment. At 124M parameters, the pretrained model generates incoherent text for toxic prompts, so the DPO preference signal trains format preference (coherent refusals vs noise) rather than genuine safety reasoning.
- Transliterated Latin script drift (inherited from the SFT checkpoint): hin_Latn/kan_Latn outputs may switch scripts mid-generation.
- 124M parameters: factual accuracy and multi-step reasoning are limited.
- Research checkpoint — not evaluated for production use.
## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-dpo}
}
```