# mgpt2-dpo — Multilingual GPT-2 (Preference-Aligned)
Recommended model from this project. mgpt2-sft further aligned with
Direct Preference Optimization (DPO, β=0.1) on 13,500 toxic preference pairs
from ai4bharat/indic-align (HHRLHF-T). Chosen responses are Llama2-70B-Chat
safety refusals; rejected responses are raw pretrained-model continuations.
Relative to the SFT checkpoint, DPO raised the log-probability of chosen (safety-refusal) responses by 6.1% and lowered that of rejected responses by 16.8%. Generation comparisons show the SFT model attempting to comply with harmful prompts where the DPO model redirects. See the project report for the full analysis.
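The preference objective behind these numbers is the standard pairwise DPO loss over policy and reference log-probabilities, with β = 0.1 as listed below. The helper here is an illustrative sketch, not the project's exact implementation; the "win" definition (positive implicit-reward margin) matches the usual DPO convention:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss on summed per-sequence log-probs (Rafailov et al., 2023)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log π/π_ref, chosen
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log π/π_ref, rejected
    margins = beta * (chosen_ratio - rejected_ratio)             # implicit reward margin
    loss = -F.logsigmoid(margins).mean()
    # A pair counts as a "win" when the policy prefers chosen over rejected
    # more strongly than the reference model does.
    win_rate = (margins > 0).float().mean()
    return loss, win_rate
```

Driving `margins` up both shrinks the loss and widens the chosen − rejected preference margin reported in the evaluation table.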
## Quick start
```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-dpo")
sys.path.insert(0, local)

from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

prompts = [
    "Explain what photosynthesis is.",  # English
    "प्रकाश संश्लेषण क्या है?",  # Hindi (Devanagari): "What is photosynthesis?"
    "ದ್ಯುತಿಸಂಶ್ಲೇಷಣೆ ಎಂದರೇನು?",  # Kannada script: "What is photosynthesis?"
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):
            logits, _ = model(x[:, -1024:])  # crop to the 1,024-token context window
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:  # <|endoftext|>
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()
```
## Intended use
Good for:
- Multilingual instruction following with light safety alignment (en/hi/kn)
- Research: DPO alignment dynamics at 124M scale
- Demo of end-to-end LLM pipeline: pretrain → SFT → DPO
Not for: production or safety-critical applications. The alignment here is format-preference alignment (coherent refusals vs. incoherent noise), not full safety alignment. At 124M parameters the pretrained model could not generate coherent harmful content, so the DPO preference signal is weaker than in production RLHF setups.
## Model details
| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE) + padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | DPO (preference-aligned) |
| Git commit | e463752bb14b |
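The ~124M figure can be sanity-checked from the table above. The count below assumes a standard GPT-2 layout (linear layers with biases, tied input/output embeddings) and the padded 50,304-entry embedding table; with the unpadded 50,257 vocabulary the total is slightly smaller:

```python
def gpt2_param_count(vocab=50304, ctx=1024, d=768, layers=12):
    """Parameter count for a standard GPT-2 stack with tied embeddings."""
    emb = vocab * d + ctx * d                     # token + position embeddings
    attn = d * 3 * d + 3 * d + d * d + d          # fused qkv proj + output proj
    mlp = d * 4 * d + 4 * d + 4 * d * d + d       # two linear layers (4x expansion)
    lns = 2 * 2 * d                               # two LayerNorms (weight + bias) per block
    return emb + layers * (attn + mlp + lns) + 2 * d  # + final LayerNorm

print(gpt2_param_count())  # 124475904, i.e. ~124M
```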
## Training configuration

| Parameter | Value |
|---|---|
| seed | 1337 |
| batch_size | 32 |
| micro_batch_size | 4 |
| beta | 0.1 |
| max_lr | 1e-06 |
| min_lr_ratio | 0.1 |
| warmup_steps | 20 |
| epochs | 1 |
| weight_decay | 0.1 |
| eval_interval | 50 |
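With batch_size 32 and micro_batch_size 4, each optimizer step accumulates 32 / 4 = 8 micro-batches. The warmup and min-LR settings imply a schedule like the sketch below (linear warmup, then cosine decay to min_lr_ratio × max_lr); this is a common recipe, not the project's confirmed implementation, and `TOTAL_STEPS` is derived from 13,500 pairs / batch 32 / 1 epoch rather than documented:

```python
import math

MAX_LR, MIN_LR_RATIO, WARMUP = 1e-6, 0.1, 20
TOTAL_STEPS = 422  # ~= 13,500 / 32, derived, not documented

def lr_at(step):
    """Linear warmup to max_lr, then cosine decay to min_lr_ratio * max_lr."""
    min_lr = MIN_LR_RATIO * MAX_LR
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / max(1, TOTAL_STEPS - WARMUP)  # 0 -> 1 over decay
    return min_lr + 0.5 * (MAX_LR - min_lr) * (1 + math.cos(math.pi * progress))
```

The peak rate of 1e-6 is small by pretraining standards, which is typical for DPO: large updates quickly destroy the SFT policy that the reference model anchors against.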
## Evaluation
| Metric | Value | Notes |
|---|---|---|
| Preference win-rate | 1.000 | Held-out DPO pairs (n=1,496) |
| DPO val loss | ~0 | Training converged fully |
| SFT loss regression | +1.2% | Within 5% threshold (regression_ok=True) |
| Chosen log-p Δ | +6.1% | vs SFT checkpoint on same pairs |
| Rejected log-p Δ | −16.8% | vs SFT checkpoint on same pairs |
| Preference margin Δ | +29.1% | chosen − rejected margin widened |
The 100% win-rate reflects format-preference alignment (coherent refusals vs. word salad), not full safety alignment. See the project report for the full generation comparison.
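The chosen/rejected log-p deltas above are computed from summed log-probabilities over response tokens only. A minimal sketch, assuming the `GPT` class from the quick start (which returns a `(logits, loss)` tuple); the project's exact evaluation code may differ:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Summed log-probability the model assigns to response_ids given prompt_ids."""
    ids = torch.tensor(prompt_ids + response_ids).unsqueeze(0)
    with torch.no_grad():
        logits, _ = model(ids[:, :-1])   # logits at position t predict token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]                 # next-token targets, shifted by one
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = len(prompt_ids)
    return token_logps[0, n_prompt - 1:].sum().item()  # response tokens only
```

Running this for each held-out pair on both the SFT and DPO checkpoints gives the chosen/rejected deltas and the widened preference margin.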
## Training data

| Language | Count | Chosen source | Rejected source |
|---|---|---|---|
| English (eng_Latn) | 8,250 | Llama2-70B-Chat safety refusals | Phase C pretrained mgpt2 |
| Hindi Devanagari (hin_Deva) | 2,700 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
| Kannada script (kan_Knda) | 1,950 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
| Hindi Latin (hin_Latn) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |
| Kannada Latin (kan_Latn) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |
13,500 train / 1,499 val pairs. Source: ai4bharat/indic-align HHRLHF-T config.
## Tokenizer
Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:
| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
Tokenizer published separately: ace-1/mgpt2-tokenizer
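The tok/kB numbers above measure tokens emitted per kilobyte of UTF-8 text, so lower is better (fewer tokens for the same bytes). A minimal sketch of the metric; the project's exact bucketing by script may differ:

```python
def tokens_per_kb(token_ids, text):
    """Tokenizer efficiency: tokens emitted per kilobyte of UTF-8 input."""
    n_bytes = len(text.encode("utf-8"))
    return len(token_ids) / (n_bytes / 1024)
```

The gap is largest for Kannada because each Indic character is 3 bytes in UTF-8; a byte-level BPE without Indic-aware merges (like tiktoken-gpt2) fragments those characters into several tokens each, while mgpt2's merge priorities keep them whole.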
## Known limitations
- Format-preference alignment, not full safety alignment. At 124M parameters, the pretrained model generates incoherent text for toxic prompts, so the DPO preference signal trains format preference (coherent refusals vs noise) rather than genuine safety reasoning.
- Transliterated Latin script drift (inherited from the SFT checkpoint): hin_Latn/kan_Latn outputs may switch scripts mid-generation.
- 124M parameters: factual accuracy and multi-step reasoning are limited.
- Research checkpoint — not evaluated for production use.
## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-dpo}
}
```