# mgpt2-pretrain — Multilingual GPT-2 (Pretrained)
GPT-2 (124M parameters) trained from scratch on a multilingual corpus covering English,
Hindi (Devanagari + Latin transliteration), and Kannada (Kannada script + Latin transliteration).
Trained with a custom BPE tokenizer (mgpt2) that achieves 54% better compression than
tiktoken-gpt2 and 38% better than tiktoken-cl100k on the same corpus.
This is the base pretrained model. It is a causal language model and will continue text, not follow instructions. See the SFT and DPO variants for instruction-following versions.
## Quick start
```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-pretrain")
sys.path.insert(0, local)

from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

# Load model
ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Generate (temperature 0.8, context cropped to the 1,024-token window)
prompt = "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ"  # "Capital of Karnataka"
ids = enc.encode(prompt)
x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)

with torch.no_grad():
    for _ in range(80):
        logits, _ = model(x[:, -1024:])
        probs = F.softmax(logits[:, -1, :] / 0.8, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        if next_id.item() == 50256:  # <|endoftext|>
            break
        x = torch.cat([x, next_id], dim=1)

print(enc.decode(x[0].tolist()))
```
## Intended use
Good for:
- Research: study multilingual pretraining dynamics
- Base for fine-tuning on Indic-language tasks
- Tokenizer efficiency benchmarking against tiktoken baselines
Not for: Direct end-user applications (not instruction-tuned; not safety-filtered).
## Model details
| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | Pretrained |
| Git commit | 8aa064aaf1da |
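The ~124M figure can be sanity-checked from the table above. Assuming the standard GPT-2 block shapes and weight tying between the input embeddings and the LM head (a back-of-the-envelope sketch, not the repo's actual module layout):

```python
# Rough GPT-2 parameter count from the Model details table:
# 12 layers, d_model = 768, padded vocab 50,304, context 1,024.
d, n_layer, vocab, ctx = 768, 12, 50_304, 1024

embed = vocab * d + ctx * d            # token + position embeddings
per_block = (
    4 * d * d + 4 * d                  # attention: qkv (d->3d) + out proj (+ biases)
    + 8 * d * d + 5 * d                # MLP: d->4d and 4d->d (+ biases)
    + 4 * d                            # two LayerNorms (weight + bias each)
)
final_ln = 2 * d
total = embed + n_layer * per_block + final_ln  # lm_head tied with embeddings

print(f"{total / 1e6:.1f}M parameters")  # → 124.5M parameters
```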
## Training configuration
| Parameter | Value |
|---|---|
| seed | 1337 |
| block_size | 1024 |
| total_batch_size | 524288 |
| micro_batch_size | 64 |
| max_steps | 27538 |
| warmup_steps | 715 |
| max_lr | 0.003 |
| min_lr_ratio | 0.1 |
| weight_decay | 0.1 |
| eval_interval | 250 |
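The schedule implied by `warmup_steps`, `max_lr`, and `min_lr_ratio` is presumably the common GPT-2-style recipe of linear warmup followed by cosine decay to `max_lr * min_lr_ratio`. A sketch under that assumption (the repo's actual scheduler may differ):

```python
import math

MAX_LR, MIN_LR_RATIO = 3e-3, 0.1
WARMUP, MAX_STEPS = 715, 27_538
MIN_LR = MAX_LR * MIN_LR_RATIO

def lr_at(step: int) -> float:
    if step < WARMUP:                      # linear warmup to MAX_LR
        return MAX_LR * (step + 1) / WARMUP
    if step >= MAX_STEPS:                  # floor after training ends
        return MIN_LR
    # cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(lr_at(0), lr_at(714), lr_at(27_537))  # ramp start, peak, near the floor
```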
## Evaluation
| Metric | Value | Notes |
|---|---|---|
| Val loss | 2.5003 | Cross-entropy on held-out corpus |
| HellaSwag acc | 0.2869 | 10,042 examples, own tokenizer |
| BPB overall | 0.809 | bits-per-byte, normalised for tokenizer density |
| BPB vs baseline | −0.071 | mgpt2 better on every language bucket |
Raw perplexity (12.4) is higher than the GPT-2-tokenized baseline (3.6), but that comparison is invalid across tokenizers: a denser tokenizer packs more bytes into each token, so each next-token prediction is inherently harder. Bits-per-byte (BPB) normalises for this and reverses the result: mgpt2 wins on every bucket. HellaSwag z = 1.59 (directional, not significant at the 95% CI); BPB is the primary metric.
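The normalisation can be made concrete: BPB converts a per-token cross-entropy (in nats) to bits and divides by the byte count of the evaluated text. A minimal sketch, with token/byte counts chosen to match mgpt2's ~223 tok/kB density for illustration, not the card's actual eval counts:

```python
import math

def bits_per_byte(ce_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """BPB = (total nats / ln 2) bits, spread over the raw byte count."""
    total_bits = ce_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# A denser tokenizer (fewer tokens per byte) can have a *higher* per-token
# loss yet a *lower* BPB, which is why raw perplexity is not comparable
# across tokenizers.
print(bits_per_byte(2.5003, n_tokens=223, n_bytes=1000))  # ≈ 0.80
```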
## Training data
| Split | Source | Weight |
|---|---|---|
| English | FineWeb | 55% |
| Hindi (Devanagari) | AI4Bharat Sangraha verified/hin | 18% |
| Hindi (Latin translit) | AI4Bharat Sangraha synthetic/hin_Latn | 7% |
| Kannada (script) | AI4Bharat Sangraha verified/kan | 13% |
| Kannada (Latin translit) | AI4Bharat Sangraha synthetic/kan_Latn | 7% |
15M documents, globally shuffled, ~40GB raw text. ~14.4B tokens after tokenization (27,537 × 524,288).
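One way to realise such a mixture is per-document weighted sampling. The split names below are shorthand for the sources in the table, and the mechanics are an assumption for illustration, not the repo's actual data loader:

```python
import random

# Mixture weights from the Training data table (sum to 1.0).
MIXTURE = {
    "fineweb_en": 0.55,
    "sangraha_hin": 0.18,
    "sangraha_hin_latn": 0.07,
    "sangraha_kan": 0.13,
    "sangraha_kan_latn": 0.07,
}

def sample_split(rng: random.Random) -> str:
    splits, weights = zip(*MIXTURE.items())
    return rng.choices(splits, weights=weights, k=1)[0]

rng = random.Random(1337)  # seed from the training config
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_split(rng)] += 1
print(counts)  # empirical frequencies track the mixture weights
```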
## Tokenizer
Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:
| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
Tokenizer published separately: ace-1/mgpt2-tokenizer
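A tokens-per-kilobyte figure like those in the table can be reproduced for any tokenizer that exposes an `encode()` method; the `toy_encode` below is a byte-level stand-in, not the mgpt2 tokenizer:

```python
def tokens_per_kb(encode, text: str) -> float:
    """Tokens emitted per kilobyte of UTF-8 text (lower = denser)."""
    n_bytes = len(text.encode("utf-8"))
    return len(encode(text)) / n_bytes * 1024

def toy_encode(text: str) -> list[int]:
    # Byte-level fallback: exactly one token per byte.
    return list(text.encode("utf-8"))

sample = "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಬೆಂಗಳೂರು"  # Kannada script is 3 bytes/char in UTF-8
print(tokens_per_kb(toy_encode, sample))  # byte-level baseline: 1024.0 tok/kB
```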
## Known limitations
- Not instruction-tuned. The model continues text; it does not follow instructions or answer questions reliably.
- Transliterated Latin (hin_Latn, kan_Latn) generation quality is lower than native-script variants — shared ASCII token space with English makes script boundaries ambiguous.
- 124M parameters — significantly smaller than modern LLMs; factual accuracy and reasoning are limited.
- Research checkpoint — not evaluated for safety or production use.
## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-pretrain}
}
```