# mgpt2-pretrain — Multilingual GPT-2 (Pretrained)
GPT-2 (124M parameters) trained from scratch on a multilingual corpus covering English,
Hindi (Devanagari + Latin transliteration), and Kannada (Kannada script + Latin transliteration).
Trained with a custom BPE tokenizer (mgpt2) that achieves 54% better compression than
tiktoken-gpt2 and 38% better than tiktoken-cl100k on the same corpus.
This is the base pretrained model. It is a causal language model and will continue text, not follow instructions. See the SFT and DPO variants for instruction-following versions.
## Quick start
```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-pretrain")
sys.path.insert(0, local)

from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

# Load model
ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Generate (temperature 0.8, context cropped to the 1,024-token window)
prompt = "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ"  # "Capital of Karnataka"
ids = enc.encode(prompt)
x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)

with torch.no_grad():
    for _ in range(80):
        logits, _ = model(x[:, -1024:])
        probs = F.softmax(logits[:, -1, :] / 0.8, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        if next_id.item() == 50256:  # <|endoftext|>
            break
        x = torch.cat([x, next_id], dim=1)

print(enc.decode(x[0].tolist()))
```
## Intended use
Good for:
- Research: study multilingual pretraining dynamics
- Base for fine-tuning on Indic-language tasks
- Tokenizer efficiency benchmarking against tiktoken baselines
Not for: Direct end-user applications (not instruction-tuned; not safety-filtered).
## Model details
| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | Pretrained |
| Git commit | 8aa064aaf1da |
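The ~124M figure can be sanity-checked from the table above. Assuming the standard GPT-2 block shapes and weight tying between the input embeddings and the LM head (a back-of-the-envelope sketch, not the repo's actual module layout):

```python
# Rough GPT-2 parameter count from the Model details table:
# 12 layers, d_model = 768, padded vocab 50,304, context 1,024.
d, n_layer, vocab, ctx = 768, 12, 50_304, 1024

embed = vocab * d + ctx * d            # token + position embeddings
per_block = (
    4 * d * d + 4 * d                  # attention: qkv (d->3d) + out proj (+ biases)
    + 8 * d * d + 5 * d                # MLP: d->4d and 4d->d (+ biases)
    + 4 * d                            # two LayerNorms (weight + bias each)
)
final_ln = 2 * d
total = embed + n_layer * per_block + final_ln  # lm_head tied with embeddings

print(f"{total / 1e6:.1f}M parameters")  # → 124.5M parameters
```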
## Training configuration
| Parameter | Value |
|---|---|
| seed | 1337 |
| block_size | 1024 |
| total_batch_size | 524288 |
| micro_batch_size | 64 |
| max_steps | 27538 |
| warmup_steps | 715 |
| max_lr | 0.003 |
| min_lr_ratio | 0.1 |
| weight_decay | 0.1 |
| eval_interval | 250 |
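The schedule implied by `warmup_steps`, `max_lr`, and `min_lr_ratio` is presumably the common GPT-2-style recipe of linear warmup followed by cosine decay to `max_lr * min_lr_ratio`. A sketch under that assumption (the repo's actual scheduler may differ):

```python
import math

MAX_LR, MIN_LR_RATIO = 3e-3, 0.1
WARMUP, MAX_STEPS = 715, 27_538
MIN_LR = MAX_LR * MIN_LR_RATIO

def lr_at(step: int) -> float:
    if step < WARMUP:                      # linear warmup to MAX_LR
        return MAX_LR * (step + 1) / WARMUP
    if step >= MAX_STEPS:                  # floor after training ends
        return MIN_LR
    # cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(lr_at(0), lr_at(714), lr_at(27_537))  # ramp start, peak, near the floor
```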
## Evaluation
| Metric | Value | Notes |
|---|---|---|
| Val loss | 2.5003 | Cross-entropy on held-out corpus |
| HellaSwag acc | 0.2869 | 10,042 examples, own tokenizer |
| BPB overall | 0.809 | bits-per-byte, normalised for tokenizer density |
| BPB vs baseline | −0.071 | mgpt2 better on every language bucket |
Raw perplexity (12.4) is higher than the GPT-2-tokenized baseline (3.6), but that comparison is invalid across tokenizers: a denser tokenizer packs more bytes into each token, so each next-token prediction is inherently harder. Bits-per-byte (BPB) normalises for this and reverses the result: mgpt2 wins on every bucket. HellaSwag z = 1.59 (directional, not significant at the 95% CI); BPB is the primary metric.
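The normalisation can be made concrete: BPB converts a per-token cross-entropy (in nats) to bits and divides by the byte count of the evaluated text. A minimal sketch, with token/byte counts chosen to match mgpt2's ~223 tok/kB density for illustration, not the card's actual eval counts:

```python
import math

def bits_per_byte(ce_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """BPB = (total nats / ln 2) bits, spread over the raw byte count."""
    total_bits = ce_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# A denser tokenizer (fewer tokens per byte) can have a *higher* per-token
# loss yet a *lower* BPB, which is why raw perplexity is not comparable
# across tokenizers.
print(bits_per_byte(2.5003, n_tokens=223, n_bytes=1000))  # ≈ 0.80
```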
## Training data
| Split | Source | Weight |
|---|---|---|
| English | FineWeb | 55% |
| Hindi (Devanagari) | AI4Bharat Sangraha verified/hin | 18% |
| Hindi (Latin translit) | AI4Bharat Sangraha synthetic/hin_Latn | 7% |
| Kannada (script) | AI4Bharat Sangraha verified/kan | 13% |
| Kannada (Latin translit) | AI4Bharat Sangraha synthetic/kan_Latn | 7% |
15M documents, globally shuffled, ~40GB raw text. ~14.4B tokens after tokenization (27,537 × 524,288).
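One way to realise such a mixture is per-document weighted sampling. The split names below are shorthand for the sources in the table, and the mechanics are an assumption for illustration, not the repo's actual data loader:

```python
import random

# Mixture weights from the Training data table (sum to 1.0).
MIXTURE = {
    "fineweb_en": 0.55,
    "sangraha_hin": 0.18,
    "sangraha_hin_latn": 0.07,
    "sangraha_kan": 0.13,
    "sangraha_kan_latn": 0.07,
}

def sample_split(rng: random.Random) -> str:
    splits, weights = zip(*MIXTURE.items())
    return rng.choices(splits, weights=weights, k=1)[0]

rng = random.Random(1337)  # seed from the training config
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_split(rng)] += 1
print(counts)  # empirical frequencies track the mixture weights
```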
## Tokenizer
Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:
| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
Tokenizer published separately: ace-1/mgpt2-tokenizer
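A tokens-per-kilobyte figure like those in the table can be reproduced for any tokenizer that exposes an `encode()` method; the `toy_encode` below is a byte-level stand-in, not the mgpt2 tokenizer:

```python
def tokens_per_kb(encode, text: str) -> float:
    """Tokens emitted per kilobyte of UTF-8 text (lower = denser)."""
    n_bytes = len(text.encode("utf-8"))
    return len(encode(text)) / n_bytes * 1024

def toy_encode(text: str) -> list[int]:
    # Byte-level fallback: exactly one token per byte.
    return list(text.encode("utf-8"))

sample = "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಬೆಂಗಳೂರು"  # Kannada script is 3 bytes/char in UTF-8
print(tokens_per_kb(toy_encode, sample))  # byte-level baseline: 1024.0 tok/kB
```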
## Known limitations
- Not instruction-tuned. The model continues text; it does not follow instructions or answer questions reliably.
- Transliterated Latin (hin_Latn, kan_Latn) generation quality is lower than native-script variants — shared ASCII token space with English makes script boundaries ambiguous.
- 124M parameters — significantly smaller than modern LLMs; factual accuracy and reasoning are limited.
- Research checkpoint — not evaluated for safety or production use.
## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-pretrain}
}
```