mgpt2-pretrain — Multilingual GPT-2 (Pretrained)

A GPT-2 model (124M parameters) trained from scratch on a multilingual corpus covering English, Hindi (Devanagari + Latin transliteration), and Kannada (Kannada script + Latin transliteration). It uses a custom BPE tokenizer (mgpt2) that achieves 54% better compression than tiktoken-gpt2 and 38% better than tiktoken-cl100k on the same corpus.

This is the base pretrained model. It is a causal language model and will continue text, not follow instructions. See the SFT and DPO variants for instruction-following versions.

Quick start

```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

# Download the checkpoint and make its bundled modules importable
local = snapshot_download("ace-1/mgpt2-pretrain")
sys.path.insert(0, local)
from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

# Load model
ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Generate: temperature sampling, stopping at end-of-text
prompt = "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ"   # "The capital of Karnataka"
ids = enc.encode(prompt)
x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
with torch.no_grad():
    for _ in range(80):
        logits, _ = model(x[:, -1024:])                    # crop to the 1,024-token context
        probs = F.softmax(logits[:, -1, :] / 0.8, dim=-1)  # temperature 0.8
        next_id = torch.multinomial(probs, num_samples=1)
        if next_id.item() == 50256:                        # <|endoftext|>
            break
        x = torch.cat([x, next_id], dim=1)
print(enc.decode(x[0].tolist()))
```

Intended use

Good for:

  • Research: study multilingual pretraining dynamics
  • Base for fine-tuning on Indic-language tasks
  • Tokenizer efficiency benchmarking against tiktoken baselines

Not for: Direct end-user applications (not instruction-tuned; not safety-filtered).

Model details

| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | Pretrained |
| Git commit | 8aa064aaf1da |
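The ~124M figure can be sanity-checked from the dimensions above. A minimal sketch, assuming standard GPT-2 wiring (tied input/output embeddings, learned positional embeddings, 4× MLP expansion, biases on all linears) and the padded vocabulary of 50,304:

```python
# Rough GPT-2 parameter count from the architecture table.
n_layer, n_head, n_embd = 12, 12, 768
vocab, block = 50_304, 1_024               # padded vocab, context length

embed = vocab * n_embd                     # token embeddings (tied with lm_head)
pos = block * n_embd                       # positional embeddings
per_layer = (
    n_embd * 3 * n_embd + 3 * n_embd       # attention qkv projection
    + n_embd * n_embd + n_embd             # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd     # MLP up-projection
    + 4 * n_embd * n_embd + n_embd         # MLP down-projection
    + 4 * n_embd                           # two LayerNorms (weight + bias each)
)
total = embed + pos + n_layer * per_layer + 2 * n_embd  # + final LayerNorm
print(f"{total/1e6:.1f}M")                 # ~124.5M
```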

Training configuration

| Parameter | Value |
|---|---|
| seed | 1337 |
| block_size | 1024 |
| total_batch_size | 524288 |
| micro_batch_size | 64 |
| max_steps | 27538 |
| warmup_steps | 715 |
| max_lr | 0.003 |
| min_lr_ratio | 0.1 |
| weight_decay | 0.1 |
| eval_interval | 250 |

Evaluation

| Metric | Value | Notes |
|---|---|---|
| Val loss | 2.5003 | Cross-entropy on held-out corpus |
| HellaSwag acc | 0.2869 | 10,042 examples, own tokenizer |
| BPB overall | 0.809 | Bits per byte, normalised for tokenizer density |
| BPB vs baseline | −0.071 | mgpt2 better on every language bucket |

Raw perplexity (12.4) is higher than the GPT-2-tokenized baseline (3.6), but per-token perplexity is not comparable across tokenizers: mgpt2 packs more bytes into each token, so each prediction is harder. Bits-per-byte (BPB) normalises for this and reverses the result: mgpt2 wins on every language bucket. HellaSwag z = 1.59 (directional, not significant at the 95% level); BPB is the primary metric.
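The normalisation behind BPB is a one-liner. A minimal sketch (not the card's evaluation code): total nats over the eval set, converted to bits, divided by the raw byte count, so a tokenizer that emits fewer tokens per byte is not penalised for packing more information into each prediction:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Total nats = mean_loss * n_tokens; convert to bits, divide by bytes."""
    return mean_loss_nats * n_tokens / (math.log(2) * n_bytes)

# At the same per-token loss, a denser tokenizer (fewer tokens for the same
# bytes) yields lower BPB — the fair cross-tokenizer comparison.
```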

Training data

| Split | Source | Weight |
|---|---|---|
| English | FineWeb | 55% |
| Hindi (Devanagari) | AI4Bharat Sangraha verified/hin | 18% |
| Hindi (Latin translit) | AI4Bharat Sangraha synthetic/hin_Latn | 7% |
| Kannada (script) | AI4Bharat Sangraha verified/kan | 13% |
| Kannada (Latin translit) | AI4Bharat Sangraha synthetic/kan_Latn | 7% |

15M documents, globally shuffled, ~40GB raw text. ~14.4B tokens after tokenization (27,537 × 524,288).
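The ~14.4B token budget follows directly from the batch settings; a quick check using the step count quoted above:

```python
# steps x total_batch_size (tokens per optimizer step)
steps, tokens_per_step = 27_537, 524_288
total_tokens = steps * tokens_per_step
print(f"{total_tokens/1e9:.1f}B")          # ~14.4B
```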

Tokenizer

Custom multilingual regex + BPE tokenizer (mgpt2), trained on the same corpus mixture. Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:

| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
|---|---|---|---|
| Overall | 480 tok/kB | 223 tok/kB | −54% |
| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
| Kannada | 981 tok/kB | 213 tok/kB | −78% |
| Latin | 257 tok/kB | 230 tok/kB | −10% |
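The Δ column is just the relative change in token count; a quick check against the tok/kB figures in the table (the computed values match the quoted deltas to within rounding):

```python
# (tiktoken-gpt2 tok/kB, mgpt2 tok/kB) per bucket, from the table above
buckets = {
    "Overall":    (480, 223),
    "Devanagari": (592, 215),
    "Kannada":    (981, 213),
    "Latin":      (257, 230),
}
deltas = {name: (m - g) / g for name, (g, m) in buckets.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.1%}")
```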

Tokenizer published separately: ace-1/mgpt2-tokenizer

Known limitations

  • Not instruction-tuned. The model continues text; it does not follow instructions or answer questions reliably.
  • Transliterated Latin (hin_Latn, kan_Latn) generation quality is lower than native-script variants — shared ASCII token space with English makes script boundaries ambiguous.
  • 124M parameters — significantly smaller than modern LLMs; factual accuracy and reasoning are limited.
  • Research checkpoint — not evaluated for safety or production use.

Citation

```bibtex
@misc{mgpt2,
  title     = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year      = {2026},
  note      = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url       = {https://huggingface.co/ace-1/mgpt2-pretrain}
}
```