GPT-2 Text Opt: Multilingual Orthographic Language Model
A pre-trained GPT-2 language model that operates on standard orthographic text, trained on 18 languages from CulturaX. This is the Text Opt model from the ACL 2026 paper "Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet" by Milan Miletić, Julie Kallini, and Ekaterina Shutova.
The model uses a BPE tokenizer with a 200k-token vocabulary trained on data-proportionally sampled multilingual text, the configuration identified as optimal for text-mode tokenization. It serves as the orthographic baseline against which the IPA-based model (gpt2-ipa-opt) is compared.
Model Details
| Property | Value |
|---|---|
| Architecture | GPT-2 Small (decoder-only Transformer) |
| Layers / Heads / Hidden dim | 12 / 12 / 768 |
| Context length | 2048 tokens |
| Vocabulary size | 200,000 |
| Total parameters | ~240M |
| Tokenizer algorithm | BPE |
| Sampling strategy | Data-proportional |
| Training modality | Orthographic text |
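The ~240M total can be sanity-checked from the other rows of the table. The sketch below is not taken from the model's code; it assumes the standard GPT-2 layout (tied input/output embeddings, learned position embeddings, a 4x MLP expansion, and biases on every linear layer), and the function name `gpt2_param_count` is made up for illustration.

```python
# Rough parameter count for a GPT-2 style decoder.
# Assumptions (not confirmed by the model card): tied input/output embeddings,
# learned position embeddings, 4x MLP expansion, biases on all linear layers.

def gpt2_param_count(vocab_size: int, context_len: int, hidden: int, layers: int) -> int:
    token_emb = vocab_size * hidden                 # token embedding matrix (shared with the LM head)
    pos_emb = context_len * hidden                  # learned position embeddings
    attn = (hidden * 3 * hidden + 3 * hidden)       # fused Q/K/V projection + bias
    attn += hidden * hidden + hidden                # attention output projection + bias
    mlp = (hidden * 4 * hidden + 4 * hidden)        # MLP up-projection + bias
    mlp += 4 * hidden * hidden + hidden             # MLP down-projection + bias
    layer_norms = 2 * (2 * hidden)                  # two LayerNorms per block (scale + shift)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * hidden                           # final LayerNorm
    return token_emb + pos_emb + layers * per_layer + final_ln


total = gpt2_param_count(vocab_size=200_000, context_len=2048, hidden=768, layers=12)
embedding_share = (200_000 * 768) / total

print(f"{total / 1e6:.1f}M parameters")             # 240.2M parameters
print(f"{embedding_share:.0%} in token embeddings")
```

Under these assumptions, most of the parameters sit in the 200,000-row token embedding matrix rather than in the 12 Transformer blocks, which is worth keeping in mind when comparing against models with smaller vocabularies.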
Training
Data: 18 languages from CulturaX (Arabic, German, English, Persian, Finnish, French, Hindi, Italian, Japanese, Korean, Lao, Russian, Serbian, Swahili, Thai, Turkish, Urdu, Chinese).
Data was sampled proportionally to each language's share of the CulturaX corpus, measured in bytes.
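As a minimal sketch of what data-proportional sampling means, each language's sampling probability is its byte count divided by the total. The byte counts, the three-language subset, and the helper name `proportional_weights` below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not CulturaX statistics or the authors' sampling code.

```python
import random

# Hypothetical per-language sizes in bytes (illustrative only, NOT CulturaX statistics).
corpus_bytes = {"en": 8_000, "de": 1_500, "sw": 500}


def proportional_weights(sizes: dict[str, int]) -> dict[str, float]:
    """Return each language's share of the total byte count."""
    total = sum(sizes.values())
    return {lang: n / total for lang, n in sizes.items()}


weights = proportional_weights(corpus_bytes)

# Draw the language of each training document according to those shares.
rng = random.Random(42)
langs = list(weights)
sample = rng.choices(langs, weights=[weights[lang] for lang in langs], k=10)

print(weights)
print(sample)
```

With this scheme a language that makes up 5% of the bytes is drawn about 5% of the time, so high-resource languages dominate the tokenizer and model training data in the same ratio as they dominate the corpus.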
Hyperparameters:
| Parameter | Value |
|---|---|
| Training steps | 10,000 |
| Effective batch size | 64 |
| Learning rate | 5 × 10⁻⁴ |
| LR schedule | Cosine with warmup |
| Warmup ratio | 0.03 |
| Weight decay | 0.1 |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product) |
| Seed | 42 |
| Hardware | NVIDIA H100 (Snellius HPC) |
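To make the schedule rows concrete: a warmup ratio of 0.03 over 10,000 steps gives 300 warmup steps. The sketch below assumes linear warmup from zero followed by cosine decay to zero, which is a common form of "cosine with warmup"; the model card does not state the exact warmup shape or the final learning rate, and the function name `lr_at` is made up for illustration.

```python
import math

TOTAL_STEPS = 10_000
WARMUP_RATIO = 0.03
PEAK_LR = 5e-4


def lr_at(step: int,
          total_steps: int = TOTAL_STEPS,
          warmup_ratio: float = WARMUP_RATIO,
          peak_lr: float = PEAK_LR) -> float:
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at step 300 and falls to half the peak at step 5,150, the midpoint of the decay phase.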
Usage
from transformers import GPT2LMHeadModel
from huggingface_hub import hf_hub_download
import sentencepiece as spm
import torch
# Load model
model = GPT2LMHeadModel.from_pretrained("Mikki99/gpt2-text-opt")
model.eval()
# Load SentencePiece tokenizer
tok_path = hf_hub_download(repo_id="Mikki99/gpt2-text-opt", filename="tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)
# Tokenize orthographic text directly
text = "Stanford"
input_ids = torch.tensor([sp.Encode(text, out_type=int)])
with torch.no_grad():
    output = model(input_ids)
    # Next-token logits, shape: (batch, sequence length, vocabulary size)
    logits = output.logits
Intended Use
This model is a research artifact released alongside an ACL 2026 paper. Its primary purpose is to serve as an orthographic baseline for comparison with the IPA-based model (gpt2-ipa-opt), and to allow reproduction of the paper's results.
Citation
@inproceedings{miletic-etal-2026-phonemes,
title = "Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet",
author = "Mileti{\'c}, Milan and
Kallini, Julie and
Shutova, Ekaterina",
editor = "Liakata, Maria and
Moreira, Viviane P. and
Zhang, Jiajun and
Jurgens, David",
booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
month = jul,
year = "2026",
address = "San Diego, California, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.acl-long.1872/",
pages = "40323--40349",
ISBN = "979-8-89176-390-6",
abstract = "Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA provides a compact symbol inventory, greater cross-lingual character overlap, and a more balanced byte-per-character distribution across languages. We train matched pairs of text vs. IPA subword tokenizers across 24 languages and 14 scripts and demonstrate that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts."
}