GPT-2 IPA Opt — Multilingual IPA Language Model


Pre-trained GPT-2 language model operating on International Phonetic Alphabet (IPA) representations, trained on 18 languages from CulturaX. This is the IPA Opt model from the ACL 2026 paper "Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet" by Milan Miletić, Julie Kallini, and Ekaterina Shutova.

The model uses a UnigramLM tokenizer with 200k vocabulary trained on byte-uniformly sampled multilingual IPA text — the configuration identified as optimal for IPA-mode tokenization.
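
As a rough illustration of what byte-uniform sampling could look like, the sketch below draws lines from each language until every language has contributed about the same number of UTF-8 bytes. This is an assumed reading of the term, not the paper's procedure; the function names `sample_byte_uniform` and `utf8_len`, the budget, and the toy corpora are all hypothetical.

```python
import random


def utf8_len(line):
    """Size of a line in UTF-8 bytes (IPA symbols often take 2 bytes each)."""
    return len(line.encode("utf-8"))


def sample_byte_uniform(corpora, byte_budget, seed=42):
    """Pick lines per language so each language stays within the same byte budget.

    corpora: dict mapping a language tag to a list of IPA lines (hypothetical layout).
    """
    rng = random.Random(seed)
    sample = {}
    for lang, lines in corpora.items():
        order = list(lines)
        rng.shuffle(order)
        picked, used = [], 0
        for line in order:
            size = utf8_len(line)
            if used + size > byte_budget:
                continue  # skip lines that would overshoot the budget
            picked.append(line)
            used += size
        sample[lang] = picked
    return sample


# Toy corpora: the same byte budget yields different line counts per language.
toy = {"en": ["stænfɹ̩d"] * 50, "ru": ["mɐskva"] * 50}
balanced = sample_byte_uniform(toy, byte_budget=100)
totals = {lang: sum(utf8_len(l) for l in lines) for lang, lines in balanced.items()}
```

Balancing by bytes rather than by lines or characters keeps languages whose IPA transcriptions use more multi-byte symbols from being over-represented in the tokenizer training data.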

Note: This model operates on IPA text, not raw orthographic text. Input must first be converted to IPA using a G2P tool (see Usage below).

Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 Small (decoder-only Transformer) |
| Layers / Heads / Hidden dim | 12 / 12 / 768 |
| Context length | 2048 tokens |
| Vocabulary size | 200,000 |
| Total parameters | ~240M |
| Tokenizer algorithm | UnigramLM |
| Sampling strategy | Byte-uniform |
| Training modality | IPA (phonetic) |
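
The parameter total can be sanity-checked from the dimensions above. The sketch below assumes the standard Hugging Face GPT-2 layout (learned position embeddings, a 4x MLP expansion, biases on all linear layers, and input/output embeddings that are tied and therefore counted once); these details are assumptions, not statements from the model card.

```python
# Dimensions taken from the model details above.
VOCAB, CONTEXT, HIDDEN, LAYERS = 200_000, 2048, 768, 12


def gpt2_param_count(vocab, context, hidden, layers):
    """Count parameters for an assumed standard GPT-2 layout with tied embeddings."""
    token_emb = vocab * hidden
    pos_emb = context * hidden
    layer_norm = 2 * hidden                              # scale + bias
    attn_qkv = hidden * 3 * hidden + 3 * hidden          # fused Q, K, V projection
    attn_out = hidden * hidden + hidden
    mlp_in = hidden * 4 * hidden + 4 * hidden
    mlp_out = 4 * hidden * hidden + hidden
    per_layer = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out
    final_norm = layer_norm
    return token_emb + pos_emb + layers * per_layer + final_norm


total = gpt2_param_count(VOCAB, CONTEXT, HIDDEN, LAYERS)
embedding_share = (VOCAB * HIDDEN) / total
print(f"{total:,} parameters, {embedding_share:.0%} in the token embedding")
```

Under these assumptions the count comes to roughly 240M, with the 200k-entry token embedding accounting for well over half of it.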

Training

Data: 18 languages from CulturaX: Arabic, German, English, Persian, Finnish, French, Hindi, Italian, Japanese, Korean, Lao, Russian, Serbian, Swahili, Thai, Turkish, Urdu, Chinese.

Text was first converted to IPA using Epitran.
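
A minimal sketch of an Epitran call is shown below. The `EPITRAN_CODES` mapping and the `to_ipa` helper are hypothetical names introduced for illustration, and only a handful of languages are listed; the exact preprocessing used for this model is the one in the ipa-tokenization repository, which may differ. Some languages need extra resources according to the Epitran documentation, so treat this as a starting point only.

```python
# Hypothetical mapping from short language tags to Epitran language-script codes.
EPITRAN_CODES = {
    "de": "deu-Latn",
    "fr": "fra-Latn",
    "tr": "tur-Latn",
    "ru": "rus-Cyrl",
    "hi": "hin-Deva",
}


def to_ipa(text, lang):
    """Transliterate orthographic text to IPA with Epitran (requires `pip install epitran`)."""
    code = EPITRAN_CODES[lang]   # raises KeyError for languages not listed above
    import epitran               # imported lazily so the mapping can be used without it
    epi = epitran.Epitran(code)
    return epi.transliterate(text)


# Example call (needs Epitran installed):
# ipa = to_ipa("Guten Morgen", "de")
```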

Hyperparameters:

| Parameter | Value |
|---|---|
| Training steps | 10,000 |
| Effective batch size | 64 |
| Learning rate | 5 × 10⁻⁴ |
| LR schedule | Cosine with warmup |
| Warmup ratio | 0.03 |
| Weight decay | 0.1 |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product) |
| Seed | 42 |
| Hardware | NVIDIA H100 (Snellius HPC) |
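
To make the schedule concrete, the sketch below computes the learning rate at a given step from the values in the table. It assumes linear warmup from zero followed by cosine decay to zero, which is a common default; the table does not state the warmup shape or the final learning rate, so both are assumptions, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 5e-4
TOTAL_STEPS = 10_000
WARMUP_STEPS = round(0.03 * TOTAL_STEPS)  # warmup ratio 0.03 gives 300 steps


def lr_at(step):
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the peak of 5 × 10⁻⁴ is reached at step 300, and the rate is back at half the peak midway through the decay phase.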

Usage

This model requires IPA text as input. To use it, first convert your text to IPA using the pipeline described in the ipa-tokenization repository, then tokenize it with the included SentencePiece model.

from transformers import GPT2LMHeadModel
from huggingface_hub import hf_hub_download
import sentencepiece as spm
import torch

# Load model
model = GPT2LMHeadModel.from_pretrained("Mikki99/gpt2-ipa-opt")
model.eval()

# Load SentencePiece tokenizer
tok_path = hf_hub_download(repo_id="Mikki99/gpt2-ipa-opt", filename="tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)

# Input must be IPA text (convert using G2P tools first)
# Example: Text (en) "Stanford" → IPA "stænfɹ̩d"
ipa_text = "stænfɹ̩d"
input_ids = torch.tensor([sp.Encode(ipa_text, out_type=int)])

with torch.no_grad():
    output = model(input_ids)
    logits = output.logits

Intended Use

This model is a research artifact released alongside an ACL 2026 paper. Its primary purpose is to allow reproduction of the paper's results. It is not optimized for open-ended text generation.

Citation

@inproceedings{miletic-etal-2026-phonemes,
    title = "Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet",
    author = "Mileti{\'c}, Milan  and
      Kallini, Julie  and
      Shutova, Ekaterina",
    editor = "Liakata, Maria  and
      Moreira, Viviane P.  and
      Zhang, Jiajun  and
      Jurgens, David",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.1872/",
    pages = "40323--40349",
    ISBN = "979-8-89176-390-6",
    abstract =
 "Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA provides a compact symbol inventory, greater cross-lingual character overlap, and a more balanced byte-per-character distribution across languages. We train matched pairs of text vs. IPA subword tokenizers across 24 languages and 14 scripts and demonstrate that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts."
}