arxiv:2602.06998

Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

Published on Jan 28

Authors:

Abstract

Syllable-based tokenization using Indonesian script principles improves linguistic structure preservation and token sequence similarity for Austronesian languages compared to traditional subword methods.

AI-generated summary

Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English. Syllable-based tokenization consistently produced higher token sequence similarity scores, with an average increase of approximately 21% compared to GPT-2. These findings confirm that the syllable-based approach more effectively preserves phonological and morphological patterns across related Austronesian languages, offering a linguistically principled foundation for multilingual LLM development.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.06998 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.06998 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.06998 in a Space README.md to link it from this page.