jmcinern/qwen_tokenizer_ga


Model Card for Qwen-Tokenizer-GA

Cite

@software{mcinerney2025qwentkn,

author = {Joseph McInerney},

title = {{Qwen-Tokenizer-GA}},

year = {2025}}

Monolingual Qwen tokenizer trained on Irish language data

Provides a ~50% reduction in the number of tokens (399 → 200 on the test set).
Markedly improves the tokenization of whole words as single tokens.
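
As a quick check of the headline figure, the relative reduction can be computed from the two test-set counts quoted above. This is a minimal sketch; the helper name `token_reduction` is illustrative and not part of this repository.

```python
def token_reduction(before: int, after: int) -> float:
    """Return the percentage drop in token count from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return 100.0 * (before - after) / before


# Test-set counts quoted in this card: 399 tokens before, 200 after.
print(f"{token_reduction(399, 200):.1f}% fewer tokens")
```

For the test-set counts this comes to roughly 49.9%, which matches the "~50%" stated above.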

Example:

cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil

Translation:
the state shall take into account the fact that the person concerned made an attempt to evade justice

Comparison

Text	Qwen (Before Training)	Qwen-GA (After Training)
cuirfidh	cu ir fid h	cuirfidh
an	an	an
Stát	St át	Stát
sin	sin	sin
san	san	san
áireamh	á ire am h	áireamh
an	an	an
fíoras	f í oras	fío ras
go	go	go
ndearna	nd ear na	ndearna
an	an	an
duine	du ine	duine
lena	len a	lena
mbaineann	mb aine ann	mbaineann
iarracht	i arr acht	iarracht
an	an	an
ceartas	ce art as	c eartas
a	a	a
imghabháil	im gh abh á il	imghabháil
Total Tokens	42 tokens	21 tokens
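
A comparison like the one above could be reproduced along the following lines. This is a sketch under assumptions, not the author's script: the function names are hypothetical, it assumes this repository loads through `transformers.AutoTokenizer`, and the card does not say which Qwen checkpoint served as the baseline, so `base_id` must be supplied by the reader. Words are tokenized one at a time, which can differ from tokenizing the full sentence, and byte-level tokenizers may display accented characters in a byte-encoded form.

```python
from typing import Callable, Dict, List, Tuple

# Example sentence from this card.
SENTENCE = (
    "cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine "
    "lena mbaineann iarracht an cheartas a imghabháil"
)

Tokenize = Callable[[str], List[str]]


def compare_tokenizations(
    text: str, tokenizers: Dict[str, Tokenize]
) -> Tuple[List[Tuple[str, Dict[str, List[str]]]], Dict[str, int]]:
    """Tokenize each whitespace-separated word with every tokenizer.

    Returns one row per word (the word and its pieces per tokenizer)
    and the total token count per tokenizer.
    """
    rows = []
    totals = {name: 0 for name in tokenizers}
    for word in text.split():
        pieces = {name: tok(word) for name, tok in tokenizers.items()}
        for name, word_pieces in pieces.items():
            totals[name] += len(word_pieces)
        rows.append((word, pieces))
    return rows, totals


def load_hf_tokenizers(
    base_id: str, ga_id: str = "jmcinern/qwen_tokenizer_ga"
) -> Dict[str, Tokenize]:
    """Load both tokenizers from the Hugging Face Hub (needs network access).

    `base_id` is the baseline Qwen checkpoint; it is an assumption the
    caller has to fill in, since the card does not name it.
    """
    from transformers import AutoTokenizer  # imported lazily: optional dependency

    base = AutoTokenizer.from_pretrained(base_id)
    ga = AutoTokenizer.from_pretrained(ga_id)
    return {"Qwen": base.tokenize, "Qwen-GA": ga.tokenize}
```

Calling `compare_tokenizations(SENTENCE, load_hf_tokenizers(base_id))` with a chosen baseline would return the per-word pieces and the totals for each tokenizer; the exact counts depend on that baseline and on how leading spaces are handled.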

Issues

Morphological mutations are not modelled.
Some words are still split incorrectly, e.g. 'ceartas' → ["c", "eartas"] and 'fíoras' → ["fío", "ras"].
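
To illustrate the first issue: Irish changes the start of a word depending on grammatical context, by lenition (e.g. ceartas → cheartas) or eclipsis (e.g. baineann → mbaineann). A tokenizer that stores each surface form as its own token keeps no explicit link between the mutated and unmutated forms. The following is a rough heuristic sketch of that relationship, not something this tokenizer does; `strip_mutation` is a hypothetical helper, and it ignores vowel-initial mutations (n-, h-, t- prefixes) and may mis-handle loanwords.

```python
# Eclipsis prefixes mapped to the unmutated initial consonant.
ECLIPSIS = {"mb": "b", "gc": "c", "nd": "d", "bhf": "f", "ng": "g", "bp": "p", "dt": "t"}
# Consonants that can be lenited (written with a following "h").
LENITABLE = set("bcdfgmpst")


def strip_mutation(word: str) -> str:
    """Return a lowercase guess at the unmutated form of an Irish word."""
    w = word.lower()
    # Check longer prefixes first so "bhf" is not mistaken for lenited "b".
    for prefix in sorted(ECLIPSIS, key=len, reverse=True):
        if w.startswith(prefix):
            return ECLIPSIS[prefix] + w[len(prefix):]
    # Lenition: drop the "h" that follows a lenitable initial consonant.
    if len(w) > 2 and w[0] in LENITABLE and w[1] == "h":
        return w[0] + w[2:]
    return w


for form in ["cheartas", "mbaineann", "ndearna", "duine"]:
    print(form, "->", strip_mutation(form))
```

A mapping like this could be used to check whether mutated and unmutated forms of the same word end up sharing any tokens.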


Collection including jmcinern/qwen_tokenizer_ga

Qomhrá

Collection of Models, Datasets and Tokenizers used to Develop Qomhrá. Paper: https://aclanthology.org/2026.loreslm-1.18/