
Model Card for Qwen-Tokenizer-GA

Cite

@software{mcinerney2025qwentkn,
  author = {Joseph McInerney},
  title = {{Qwen-Tokenizer-GA}},
  year = {2025}
}

Monolingual Qwen tokenizer trained on Irish language data

Reduces the token count by about 50% (399 → 200 tokens on the test set).
Keeps substantially more Irish words as single tokens instead of splitting them into sub-word fragments.
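The reduction figure above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the author's evaluation script: the helper names (`count_tokens`, `token_reduction`, `load_tokenize_fn`) are hypothetical, and it assumes both tokenizers can be loaded through `transformers.AutoTokenizer`. The repository IDs for the base Qwen tokenizer and for this tokenizer are left as parameters because the card does not state them.

```python
from typing import Callable, Iterable, List


def count_tokens(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> int:
    """Total number of tokens produced by `tokenize` over all texts."""
    return sum(len(tokenize(text)) for text in texts)


def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be a positive token count")
    return (before - after) / before


def load_tokenize_fn(repo_id: str) -> Callable[[str], List[str]]:
    """Return the `tokenize` method of a Hub tokenizer.

    Assumption: `repo_id` points at a repository that AutoTokenizer can load.
    The import is local so the arithmetic above works without transformers.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id).tokenize


# Figures reported in this card: 399 tokens before, 200 after.
print(f"{token_reduction(399, 200):.1%}")  # prints 49.9%
```

To rerun the comparison on other text, pass `load_tokenize_fn(...)` for each tokenizer to `count_tokens` along with the same list of sentences, then feed the two totals to `token_reduction`.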

Example:

cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil

Translation:
the state shall take into account the fact that the person concerned made an attempt to evade justice

Comparison

Text	Qwen (Before Training)	Qwen-GA (After Training)
cuirfidh	cu ir fid h	cuirfidh
an	an	an
Stát	St át	Stát
sin	sin	sin
san	san	san
áireamh	á ire am h	áireamh
an	an	an
fíoras	f í oras	fío ras
go	go	go
ndearna	nd ear na	ndearna
an	an	an
duine	du ine	duine
lena	len a	lena
mbaineann	mb aine ann	mbaineann
iarracht	i arr acht	iarracht
an	an	an
ceartas	ce art as	c eartas
a	a	a
imghabháil	im gh abh á il	imghabháil
Total Tokens	42 tokens	21 tokens
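
The totals in the table can be checked directly from its rows. The sketch below copies the segmentations as listed above (tokens separated by spaces) and tallies the token counts and the number of words kept as a single token; the variable and function names are illustrative only.

```python
# Each row: (word, segmentation before training, segmentation after training),
# copied from the comparison table; tokens are separated by spaces.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def tally(column: int) -> tuple:
    """Return (total tokens, words kept as one token) for a table column."""
    segmentations = [row[column].split() for row in ROWS]
    total = sum(len(tokens) for tokens in segmentations)
    whole = sum(1 for tokens in segmentations if len(tokens) == 1)
    return total, whole


before_total, before_whole = tally(1)
after_total, after_whole = tally(2)
print(f"before: {before_total} tokens, {before_whole}/{len(ROWS)} words whole")
print(f"after: {after_total} tokens, {after_whole}/{len(ROWS)} words whole")
# before: 42 tokens, 8/19 words whole
# after: 21 tokens, 17/19 words whole
```

On this sentence the retrained tokenizer keeps 17 of 19 words whole, against 8 of 19 for the original; the two words still split are 'fíoras' and 'ceartas'.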

Issues

Morphological mutations are not modelled.
Some segmentation errors remain, e.g. 'ceartas' -> ["c", "eartas"].