Model Card for Qwen-Tokenizer-GA

Cite

@software{mcinerney2025qwentkn,
  author = {Joseph McInerney},
  title = {{Qwen-Tokenizer-GA}},
  year = {2025}
}

Monolingual Qwen tokenizer trained on Irish language data

  • Reduces the number of tokens by about 50% on the test set (399 → 200).
  • Keeps far more whole Irish words as single tokens instead of splitting them into sub-word fragments.
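
The reduction figure follows from simple arithmetic on the two test-set counts. A minimal sketch, assuming 399 and 200 are total token counts for the same test text; the function name `token_reduction` is hypothetical and not part of this repository:

```python
def token_reduction(before: int, after: int) -> float:
    """Return the percentage reduction in token count from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return 100.0 * (before - after) / before


# Test-set counts quoted in this card: 399 tokens before, 200 after.
reduction = token_reduction(399, 200)
print(f"{reduction:.1f}% fewer tokens")  # prints "49.9% fewer tokens"
```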

Example:

cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil

Translation:
the state shall take into account the fact that the person concerned made an attempt to evade justice

Comparison

| Text | Qwen (Before Training) | Qwen-GA (After Training) |
|---|---|---|
| cuirfidh | cu ir fid h | cuirfidh |
| an | an | an |
| Stát | St át | Stát |
| sin | sin | sin |
| san | san | san |
| áireamh | á ire am h | áireamh |
| an | an | an |
| fíoras | f í oras | fío ras |
| go | go | go |
| ndearna | nd ear na | ndearna |
| an | an | an |
| duine | du ine | duine |
| lena | len a | lena |
| mbaineann | mb aine ann | mbaineann |
| iarracht | i arr acht | iarracht |
| an | an | an |
| ceartas | ce art as | c eartas |
| a | a | a |
| imghabháil | im gh abh á il | imghabháil |
| Total Tokens | 42 tokens | 21 tokens |
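
The totals in the comparison can be re-derived by counting the pieces in each row. The sketch below hard-codes the segmentations shown above rather than loading either tokenizer, and also reports average tokens per word. The names `ROWS` and `count_tokens` are illustrative only:

```python
# Segmentations copied from the comparison above:
# (word, base Qwen pieces, Qwen-GA pieces), with pieces separated by spaces.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def count_tokens(rows, column):
    """Sum the number of space-separated pieces in the given column."""
    return sum(len(row[column].split()) for row in rows)


words = len(ROWS)
before = count_tokens(ROWS, 1)  # base Qwen tokenizer
after = count_tokens(ROWS, 2)   # Qwen-GA tokenizer

print(f"Qwen:    {before} tokens, {before / words:.2f} per word")
print(f"Qwen-GA: {after} tokens, {after / words:.2f} per word")
```

Counting this way reproduces the 42 and 21 totals, and shows that only two of the 19 words are still split by Qwen-GA.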

Issues

  • Morphological mutations are not modelled.
  • Some words are still split incorrectly, e.g. 'ceartas' -> ["c", "eartas"].