Model Card for Qwen-Tokenizer-GA

Cite

@software{mcinerney2025qwentkn,
  author = {Joseph McInerney},
  title = {{Qwen-Tokenizer-GA}},
  year = {2025}
}

Monolingual Qwen tokenizer trained on Irish language data

  • Reduces the token count by roughly 50% (399 → 200 tokens on the test set).
  • Keeps whole Irish words as single tokens far more often than the base Qwen tokenizer.
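
The reduction figure above is simple arithmetic on the two token counts. A minimal sketch follows; the helper name `token_reduction` is hypothetical and not part of this repository:

```python
def token_reduction(before: int, after: int) -> float:
    """Fractional reduction in token count going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Test-set figures quoted above: 399 tokens before, 200 after.
print(f"{token_reduction(399, 200):.1%}")  # 49.9%
```

The same helper applied to the single example sentence below (42 → 21 tokens) gives exactly 50%.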

Example:

cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil

Translation:
the state shall take into account the fact that the person concerned made an attempt to evade justice

Comparison

| Text | Qwen (before training) | Qwen-GA (after training) |
|---|---|---|
| cuirfidh | cu ir fid h | cuirfidh |
| an | an | an |
| Stát | St át | Stát |
| sin | sin | sin |
| san | san | san |
| áireamh | á ire am h | áireamh |
| an | an | an |
| fíoras | f í oras | fío ras |
| go | go | go |
| ndearna | nd ear na | ndearna |
| an | an | an |
| duine | du ine | duine |
| lena | len a | lena |
| mbaineann | mb aine ann | mbaineann |
| iarracht | i arr acht | iarracht |
| an | an | an |
| ceartas | ce art as | c eartas |
| a | a | a |
| imghabháil | im gh abh á il | imghabháil |
| Total tokens | 42 tokens | 21 tokens |
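
The totals in the comparison can be checked by tallying the pieces listed for each word. The sketch below copies the splits from the comparison as displayed, so any whitespace markers the real tokenizers emit are ignored. The names `ROWS` and `tally` are illustrative only.

```python
# (word, pieces before training, pieces after training), copied from the comparison.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def tally(rows, column):
    """Return (total pieces, number of words kept as one piece) for a column."""
    pieces = [row[column].split() for row in rows]
    total = sum(len(p) for p in pieces)
    whole = sum(1 for p in pieces if len(p) == 1)
    return total, whole


before_total, before_whole = tally(ROWS, 1)
after_total, after_whole = tally(ROWS, 2)
print(before_total, after_total)  # 42 21
print(before_whole, after_whole)  # 8 17 (out of 19 words)
```

Counting whole-word tokens alongside the totals makes the second claim concrete: in this sentence, 17 of 19 words survive as single tokens after training, against 8 of 19 before.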

Issues

  • Morphological mutations are not modelled.
  • Some words are still split incorrectly, e.g. 'ceartas' -> ["c", "eartas"].