Model Card for Qwen-Tokenizer-GA
Cite
@software{mcinerney2025qwentkn,
author = {Joseph McInerney},
title = {{Qwen-Tokenizer-GA}},
year = {2025}}
Monolingual Qwen tokenizer trained on Irish language data
- Reduces the number of tokens by about 50% (399 → 200 on the test set).
- Keeps far more whole words as single tokens (17 of the 19 words in the example below, versus 8 of 19 before training).
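
The reduction figure above is simple arithmetic over total token counts. The sketch below is illustrative only: `count_tokens` and `token_reduction` are hypothetical helper names, and the commented-out usage assumes the Hugging Face `transformers` library, a base checkpoint such as `Qwen/Qwen2.5-0.5B`, and a placeholder repo ID for this tokenizer, none of which are stated in this card.

```python
from typing import Callable, Iterable, List


def count_tokens(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> int:
    """Total number of tokens produced by `tokenize` over all texts."""
    return sum(len(tokenize(text)) for text in texts)


def token_reduction(before: int, after: int) -> float:
    """Fractional reduction in token count (0.5 means half as many tokens)."""
    if before <= 0:
        raise ValueError("`before` must be a positive token count")
    return (before - after) / before


# Figures reported in this card for the test set.
print(f"{token_reduction(399, 200):.1%}")  # 49.9%

# Assumed usage with real tokenizers (not run here; both IDs are assumptions):
# from transformers import AutoTokenizer
# base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")       # assumed base model
# ga = AutoTokenizer.from_pretrained("<user>/Qwen-Tokenizer-GA")  # hypothetical repo ID
# before = count_tokens(base.tokenize, test_sentences)
# after = count_tokens(ga.tokenize, test_sentences)
```

Passing the tokenizer as a plain callable keeps the helper independent of any particular library, so the same function works with either tokenizer's `tokenize` method.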
Example:
cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil
Translation: The state shall take into account the fact that the person concerned made an attempt to evade justice.
Comparison
| Text | Qwen (Before Training) | Qwen-GA (After Training) |
|---|---|---|
| cuirfidh | cu ir fid h | cuirfidh |
| an | an | an |
| Stát | St át | Stát |
| sin | sin | sin |
| san | san | san |
| áireamh | á ire am h | áireamh |
| an | an | an |
| fíoras | f í oras | fío ras |
| go | go | go |
| ndearna | nd ear na | ndearna |
| an | an | an |
| duine | du ine | duine |
| lena | len a | lena |
| mbaineann | mb aine ann | mbaineann |
| iarracht | i arr acht | iarracht |
| an | an | an |
| ceartas | ce art as | c eartas |
| a | a | a |
| imghabháil | im gh abh á il | imghabháil |
| Total Tokens | 42 tokens | 21 tokens |
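
The totals in the table can be recomputed by counting the whitespace-separated pieces in each cell. This is a minimal sketch that assumes the rows are transcribed by hand as shown; the `ROWS` name and layout are illustrative and not part of the released tokenizer.

```python
# (word, pieces before training, pieces after training), copied from the table.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]

# Each cell lists its pieces separated by spaces, so split() counts tokens.
before_total = sum(len(before.split()) for _, before, _ in ROWS)
after_total = sum(len(after.split()) for _, _, after in ROWS)

# Words that the trained tokenizer still breaks into more than one piece.
still_split = [word for word, _, after in ROWS if len(after.split()) > 1]

print(before_total, after_total)  # 42 21
print(still_split)                # ['fíoras', 'ceartas']
```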
Issues
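- Morphological mutations (for example, Irish initial mutations such as lenition and eclipsis) are not modelled.
- Some words are still split incorrectly, e.g. 'ceartas' -> ["c", "eartas"].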