# Model Card for Qwen-Tokenizer-GA
## Cite
@software{mcinerney2025qwentkn,
author = {Joseph McInerney},
title = {{Qwen-Tokenizer-GA}},
year = {2025}
}
## Monolingual Qwen tokenizer trained on Irish language data
- Reduces the number of tokens by roughly 50% (399 → 200 tokens on the test set).
- Keeps whole Irish words as single tokens far more often than the original Qwen tokenizer.
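
The ~50% figure can be reproduced from the two counts quoted above. The sketch below is only the arithmetic; the function name `percent_reduction` is illustrative and not part of this repository.

```python
def percent_reduction(tokens_before: int, tokens_after: int) -> float:
    """Relative drop in token count, as a percentage of the original count."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100.0 * (tokens_before - tokens_after) / tokens_before


# Test-set counts quoted in the list above: 399 tokens before, 200 after.
print(f"{percent_reduction(399, 200):.1f}%")  # prints "49.9%"
```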
## Example:
`cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an ceartas a imghabháil`
**Translation:**
`the state shall take into account the fact that the person concerned made an attempt to evade justice`
## Comparison
| **Text** | **Qwen (Before Training)** | **Qwen-GA (After Training)** |
|-----------------|-----------------------------------------------|------------------------------|
| cuirfidh | cu ir fid h | cuirfidh |
| an | an | an |
| Stát | St át | Stát |
| sin | sin | sin |
| san | san | san |
| áireamh | á ire am h | áireamh |
| an | an | an |
| fíoras | f í oras | fío ras |
| go | go | go |
| ndearna | nd ear na | ndearna |
| an | an | an |
| duine | du ine | duine |
| lena | len a | lena |
| mbaineann | mb aine ann | mbaineann |
| iarracht | i arr acht | iarracht |
| an | an | an |
| ceartas | ce art as | c eartas |
| a | a | a |
| imghabháil | im gh abh á il | imghabháil |
| **Total Tokens** | **42 tokens** | **21 tokens** |
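
A comparison like the one above could be produced along the following lines. This is a minimal sketch, not the script used for this card. It assumes both tokenizers are loadable with `transformers.AutoTokenizer`, and the two repository IDs are left as parameters because the card does not state them. Note that byte-level BPE tokenizers return byte-encoded pieces from `tokenize()`, so raw output may look different from the readable pieces shown in the table, and tokenizing word by word can differ from tokenizing the full sentence.

```python
from typing import Callable, List, Tuple

Tokenize = Callable[[str], List[str]]
Row = Tuple[str, List[str], List[str]]


def compare_per_word(text: str, before: Tokenize, after: Tokenize) -> List[Row]:
    """Tokenize each whitespace-separated word with both tokenizers."""
    return [(word, before(word), after(word)) for word in text.split()]


def total_tokens(rows: List[Row]) -> Tuple[int, int]:
    """Sum the token counts of the 'before' and 'after' columns."""
    return (sum(len(b) for _, b, _ in rows), sum(len(a) for _, _, a in rows))


def load_hf_tokenizers(base_id: str, ga_id: str) -> Tuple[Tokenize, Tokenize]:
    """Assumption: `transformers` is installed and both repo IDs exist on the Hub."""
    from transformers import AutoTokenizer  # imported lazily; only needed here

    base = AutoTokenizer.from_pretrained(base_id)
    ga = AutoTokenizer.from_pretrained(ga_id)
    return base.tokenize, ga.tokenize
```

With real tokenizers, `compare_per_word(sentence, *load_hf_tokenizers(base_id, ga_id))` yields one row per word, and `total_tokens` gives the figures for the final table row.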
## Issues
- Morphological mutations are not modelled.
- Some words are still split incorrectly, e.g. `ceartas` → `["c", "eartas"]`.
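
Splits like this can be flagged automatically by listing every word that still maps to more than one piece. The sketch below hard-codes the "After Training" column of the comparison table as its input; the helper name `split_words` is illustrative only.

```python
from typing import Dict, List

# "After Training" segmentations copied from the comparison table.
GA_SEGMENTATION: Dict[str, List[str]] = {
    "cuirfidh": ["cuirfidh"], "an": ["an"], "Stát": ["Stát"], "sin": ["sin"],
    "san": ["san"], "áireamh": ["áireamh"], "fíoras": ["fío", "ras"],
    "go": ["go"], "ndearna": ["ndearna"], "duine": ["duine"], "lena": ["lena"],
    "mbaineann": ["mbaineann"], "iarracht": ["iarracht"],
    "ceartas": ["c", "eartas"], "a": ["a"], "imghabháil": ["imghabháil"],
}


def split_words(segmentation: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return only the words that were broken into more than one token."""
    return {w: pieces for w, pieces in segmentation.items() if len(pieces) > 1}


for word, pieces in split_words(GA_SEGMENTATION).items():
    print(word, "->", pieces)
```

On the example sentence this reports `fíoras` and `ceartas`, the only two words in the table that the retrained tokenizer does not keep whole.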