| # Model Card for Qwen-Tokenizer-GA | |
| ## Cite | |
| @software{mcinerney2025qwentkn, | |
| author = {Joseph McInerney}, | |
| title = {{Qwen-Tokenizer-GA}}, | |
| year = {2025}} | |
| ## Monolingual Qwen tokenizer trained on Irish-language data | |
| - Reduces the token count by roughly 50% on the test set (399 → 200 tokens). | |
| - Represents many more whole words as single tokens instead of splitting them into fragments. | |
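
The reduction quoted above can be checked with a few lines of arithmetic. This is only a minimal sketch: the function name `token_reduction` is illustrative and not part of this repository, and the counts are the ones reported in this card.

```python
def token_reduction(before: int, after: int) -> float:
    """Return the percentage reduction in token count from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return 100.0 * (before - after) / before


# Counts reported in this card: test set, and the single example sentence.
test_set_reduction = token_reduction(399, 200)
example_reduction = token_reduction(42, 21)

print(f"Test set: {test_set_reduction:.1f}% fewer tokens")
print(f"Example sentence: {example_reduction:.1f}% fewer tokens")
```
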
| ## Example: | |
| `cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil` | |
| **Translation:** | |
| `the state shall take into account the fact that the person concerned made an attempt to evade justice` | |
| ## Comparison | |
| | **Text** | **Qwen (Before Training)** | **Qwen-GA (After Training)** | | |
| |-----------------|-----------------------------------------------|------------------------------| | |
| | cuirfidh | cu ir fid h | cuirfidh | | |
| | an | an | an | | |
| | Stát | St át | Stát | | |
| | sin | sin | sin | | |
| | san | san | san | | |
| | áireamh | á ire am h | áireamh | | |
| | an | an | an | | |
| | fíoras | f í oras | fío ras | | |
| | go | go | go | | |
| | ndearna | nd ear na | ndearna | | |
| | an | an | an | | |
| | duine | du ine | duine | | |
| | lena | len a | lena | | |
| | mbaineann | mb aine ann | mbaineann | | |
| | iarracht | i arr acht | iarracht | | |
| | an | an | an | | |
| | ceartas | ce art as | c eartas | | |
| | a | a | a | | |
| | imghabháil | im gh abh á il | imghabháil | | |
| **Total Tokens** | **42 tokens** | **21 tokens** | |
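
The totals in the table above can be recomputed from the segmentations themselves. The sketch below assumes that each space in a table cell marks a token boundary; `ROWS` and `summarise` are illustrative names, and the data is copied from the table rather than produced by running either tokenizer.

```python
# (word, tokens before training, tokens after training), copied from the table.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def summarise(segmentations):
    """Return (total tokens, number of words kept as a single token)."""
    pieces = [seg.split() for seg in segmentations]
    total = sum(len(p) for p in pieces)
    whole = sum(1 for p in pieces if len(p) == 1)
    return total, whole


before_total, before_whole = summarise(row[1] for row in ROWS)
after_total, after_whole = summarise(row[2] for row in ROWS)

print(f"Before: {before_total} tokens, {before_whole}/{len(ROWS)} words intact")
print(f"After:  {after_total} tokens, {after_whole}/{len(ROWS)} words intact")
```

On this sentence the count of words kept as a single token rises from 8 to 17 out of 19, which is the kind of improvement the second bullet under the tokenizer description refers to.
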
| ## Issues | |
| - Morphological mutations are not modelled. | |
| - Some words are still split incorrectly, e.g. 'ceartas' → ["c", "eartas"]. | |