# Model Card for Qwen-Tokenizer-GA

## Cite

    @software{mcinerney2025qwentkn,
      author = {Joseph McInerney},
      title  = {{Qwen-Tokenizer-GA}},
      year   = {2025}
    }

## Monolingual Qwen tokenizer trained on Irish language data

- Provides a ~50% reduction in the number of tokens (399 → 200 on the test set).
- Significantly improves the identification of whole words as single tokens.

## Example:

`cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil`

**Translation:** `the state shall take into account the fact that the person concerned made an attempt to evade justice`

## Comparison

| **Text**         | **Qwen (Before Training)** | **Qwen-GA (After Training)** |
|------------------|----------------------------|------------------------------|
| cuirfidh         | cu ir fid h                | cuirfidh                     |
| an               | an                         | an                           |
| Stát             | St át                      | Stát                         |
| sin              | sin                        | sin                          |
| san              | san                        | san                          |
| áireamh          | á ire am h                 | áireamh                      |
| an               | an                         | an                           |
| fíoras           | f í oras                   | fío ras                      |
| go               | go                         | go                           |
| ndearna          | nd ear na                  | ndearna                      |
| an               | an                         | an                           |
| duine            | du ine                     | duine                        |
| lena             | len a                      | lena                         |
| mbaineann        | mb aine ann                | mbaineann                    |
| iarracht         | i arr acht                 | iarracht                     |
| an               | an                         | an                           |
| ceartas          | ce art as                  | c eartas                     |
| a                | a                          | a                            |
| imghabháil       | im gh abh á il             | imghabháil                   |
| **Total Tokens** | **42 tokens**              | **21 tokens**                |

## Issues

- Morphological mutations are not modelled.
- Some segmentation errors remain, e.g. 'ceartas' -> ["c", "eartas"].
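
## Checking the token counts

The totals in the comparison table can be recomputed directly from the segmentations it lists. The sketch below is only an illustration: it assumes that the spaces inside each table cell mark token boundaries, and the names `ROWS`, `count_tokens`, `totals`, `whole_words` and `reduction` are hypothetical helpers, not part of the tokenizer or of any library.

```python
# Segmentations copied from the comparison table: (word, before, after).
# Assumption: spaces inside a cell separate tokens.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def count_tokens(segmentation: str) -> int:
    """Count the whitespace-separated pieces in one table cell."""
    return len(segmentation.split())


def totals(rows):
    """Return (tokens before training, tokens after training)."""
    before = sum(count_tokens(b) for _, b, _ in rows)
    after = sum(count_tokens(a) for _, _, a in rows)
    return before, after


def whole_words(rows):
    """Return how many words stay a single token (before, after)."""
    before = sum(count_tokens(b) == 1 for _, b, _ in rows)
    after = sum(count_tokens(a) == 1 for _, _, a in rows)
    return before, after


def reduction(before: int, after: int) -> float:
    """Fractional reduction in token count."""
    return 1 - after / before


before, after = totals(ROWS)
print(before, after, f"{reduction(before, after):.0%}")  # 42 21 50%
print(whole_words(ROWS))  # (8, 17) of 19 words kept whole
```

Applied to the test-set figures quoted above, `reduction(399, 200)` gives roughly 0.499, which matches the stated ~50% reduction.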