# Model Card for Qwen-Tokenizer-GA
## Cite

    @software{mcinerney2025qwentkn,
      author       = {Joseph McInerney},
      title        = {{Qwen-Tokenizer-GA}},
      year         = {2025}
    }

## Monolingual Qwen tokenizer trained on Irish language data  
- Reduces the token count by roughly 50% (399 → 200 tokens on the test set).
- Keeps far more whole words as single tokens instead of splitting them into fragments.
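
The reduction figure follows directly from the two token counts. A minimal sketch of the arithmetic, where `token_reduction` is a hypothetical helper name and not part of this repository:

```python
def token_reduction(before: int, after: int) -> float:
    """Return the fractional reduction in token count."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Test-set counts quoted in this card: 399 tokens before, 200 after.
print(f"{token_reduction(399, 200):.1%}")  # prints 49.9%
```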

## Example
`cuirfidh an Stát sin san áireamh an fíoras go ndearna an duine lena mbaineann iarracht an cheartas a imghabháil`

**Translation:**  
`the state shall take into account the fact that the person concerned made an attempt to evade justice`
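
To inspect how a tokenizer splits this sentence, something like the following sketch may help. It assumes the Hugging Face `transformers` library; `readable_tokens` is a hypothetical helper, and the repository IDs in the commented usage are placeholders because this card does not state them.

```python
from typing import List


def readable_tokens(tokenizer, text: str) -> List[str]:
    """Encode `text`, then decode each token ID on its own for display.

    `tokenizer` is any object with Hugging Face style `encode` and `decode`
    methods. Decoding one ID at a time may show a replacement character when
    a token ends partway through a multi-byte character.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode([token_id]) for token_id in ids]


# Usage (assumption: both repository IDs are placeholders to fill in):
#
#   from transformers import AutoTokenizer
#   sentence = "cuirfidh an Stát sin san áireamh an fíoras"
#   base = AutoTokenizer.from_pretrained("<base-qwen-repo>")
#   irish = AutoTokenizer.from_pretrained("<qwen-ga-repo>")
#   print(readable_tokens(base, sentence))
#   print(readable_tokens(irish, sentence))
```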

## Comparison

| **Text**        | **Qwen (Before Training)**                   | **Qwen-GA (After Training)** |
|-----------------|-----------------------------------------------|------------------------------|
| cuirfidh        | cu ir fid h                                  | cuirfidh                    |
| an              | an                                            | an                          |
| Stát            | St át                                         | Stát                        |
| sin             | sin                                           | sin                         |
| san             | san                                           | san                         |
| áireamh         | á ire am h                                    | áireamh                     |
| an              | an                                            | an                          |
| fíoras          | f í oras                                      | fío ras                     |
| go              | go                                            | go                          |
| ndearna         | nd ear na                                     | ndearna                     |
| an              | an                                            | an                          |
| duine           | du ine                                        | duine                       |
| lena            | len a                                         | lena                        |
| mbaineann       | mb aine ann                                   | mbaineann                   |
| iarracht        | i arr acht                                    | iarracht                    |
| an              | an                                            | an                          |
| ceartas         | ce art as                                     | c eartas                    |
| a               | a                                             | a                           |
| imghabháil      | im gh abh á il                                | imghabháil                  |
| **Total Tokens** | **42 tokens**                                | **21 tokens**               |
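
The totals in the table can be recomputed from the segmentations themselves. The sketch below copies the rows from the table and counts, for each column, the total number of tokens and the number of words kept as a single token. The names `ROWS` and `summarize` are illustrative only.

```python
from typing import Tuple

# (word, segmentation before training, segmentation after training),
# copied from the comparison table; tokens are separated by spaces.
ROWS = [
    ("cuirfidh", "cu ir fid h", "cuirfidh"),
    ("an", "an", "an"),
    ("Stát", "St át", "Stát"),
    ("sin", "sin", "sin"),
    ("san", "san", "san"),
    ("áireamh", "á ire am h", "áireamh"),
    ("an", "an", "an"),
    ("fíoras", "f í oras", "fío ras"),
    ("go", "go", "go"),
    ("ndearna", "nd ear na", "ndearna"),
    ("an", "an", "an"),
    ("duine", "du ine", "duine"),
    ("lena", "len a", "lena"),
    ("mbaineann", "mb aine ann", "mbaineann"),
    ("iarracht", "i arr acht", "iarracht"),
    ("an", "an", "an"),
    ("ceartas", "ce art as", "c eartas"),
    ("a", "a", "a"),
    ("imghabháil", "im gh abh á il", "imghabháil"),
]


def summarize(column: int) -> Tuple[int, int]:
    """Return (total tokens, words kept as one token) for a table column."""
    pieces = [row[column].split() for row in ROWS]
    total = sum(len(p) for p in pieces)
    whole = sum(1 for p in pieces if len(p) == 1)
    return total, whole


print(len(ROWS))     # 19 words
print(summarize(1))  # (42, 8)  before training
print(summarize(2))  # (21, 17) after training
```

On this sentence, 8 of 19 words are single tokens before training and 17 of 19 after; the two remaining splits are `fíoras` and `ceartas`.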

## Issues
- Morphological mutations (e.g. lenition and eclipsis) are not modelled.
- Some words are still split incorrectly, e.g. `ceartas` → `c`, `eartas`.