---
language:
  - am
  - ti
license: mit
tags:
  - tokenizer
  - byte-pair-encoding
  - bpe
  - geez-script
  - amharic
  - tigrinya
  - low-resource
  - nlp
  - morphology-aware
  - horn-of-africa
datasets:
  - HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
  - text: "!"
model-index:
  - name: Geez BPE Tokenizer
    results: []
---

# Geez Tokenizer (`Hailay/geez-tokenizer`)

A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It was trained on monolingual corpora and is designed for morphologically rich, low-resource languages.

## 🧠 Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:

- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
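Over-segmentation can be quantified as token *fertility*: the average number of subword tokens emitted per whitespace-delimited word (lower is generally better for morphologically rich text). A minimal sketch, using illustrative stand-in tokenizers rather than this repository's code:

```python
def fertility(tokenize, text):
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

# Toy comparison: a character-level "tokenizer" vs. a word-level one.
char_level = lambda w: list(w)        # heavily over-segments
word_level = lambda w: [w]            # one token per word

sample = "ሰላም ለዓለም"                  # Amharic: "peace to the world"
print(fertility(char_level, sample))  # 3.5 tokens per word
print(fertility(word_level, sample))  # 1.0 tokens per word
```

A well-fitted Geez-script BPE vocabulary should land much closer to the word-level end of this range than a Latin-centric one.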

## 📚 Training Details

- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
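The configuration above maps directly onto the Hugging Face `tokenizers` library. The snippet below is a sketch of the implied training recipe, not the exact script used for this model; the two-line corpus is a stand-in for the real monolingual Tigrinya/Amharic data:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the normalization pipeline listed above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Stand-in corpus (the released tokenizer was trained on monolingual corpora)
corpus = ["ሰላም ለዓለም", "ሰላም"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Post-processing template for single sentences and sentence pairs
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

enc = tokenizer.encode("ሰላም")
print(enc.tokens)  # wrapped in [CLS] ... [SEP]
```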

## 📁 Files

- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Maps for special tokens
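`tokenizer.json` bundles the full pipeline, so it can also be loaded directly with the `tokenizers` library. A minimal round-trip sketch (the trivial tokenizer and temporary path are illustrative only, not this repository's files):

```python
import os
import tempfile

from tokenizers import Tokenizer, models

# Build a trivial BPE tokenizer and round-trip it through a
# tokenizer.json file, mirroring how the published file is loaded.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)

reloaded = Tokenizer.from_file(path)
print(type(reloaded).__name__)  # Tokenizer
```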

## 🚀 Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
# Amharic: "Egyptian archaeologists have uncovered the large tomb found in the Sakura necropolis."
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```

## 📊 Intended Use

This tokenizer is best suited for:

- Low-resource NLP pipelines
- Machine translation
- Question answering
- Named entity recognition
- Morphological analysis



## ✋ Limitations

- Optimized for Geez-script languages; it may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- Currently covers only Amharic and Tigrinya; it does not support multilingual code-switching.


## ✅ Evaluation

The tokenizer was evaluated manually on:

- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics will be published in an accompanying paper.
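Of these, token coverage is straightforward to check mechanically: count how many produced tokens map to in-vocabulary entries rather than `[UNK]`. A minimal sketch with a toy vocabulary (illustrative only, not the real 32k vocabulary):

```python
def token_coverage(tokens, vocab, unk="[UNK]"):
    """Fraction of tokens that are in-vocabulary (i.e., not mapped to unk)."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocab and t != unk)
    return known / len(tokens)

# Toy vocabulary and a tokenized sample (illustrative only)
vocab = {"ሰላም", "ለ", "ዓለም", "[UNK]"}
tokens = ["ሰላም", "ለ", "ዓለም", "መቃብር"]  # last token is out-of-vocabulary
print(token_coverage(tokens, vocab))    # 0.75
```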

## 📜 License

This tokenizer is released under the MIT License.
## 📌 Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```