Create README.md

README.md +115 -0

	@@ -0,0 +1,115 @@

+---
+language:
+  - am
+  - ti
+license: mit
+tags:
+  - tokenizer
+  - byte-pair-encoding
+  - bpe
+  - geez-script
+  - amharic
+  - tigrinya
+  - low-resource
+  - nlp
+  - morphology-aware
+  - Horn of Africa
+datasets:
+  - HornMT
+library_name: transformers
+pipeline_tag: token-classification
+widget:
+  - text: "!"
+model-index:
+  - name: Geez BPE Tokenizer
+    results: []
+---
+# Geez Tokenizer (`Hailay/geez-tokenizer`)
+A **BPE tokenizer** specifically trained for **Geez-script languages**, including **Tigrinya** and **Amharic**. The tokenizer is trained on monolingual corpora derived from the [HornMT](https://github.com/HornMT) project and supports morphologically rich low-resource languages.
+## 🧠 Motivation
+Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
+- Reduce over-segmentation errors
+- Respect morpheme boundaries
+- Improve language understanding for downstream tasks like Machine Translation and QA
+## 📚 Training Details
+- **Tokenizer Type**: BPE
+- **Vocabulary Size**: 32,000
+- **Pre-tokenizer**: Whitespace
+- **Normalizer**: NFD → Lowercase → StripAccents
+- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
+- **Post-processing**: Template `[CLS] $A [SEP]` for single sequences and `[CLS] $A [SEP] $B [SEP]` for sequence pairs
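As a rough illustration of the normalizer chain and post-processing template listed above, the sketch below mimics both steps with only the Python standard library. It is not the tokenizer's actual implementation: the function names `normalize` and `apply_template` are hypothetical, and treating StripAccents as "drop every combining mark" is an assumption about the library's behavior. Because Geez script is caseless and its base letters have no canonical decomposition, ordinary Amharic or Tigrinya words should pass through the chain unchanged.

```python
import unicodedata
from typing import List, Optional


def normalize(text: str) -> str:
    """Approximate NFD -> Lowercase -> StripAccents (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Assumption: StripAccents removes combining marks (general category M*).
    return "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("M")
    )


def apply_template(a: List[str], b: Optional[List[str]] = None) -> List[str]:
    """Wrap token lists as [CLS] $A [SEP] or [CLS] $A [SEP] $B [SEP]."""
    framed = ["[CLS]", *a, "[SEP]"]
    if b is not None:
        framed += [*b, "[SEP]"]
    return framed


if __name__ == "__main__":
    print(normalize("Café"))   # Latin text loses case and accents
    print(normalize("ሰላም"))    # Geez-script text is expected to be unchanged
    print(apply_template(["ሰላም"], ["ዓለም"]))
```

The sketch works on token strings for readability; a real encoding carries token IDs, with the special tokens inserted by the tokenizer itself.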
+## 📁 Files
+- `vocab.json`: Vocabulary file
+- `merges.txt`: Merge rules for BPE
+- `tokenizer.json`: Full tokenizer config
+- `tokenizer_config.json`: Hugging Face-compatible configuration
+- `special_tokens_map.json`: Maps for special tokens
+## 🚀 Usage
+```python
+from transformers import PreTrainedTokenizerFast
+
+# Load the tokenizer from the Hugging Face Hub.
+tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
+
+text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
+tokens = tokenizer.tokenize(text)  # subword strings
+ids = tokenizer.encode(text)       # token IDs (special tokens are added by default)
+print("Tokens:", tokens)
+print("Token IDs:", ids)
+```
+## 📊 Intended Use
+This tokenizer is best suited for:
+- Low-resource NLP pipelines
+- Machine Translation
+- Question Answering
+- Named Entity Recognition
+- Morphological analysis
+## ✋ Limitations
+- It is optimized for Geez-script languages and might not generalize to others.
+- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
+- It was trained on monolingual Amharic and Tigrinya text; code-switched (mixed-language) input is not supported.
+## ✅ Evaluation
+The tokenizer was evaluated manually on:
+- Token coverage of Tigrinya/Amharic corpora
+- Morphological preservation
+- Reduction of BPE segmentation errors
+
+Quantitative metrics will be published in an accompanying paper.
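The evaluation above was manual, and no metrics are given here. As a hedged sketch of how coverage and over-segmentation could be quantified, the helpers below compute two common proxies: fertility (average tokens per word) and the share of `[UNK]` tokens. The names `fertility`, `unk_rate`, and `toy_tokenize` are hypothetical, and this is not the procedure the author used.

```python
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[str]]


def fertility(words: Iterable[str], tokenize: Tokenize) -> float:
    """Average number of tokens produced per word (lower means less splitting)."""
    words = list(words)
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)


def unk_rate(words: Iterable[str], tokenize: Tokenize, unk_token: str = "[UNK]") -> float:
    """Fraction of produced tokens that are the unknown token."""
    tokens = [t for w in words for t in tokenize(w)]
    if not tokens:
        return 0.0
    return tokens.count(unk_token) / len(tokens)


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer for illustration: fixed two-character chunks."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


if __name__ == "__main__":
    sample = "የግብፅ አርኪኦሎጂስቶች መቃብር አግኝተዋል".split()
    print("fertility:", fertility(sample, toy_tokenize))
    print("unk rate:", unk_rate(sample, toy_tokenize))
```

To measure the real tokenizer, pass its `tokenize` method in place of `toy_tokenize`.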
+## 📜 License
+This tokenizer is licensed under the MIT License.
+## 📌 Citation
+@misc{hailay2025geez,
+  title={Geʽez Script_Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
+  author={Teklehaymanot, Hailay},
+  year={2025},
+  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
+}