---
license: mit
language:
- arb
tags:
- tokenizer
- bpe
- flexitok
- fineweb2
---
# Byte-Level BPE Tokenizer: arb_Arab (16K)

A **Byte-Level BPE** tokenizer trained on **arb_Arab** data from Fineweb-2-HQ.
## Training Details

| Parameter | Value |
|-----------|-------|
| Algorithm | Byte-Level BPE |
| Language | `arb_Arab` |
| Target Vocab Size | 16,000 |
| Final Vocab Size | 0 |
| Pre-tokenizer | ByteLevel |
| Normalizer | NFC |
| Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
| Training Shards | 2 |
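The configuration in the table can be reproduced with the `tokenizers` library. The snippet below is a sketch only: it wires up the same pipeline (NFC normalizer, ByteLevel pre-tokenizer, BPE model, the four special tokens) but trains on a two-line toy corpus rather than the actual Fineweb-2-HQ shards.

```python
# Sketch: rebuilding this tokenizer's pipeline with the `tokenizers` library.
# Toy corpus only — the real tokenizer was trained on arb_Arab Fineweb-2-HQ shards.
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=16000,                                   # target vocab size from the table
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],   # assigned the first ids
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed all 256 byte symbols
)
tokenizer.train_from_iterator(["مرحبا بالعالم", "هذا مثال"], trainer=trainer)
```

Because the pre-tokenizer is byte-level, any input string can be encoded without producing `<unk>`, which is why the unknown token is rarely seen in practice.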
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/-bpe_arb_Arab_16000")
tokens = tokenizer.encode("Hello, world!")
```
## Files

- `tokenizer.json` — Full HuggingFace tokenizer
- `vocab.json` — Vocabulary mapping
- `merges.txt` — BPE merge rules