---
license: mit
language:
- und # ISO 639-3 code, or "und" if not identifiable
tags:
- tokenizer
- bpe
- flexitok
- fineweb2
---
# Byte-Level BPE Tokenizer: numeric (0K)

A **Byte-Level BPE** tokenizer trained on **numeric** data from Fineweb-2-HQ.
## Training Details

| Parameter | Value |
|-----------|-------|
| Algorithm | Byte-Level BPE |
| Language | `numeric` |
| Target Vocab Size | 1 |
| Final Vocab Size | 18 |
| Pre-tokenizer | byte_level |
| Number handling | rtl_2digit |
| Contraction handling | False |
| Normalizer | NONE |
| Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
| Training Shards | 1 |
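The `rtl_2digit` setting presumably groups digit runs into two-digit chunks from the right, so that place value stays aligned (e.g. "12345" splits as 1 | 23 | 45 rather than 12 | 34 | 5). The card does not document the exact behavior, so the helper below (`rtl_2digit_chunks` is a hypothetical name, not part of this tokenizer's API) is only a minimal sketch of that assumed splitting rule:

```python
def rtl_2digit_chunks(digits: str) -> list[str]:
    """Split a run of digits into 2-digit groups, scanning right-to-left.

    Hypothetical illustration of the assumed `rtl_2digit` pre-tokenization
    rule; not the actual implementation used by this tokenizer.
    """
    chunks = []
    i = len(digits)
    while i > 0:
        j = max(0, i - 2)      # take up to two digits from the right
        chunks.append(digits[j:i])
        i = j
    return list(reversed(chunks))


print(rtl_2digit_chunks("12345"))  # ['1', '23', '45']
```

An odd-length run therefore keeps its single leftover digit at the front, preserving the hundreds/tens/units boundaries that a left-to-right split would break.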
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/mod-tokenizers-bpe_numeric_1")
tokens = tokenizer.encode("Hello, world!")
```
## Files

- `tokenizer.json` — Full HuggingFace tokenizer
- `vocab.json` — Vocabulary mapping
- `merges.txt` — BPE merge rules