---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- ogbert
- modernbert
- opengloss
---
# OGBERT Tokenizer (8K)

An 8,192-token byte-pair encoding (BPE) tokenizer for the [OpenGloss](https://arxiv.org/abs/2511.18622) OGBERT embedding models.

## Usage
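```python
from transformers import AutoTokenizer
# Load the tokenizer files from the Hugging Face Hub (cached locally after the first call).
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-8k")
# encode() returns a list of integer token IDs for the input string.
tokens = tokenizer.encode("hello world")
```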
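To inspect what the tokenizer does with a string, it can help to look at the IDs, the token strings, and the decoded text side by side. The following is a minimal sketch, not part of the original card: `load_tokenizer` and `round_trip` are hypothetical helper names, and the sketch assumes the standard `encode`, `convert_ids_to_tokens`, and `decode` methods of a Transformers tokenizer behave as usual for this repository.

```python
def load_tokenizer(repo_id="mjbommar/ogbert-tokenizer-8k"):
    """Load the tokenizer from the Hub (hypothetical helper; needs network access)."""
    # Imported lazily so the rest of this sketch works without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)


def round_trip(tokenizer, text):
    """Return (ids, token strings, decoded text) for one input string."""
    ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # skip_special_tokens drops markers such as <|cls|> or <|sep|> if any were added.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, tokens, decoded
```

Calling `round_trip(load_tokenizer(), "hello world")` should return the three views of the same input. Whether the decoded text matches the input exactly depends on the tokenizer's normalization, which this card does not describe.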

## Details

- **Vocab Size**: 8,192 (a power of two, 2^13)
- **Space Token**: ID 8191 (the last ID in the vocabulary)
- **Special Tokens**: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Training Data**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
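
The layout listed above can be checked against a loaded tokenizer. This is a hedged sketch rather than an official test: `check_layout` is a hypothetical helper, and it assumes the seven special tokens map to IDs 0 through 6 in the order they are listed, which the card implies but does not state explicitly.

```python
# Assumption: special tokens are assigned IDs 0..6 in the order the card lists them.
EXPECTED_SPECIAL_TOKENS = [
    "<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
    "<|cls|>", "<|sep|>", "<|mask|>",
]
EXPECTED_VOCAB_SIZE = 8192  # 2 ** 13


def check_layout(tokenizer):
    """Return a list of human-readable mismatches; an empty list means all checks passed."""
    problems = []
    size = len(tokenizer)  # full vocabulary size, including special tokens
    if size != EXPECTED_VOCAB_SIZE:
        problems.append(f"vocab size is {size}, expected {EXPECTED_VOCAB_SIZE}")
    for expected_id, token in enumerate(EXPECTED_SPECIAL_TOKENS):
        actual_id = tokenizer.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            problems.append(f"{token} has ID {actual_id}, expected {expected_id}")
    return problems
```

With a tokenizer loaded as in the Usage section, `check_layout(tokenizer)` returns an empty list if the assumptions hold. The space token can be inspected separately with `tokenizer.convert_ids_to_tokens(8191)`.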

## Citation

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Michael J. Bommarito II},
  year={2025},
  eprint={2511.18622},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

Apache 2.0