---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- ogbert
- modernbert
- opengloss
---

# OGBERT Tokenizer (8K)

An 8,192-token BPE tokenizer for [OpenGloss](https://arxiv.org/abs/2511.18622) OGBERT embedding models.

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-8k")

# Encode a string into a list of token IDs
tokens = tokenizer.encode("hello world")
```

## Details

- **Vocab Size**: 8,192 (power of 2)
- **Space Token**: ID 8191
- **Special Tokens**: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Training Data**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)

## Citation

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Michael J. Bommarito II},
  year={2025},
  eprint={2511.18622},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

Apache 2.0
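
As a sanity check on the ID layout listed under Details, the sketch below encodes that layout as plain Python constants and compares it against any token-to-ID lookup. It assumes the special tokens are listed in ID order (so `<|start|>` is 0 and `<|mask|>` is 6); the names `EXPECTED_IDS`, `is_power_of_two`, and `find_mismatches` are illustrative and not part of the tokenizer or of `transformers`.

```python
# Layout as documented under Details.
# Assumption: the special tokens are listed in ID order, 0 through 6.
VOCAB_SIZE = 8192
SPACE_TOKEN_ID = 8191
SPECIAL_TOKENS = [
    "<|start|>",
    "<|end|>",
    "<|pad|>",
    "<|unk|>",
    "<|cls|>",
    "<|sep|>",
    "<|mask|>",
]

# Map each special token to the ID implied by its position in the list.
EXPECTED_IDS = {token: index for index, token in enumerate(SPECIAL_TOKENS)}


def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


def find_mismatches(token_to_id):
    """Return {token: (expected, actual)} for special tokens whose ID differs.

    `token_to_id` is any callable that maps a token string to an ID.
    """
    mismatches = {}
    for token, expected in EXPECTED_IDS.items():
        actual = token_to_id(token)
        if actual != expected:
            mismatches[token] = (expected, actual)
    return mismatches
```

With the tokenizer loaded as in the Usage section, `find_mismatches(tokenizer.convert_tokens_to_ids)` should return an empty dict if the assumed ordering holds, and `len(tokenizer)` can be compared against `VOCAB_SIZE`.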