---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- ogbert
- modernbert
- opengloss
---
# OGBERT Tokenizer (8K)
An 8,192-token BPE tokenizer for [OpenGloss](https://arxiv.org/abs/2511.18622) OGBERT embedding models.
## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-8k")
# encode() returns a list of integer token IDs
tokens = tokenizer.encode("hello world")
```
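As a quick sanity check, the encoded IDs can be decoded back to text. The sketch below assumes only the standard `encode` and `decode` methods of a `transformers` tokenizer. The helper name `roundtrip` is hypothetical, and whether the decoded text matches the input exactly depends on the tokenizer's normalization, which is not specified here.

```python
def roundtrip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string.

    `tokenizer` is any object with Hugging Face style `encode` and
    `decode` methods, such as the one loaded in the snippet above.
    """
    ids = tokenizer.encode(text)
    # skip_special_tokens drops markers such as <|start|> or <|end|> if the
    # tokenizer inserts them while encoding (an assumption, not confirmed here).
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded
```

For example, `ids, text = roundtrip(tokenizer, "hello world")` returns both the ID list and the reconstructed string for side-by-side inspection.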
## Details
- **Vocab Size**: 8,192 (2^13)
- **Space Token**: ID 8191
- **Special Tokens**: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Training Data**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
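The layout above can be written down as constants and checked against a loaded tokenizer. This is a minimal sketch, not part of the published tokenizer: the names `EXPECTED_IDS` and `check_layout` are hypothetical, it assumes that IDs 0-6 are assigned in the order the special tokens are listed, and it assumes that `len(tokenizer)` reports the full vocabulary size.

```python
VOCAB_SIZE = 8192
SPACE_TOKEN_ID = 8191
SPECIAL_TOKENS = [
    "<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
    "<|cls|>", "<|sep|>", "<|mask|>",
]

# Assumption: IDs 0-6 follow the listed order of the special tokens.
EXPECTED_IDS = {token: i for i, token in enumerate(SPECIAL_TOKENS)}

# 8,192 is 2**13, and the space token occupies the last ID.
assert VOCAB_SIZE == 2 ** 13
assert SPACE_TOKEN_ID == VOCAB_SIZE - 1


def check_layout(tokenizer):
    """Return a list of mismatches against the documented layout.

    `tokenizer` needs `__len__` and `convert_tokens_to_ids`, both of
    which Hugging Face tokenizers provide.
    """
    problems = []
    if len(tokenizer) != VOCAB_SIZE:
        problems.append(f"vocab size is {len(tokenizer)}, expected {VOCAB_SIZE}")
    for token, expected in EXPECTED_IDS.items():
        actual = tokenizer.convert_tokens_to_ids(token)
        if actual != expected:
            problems.append(f"{token} has ID {actual}, expected {expected}")
    return problems
```

Calling `check_layout(tokenizer)` on the tokenizer loaded in the Usage section should return an empty list if the assumptions hold. A non-empty list points to either a changed tokenizer or a wrong assumption in this sketch.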
## Citation
```bibtex
@misc{bommarito2025opengloss,
title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
author={Michael J. Bommarito II},
year={2025},
eprint={2511.18622},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## License
Apache 2.0