language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - tokenizer
  - bpe
  - ogbert
  - modernbert
  - opengloss

OGBERT Tokenizer (16K)

A 16,384-token BPE tokenizer for OpenGloss OGBERT embedding models.

Usage

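from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")

# encode() returns a list of integer token IDs, not token strings
tokens = tokenizer.encode("hello world")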

Details

Vocab Size: 16,384 (2^14)
Space Token: ID 16383
Special Tokens: IDs 0-6 (<|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>)
Training Data: mjbommar/opengloss-v1.1-dictionary
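
The layout above can be written down as constants and compared against a loaded tokenizer. The following is a minimal sketch, not part of the published tokenizer: the names `VOCAB_SIZE`, `SPACE_TOKEN_ID`, `SPECIAL_TOKENS`, `EXPECTED_SPECIAL_IDS`, and `check_layout` are hypothetical, and it assumes that the special tokens take IDs 0-6 in the order listed and that `len(tokenizer)` counts the full vocabulary including special tokens.

```python
# Layout as stated in the Details section; all names here are illustrative.
VOCAB_SIZE = 16_384        # 2**14
SPACE_TOKEN_ID = 16_383    # last ID in the vocabulary
SPECIAL_TOKENS = [
    "<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
    "<|cls|>", "<|sep|>", "<|mask|>",
]

# Assumption: special tokens are assigned IDs 0-6 in the order listed.
EXPECTED_SPECIAL_IDS = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}


def check_layout(tokenizer):
    """Return a list of mismatches between a tokenizer and the documented layout."""
    problems = []
    # Assumption: len(tokenizer) includes the special tokens.
    if len(tokenizer) != VOCAB_SIZE:
        problems.append(f"vocab size: expected {VOCAB_SIZE}, got {len(tokenizer)}")
    for token, expected_id in EXPECTED_SPECIAL_IDS.items():
        actual_id = tokenizer.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            problems.append(f"{token}: expected ID {expected_id}, got {actual_id}")
    return problems


# Usage (requires network access and the transformers package):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")
#   print(check_layout(tokenizer))                          # empty list if all match
#   print(tokenizer.convert_ids_to_tokens(SPACE_TOKEN_ID))  # inspect the space token
```

The function returns mismatches rather than raising, so it can be used as a quick sanity check after loading without interrupting a pipeline.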

Citation

@misc{bommarito2025opengloss,
    title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
    author={Michael J. Bommarito II},
    year={2025},
    eprint={2511.18622},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

Apache 2.0