UW/olmo-mix-1124-subset-p99
We developed this SuperBPE tokenizer for model developers who want to experiment quickly with an off-the-shelf tokenizer in their pretraining pipeline! It is an English SuperBPE tokenizer with a vocabulary size of 128K, trained on a subset of the Olmo2 pretraining data.
You can experiment with this tokenizer on our tokenizer playground by entering a custom HF repository ID.
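
For a quick local check outside the playground, the sketch below loads a tokenizer by repository ID and measures how many UTF-8 bytes each token covers on your own text. It is a minimal illustration, not an official usage recipe: it assumes the `transformers` package is installed and that this repository loads with `AutoTokenizer`, and the helper names `load_encoder` and `bytes_per_token` are hypothetical.

```python
from typing import Callable, Iterable, List


def bytes_per_token(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Average number of UTF-8 bytes covered by each token over `texts`."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; check the input texts")
    return total_bytes / total_tokens


def load_encoder(repo_id: str) -> Callable[[str], List[int]]:
    """Return an encode function for a tokenizer hosted on the Hugging Face Hub.

    Assumption: the repository is loadable with AutoTokenizer. The caller
    supplies `repo_id`; no repository ID is hard-coded here.
    """
    # Imported lazily so bytes_per_token can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # Skip special tokens so the count reflects only the text itself.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)
```

To try it, call `load_encoder` with this repository's ID and pass the result to `bytes_per_token` together with a few sample documents from your pipeline. Running the same samples through another tokenizer gives a rough side-by-side comparison of sequence lengths.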