We developed this SuperBPE tokenizer for model developers who want to quickly try an off-the-shelf tokenizer in their pretraining pipeline. It is an English SuperBPE tokenizer with a vocabulary size of 128K, trained on a subset of the Olmo2 pretraining data.

You can try this tokenizer in our tokenizer playground by entering its Hugging Face repository ID, `alisawuffles/superbpe-tokenizer-128k`, as a custom repository.
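
For use outside the playground, the following is a minimal sketch of loading the tokenizer in Python. It assumes the repository `alisawuffles/superbpe-tokenizer-128k` can be loaded through the `transformers` library's `AutoTokenizer`, which this page does not state explicitly. The names `load_tokenizer`, `bytes_per_token`, and `demo` are hypothetical helpers introduced for illustration, not part of any released API.

```python
REPO_ID = "alisawuffles/superbpe-tokenizer-128k"  # repository ID of this tokenizer


def load_tokenizer(repo_id: str = REPO_ID):
    """Download and load the tokenizer (needs `transformers` and network access).

    Assumption: the repository is loadable via AutoTokenizer.
    """
    # Imported lazily so bytes_per_token works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def bytes_per_token(tokenizer, text: str) -> float:
    """Hypothetical helper: average number of UTF-8 bytes covered by each token."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if not ids:
        raise ValueError("text produced no tokens")
    return len(text.encode("utf-8")) / len(ids)


def demo() -> None:
    """Tokenize a sample sentence and print the tokens and a rough efficiency figure."""
    tok = load_tokenizer()
    sample = "By the way, I am a fan of the Milky Way."
    ids = tok.encode(sample, add_special_tokens=False)
    print(tok.convert_ids_to_tokens(ids))
    print(f"{bytes_per_token(tok, sample):.2f} bytes per token")
```

Calling `demo()` downloads the tokenizer files on first use. The bytes-per-token figure is only a quick way to compare encoding efficiency against another tokenizer on your own text; it is not a metric reported on this page.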

