---
license: mit
datasets:
  - UW/olmo-mix-1124-subset-p99
---

We developed this SuperBPE tokenizer for model developers who want to quickly experiment with an off-the-shelf tokenizer in their pretraining pipeline! It is an English SuperBPE tokenizer with a vocabulary size of 128K, trained on a subset of the OLMo 2 pretraining data.
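
Below is a minimal sketch of loading this tokenizer with the `transformers` library. The repo ID in the example is a placeholder; substitute this repository's actual ID.

```python
from transformers import AutoTokenizer

# Placeholder repo ID -- replace with this repository's actual HF ID.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-superbpe-tokenizer")

text = "SuperBPE tokenizers can learn tokens that span whitespace."
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)           # inspect how the text is segmented
print(len(token_ids))   # SuperBPE typically encodes text in fewer tokens than standard BPE
```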

You can experiment with this tokenizer on our tokenizer playground by entering a custom HF repository ID.