---
license: mit
datasets:
  - UW/olmo-mix-1124-subset-p99
---

We developed this SuperBPE tokenizer for model developers who want to quickly experiment with an off-the-shelf tokenizer in their pretraining pipeline! It is an English SuperBPE tokenizer with a vocabulary size of 128K, trained on a subset of the OLMo 2 pretraining data.
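
Below is a minimal sketch of loading this tokenizer with the `transformers` library. The repo ID in the example is a placeholder; substitute this repository's actual ID.

```python
from transformers import AutoTokenizer

# Placeholder repo ID -- replace with this repository's actual HF ID.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-superbpe-tokenizer")

text = "SuperBPE tokenizers can learn tokens that span whitespace."
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)           # inspect how the text is segmented
print(len(token_ids))   # SuperBPE typically encodes text in fewer tokens than standard BPE
```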

You can experiment with this tokenizer on our tokenizer playground by entering a custom HF repository ID.