
TokTok Spanish Tokenizer

BPE tokenizer trained on Spanish Wikipedia using SentencePiece.

Details

  • Vocab size: 8,000
  • Model type: BPE
  • Training data: Spanish Wikipedia (100K articles)
  • Character coverage: 99.95%
  • Byte fallback: enabled

Usage

import sentencepiece as spm

# Load the pretrained model file from this repo
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

text = "El procesamiento de lenguaje natural es fascinante."
tokens = sp.encode_as_pieces(text)  # subword pieces
ids = sp.encode_as_ids(text)        # integer ids for model input
print(tokens)
print(ids)

Why?

Tokenizers trained mostly on English text (e.g. those used by GPT-4 and Llama) typically need 20-40% more tokens to encode Spanish than a Spanish-specific tokenizer does. This tokenizer is trained only on Spanish, so sequences are shorter, which means better compression and cheaper inference on Spanish text.
