SentencePiece Tokenizer for Code & Logs
A BPE tokenizer trained on the CodeSearchNet corpus (~64M code and comment examples).
Vocabulary
- Size: 32,000 tokens
- Character coverage: 1.0 (every character in the training data is covered)
- Special tokens: <pad>, <unk>, <s>, </s>
Usage
import sentencepiece as spm

# Load the released model and run an encode/decode round trip.
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
tokens = sp.encode("def hello_world():", out_type=int)  # list of token ids
text = sp.decode(tokens)                                # back to the original string
Training Data
- CodeSearchNet: Python, Java, JavaScript, Go, Ruby, PHP
- ~64M sentences (code + documentation)
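The card's settings correspond to an spm_train invocation along these lines; the corpus path is hypothetical and the exact flags used for this model are not published, so treat this as a sketch rather than the actual training command.

```shell
spm_train \
  --input=codesearchnet_sentences.txt \
  --model_prefix=sentencepiece \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3
```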