SentencePiece Tokenizer for Code & Logs

A byte-pair encoding (BPE) tokenizer trained with SentencePiece on CodeSearchNet (~64M code and comment examples).

Vocabulary

  • Size: 32,000 tokens
  • Character coverage: 1.0 (every character seen in training is representable)
  • Special tokens: <pad>, <unk>, <s>, </s>
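
BPE builds a vocabulary like this one by starting from individual characters and repeatedly merging the most frequent adjacent pair of symbols until the target size is reached. The following is a minimal toy sketch of that merge loop, not the SentencePiece implementation (which additionally handles whitespace markers, normalization, and corpus-scale training):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

# "he" is the most frequent pair, so it is merged first,
# then "he"+"l", and so on.
merges = bpe_merges(["hello", "hello", "help", "world"], 3)
```

Each learned merge rule becomes one vocabulary entry; with 32,000 entries the tokenizer can represent common identifiers and keywords as single tokens while falling back to smaller pieces for rare strings.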

Usage

import sentencepiece as spm

# Load the trained model file
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

# Encode to token IDs; use out_type=str to get subword pieces instead
tokens = sp.encode("def hello_world():", out_type=int)

# Decode round-trips back to the original string
text = sp.decode(tokens)

Training Data

  • CodeSearchNet: Python, Java, JavaScript, Go, Ruby, PHP
  • ~64M sentences (code plus paired documentation strings)