SentencePiece Tokenizer for Code & Logs
A BPE tokenizer trained on the CodeSearchNet corpus (~64M code and comment examples).
Vocabulary
- Size: 32,000 tokens
- Character coverage: 1.0 (every character in the training data is covered)
- Special tokens: <pad>, <unk>, <s>, </s>
Usage
import sentencepiece as spm

# Load the released model and run an encode/decode round trip.
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
tokens = sp.encode("def hello_world():", out_type=int)  # list of token ids
text = sp.decode(tokens)                                # back to the original string
Training Data
- CodeSearchNet: Python, Java, JavaScript, Go, Ruby, PHP
- ~64M sentences (code + documentation)
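The card's settings correspond to an spm_train invocation along these lines; the corpus path is hypothetical and the exact flags used for this model are not published, so treat this as a sketch rather than the actual training command.

```shell
spm_train \
  --input=codesearchnet_sentences.txt \
  --model_prefix=sentencepiece \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3
```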