Copernicus Tokenizer

Domain-general BPE tokenizer trained from scratch on 3.96 million documents spanning natural language, code, mathematics, and scientific text.

Parameter        Value
Algorithm        Byte-Pair Encoding (BPE)
Vocabulary size  32,685
Merges           32,493
Byte encoding    GPT-2 byte-level (256-char alphabet)
Min frequency    3
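
To make the "Byte encoding" and "Min frequency" rows concrete, here is a minimal pure-Python sketch. `bytes_to_unicode` follows the GPT-2 byte-to-character mapping (every byte value gets a printable character, hence the 256-char alphabet). `learn_merges` is a hypothetical toy trainer, not this repository's training code: it skips pre-tokenization, and its tie-breaking rule is an assumption, so real trainers may order merges differently.

```python
from collections import Counter


def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own code point.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Control and whitespace bytes are shifted above U+00FF.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


def learn_merges(words, num_merges, min_frequency):
    """Toy BPE trainer: merge the most frequent adjacent pair until it is too rare."""
    byte_map = bytes_to_unicode()
    corpus = Counter(tuple(byte_map[b] for b in w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Assumption: ties are broken by the lexicographically largest pair.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to become a token
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges


merges = learn_merges(["low"] * 3 + ["lower"] * 2 + ["new"], num_merges=10, min_frequency=3)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

With `min_frequency=3`, training stops once the best remaining pair occurs only twice, so pairs such as `('e', 'r')` never become merges in this toy corpus.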

Quick start

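from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Calling the tokenizer returns a dict-like BatchEncoding, not a bare list of IDs.
encoding = tokenizer("Hello, world!")
print(encoding["input_ids"])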

Use in a training loop

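from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# return_tensors="pt" requires PyTorch to be installed.
inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,       # cut sequences longer than max_length
    max_length=2048,
    padding="max_length",  # pad every sequence to exactly max_length
    return_tensors="pt",
)
# inputs["input_ids"] and inputs["attention_mask"] both have shape (2, 2048).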
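
The effect of `truncation=True`, `max_length`, and `padding="max_length"` can be sketched without any dependencies. The helper below is hypothetical and only illustrates the shape of the result: the token IDs and `pad_id=0` are made up (the real IDs come from the tokenizer), and right-side padding is assumed.

```python
def pad_and_truncate(batch_ids, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad it to max_length."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]            # truncation=True
        n_pad = max_length - len(ids)     # padding="max_length"
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4, pad_id=0)
print(batch["input_ids"])       # [[5, 6, 7, 0], [8, 9, 10, 11]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

When computing a language-modeling loss, positions where the attention mask is 0 are usually excluded so that padding does not contribute to training.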

Special tokens

Token                                    Role
<|endoftext|>                            BOS / EOS
<|unk|>                                  Unknown
<|pad|>                                  Padding
<think> / </think>                       Chain-of-thought delimiters
<|user|> / <|assistant|> / <|system|>    Chat roles
<|im_start|> / <|im_end|>                ChatML-style markers
<|tool_call|> / <|tool_result|>          Tool use
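
This card does not specify a chat template, so the following is only a sketch of how the ChatML-style markers might be combined into a prompt string. The layout (role name after `<|im_start|>`, one message per block) follows the common ChatML convention and is an assumption, and `format_chatml` is a hypothetical helper rather than part of this repository.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Because the markers are registered as special tokens, each should encode to a single ID rather than being split into sub-word pieces; it is worth confirming this with the loaded tokenizer before training on such prompts.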

Training data

Domain            Source
Natural language  Wikipedia (multilingual), Common Crawl
Code              The Stack
Mathematics       MATH dataset, arXiv
Science           PubMed, S2ORC

Training code: github.com/Nj-1111/copernicus-tokenizer
