license: apache-2.0
tags:
  - tokenizer
  - bpe
  - nlp
  - llm
library_name: transformers

Copernicus Tokenizer

Domain-general BPE tokenizer trained from scratch on 3.96 million documents spanning natural language, code, mathematics, and scientific text.

| Parameter | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 32,685 |
| Merges | 32,493 |
| Byte encoding | GPT-2 byte-level (256-char alphabet) |
| Min frequency | 3 |
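
The "GPT-2 byte-level (256-char alphabet)" entry refers to the scheme in which every UTF-8 byte is mapped to one printable Unicode character before BPE merges are applied, so no input is ever out of alphabet. The sketch below reimplements that well-known GPT-2 mapping in plain Python for illustration; it is not taken from this repository's training code, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Build the GPT-2 style byte -> printable character table (256 entries)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, etc.) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Hypothetical helper: UTF-8 encode, then map each byte to its alphabet character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Hypothetical helper: invert to_byte_level."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


if __name__ == "__main__":
    print(len(BYTE_TO_CHAR))             # size of the base alphabet
    print(to_byte_level("Hello world"))  # the space becomes a visible marker character
```

Because the mapping is a bijection over all 256 byte values, any string round-trips exactly, including non-ASCII text.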

Quick start

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

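# Calling the tokenizer returns a BatchEncoding (input_ids, attention_mask, ...),
# not a bare list of ids.
encoding = tokenizer("Hello, world!")
print(encoding["input_ids"])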

Use in a training loop

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

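# Tokenize a batch into fixed-length PyTorch tensors.
inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,        # cut sequences longer than max_length
    max_length=2048,
    padding="max_length",   # pad shorter sequences to max_length (requires a configured pad token)
    return_tensors="pt",    # requires PyTorch to be installed
)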
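
To make the effect of `truncation=True`, `max_length`, and `padding="max_length"` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of ids. It assumes right-side truncation and padding (the Transformers defaults unless the tokenizer config overrides them). The pad id of `0` is a placeholder, not the real id of `<|pad|>`, and both function names are hypothetical. The label-masking step uses `-100`, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Hypothetical helper: truncate on the right, then pad on the right."""
    ids = list(ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


def make_labels(encoded, ignore_index=-100):
    """Hypothetical helper: copy input_ids, masking padded positions out of the loss."""
    return [
        token if keep else ignore_index
        for token, keep in zip(encoded["input_ids"], encoded["attention_mask"])
    ]


if __name__ == "__main__":
    PAD_ID = 0  # placeholder; read the real value from tokenizer.pad_token_id
    short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=PAD_ID)
    print(short)
    print(make_labels(short))
    print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID))
```

In a real loop the same masking would be applied to the tensors returned by the tokenizer, using `attention_mask` to locate the padded positions.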

Special tokens

| Token | Role |
|---|---|
| `<|endoftext|>` | BOS / EOS |
| `<|unk|>` | Unknown |
| `<|pad|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<|user|>` / `<|assistant|>` / `<|system|>` | Chat roles |
| `<|im_start|>` / `<|im_end|>` | ChatML-style markers |
| `<|tool_call|>` / `<|tool_result|>` | Tool use |
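
This card does not define a chat template, so the following is only a sketch of how the `<|im_start|>` / `<|im_end|>` markers are conventionally laid out in ChatML. The function name `format_chatml`, the message dictionary shape, and the exact newline placement are assumptions for illustration; a model trained with this tokenizer may expect a different layout.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: render messages in a conventional ChatML layout.

    messages: list of {"role": ..., "content": ...} dicts.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so a model can continue from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = format_chatml(
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is BPE?"},
        ]
    )
    print(prompt)
```

Since the markers are registered as special tokens, each one should encode to a single id rather than being split by the BPE merges; this can be checked with `tokenizer.convert_tokens_to_ids`.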

Training data

| Domain | Source |
|---|---|
| Natural language | Wikipedia (multilingual), Common Crawl |
| Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Science | PubMed, S2ORC |

Training code: github.com/Nj-1111/copernicus-tokenizer