Upload folder using huggingface_hub

6cba08c verified 4 days ago

1.67 kB

license: apache-2.0
tags:
  - tokenizer
  - bpe
  - nlp
  - llm
library_name: transformers

Copernicus Tokenizer

Domain-general BPE tokenizer trained from scratch on 3.96 million documents spanning natural language, code, mathematics, and scientific text.

Parameter	Value
Algorithm	Byte-Pair Encoding (BPE)
Vocabulary size	32,685
Merges	32,493
Byte encoding	GPT-2 byte-level (256-char alphabet)
Min frequency	3

Quick start

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

ids = tokenizer("Hello, world!")
print(ids)

Use in a training loop

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt",
)

Special tokens

Token	Role
`<\|endoftext\|>`	BOS / EOS
`<\|unk\|>`	Unknown
`<\|pad\|>`	Padding
`<think>` / `</think>`	Chain-of-thought delimiters
`<\|user\|>` / `<\|assistant\|>` / `<\|system\|>`	Chat roles
`<\|im_start\|>` / `<\|im_end\|>`	ChatML-style markers
`<\|tool_call\|>` / `<\|tool_result\|>`	Tool use

Training data

Domain	Source
Natural language	Wikipedia (multilingual), Common Crawl
Code	The Stack
Mathematics	MATH dataset, arXiv
Science	PubMed, S2ORC

Training code: github.com/Nj-1111/copernicus-tokenizer