---
license: apache-2.0
tags:
- tokenizer
- bpe
- nlp
- llm
library_name: transformers
---

# Copernicus Tokenizer

Domain-general BPE tokenizer trained from scratch on 3.96 million documents spanning natural language, code, mathematics, and scientific text.

| Parameter | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 32,685 |
| Merges | 32,493 |
| Byte encoding | GPT-2 byte-level (256-char alphabet) |
| Min frequency | 3 |

## Quick start

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Calling the tokenizer returns a BatchEncoding; the token IDs are under "input_ids".
encoding = tokenizer("Hello, world!")
print(encoding["input_ids"])
```

## Use in a training loop

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Tokenize a batch, truncating or padding every sequence to exactly max_length tokens.
inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt",  # PyTorch tensors; requires torch to be installed
)
```

## Special tokens

| Token | Role |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown |
| `<\|pad\|>` | Padding |
| `` / `` | Chain-of-thought delimiters |
| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |

## Training data

| Domain | Source |
|---|---|
| Natural language | Wikipedia (multilingual), Common Crawl |
| Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Science | PubMed, S2ORC |

Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)
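
## Byte-level alphabet (illustrative sketch)

The "GPT-2 byte-level (256-char alphabet)" entry in the parameter table means that text is first converted to UTF-8 bytes, and each of the 256 possible byte values is represented by one printable Unicode character before BPE merges are applied. The sketch below is not taken from the training code. It assumes the standard GPT-2 byte-to-character mapping, and the helper names `bytes_to_unicode`, `to_byte_level`, and `from_byte_level` are illustrative.

```python
def bytes_to_unicode() -> dict:
    """Map each of the 256 byte values to one printable Unicode character
    (assumed to follow the standard GPT-2 scheme)."""
    # Bytes that are already printable, non-whitespace characters map to themselves.
    printable = (
        list(range(0x21, 0x7F))    # '!' through '~'
        + list(range(0xA1, 0xAD))  # U+00A1 through U+00AC
        + list(range(0xAE, 0x100)) # U+00AE through U+00FF
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte (control characters, space, etc.) is assigned
    # the next unused code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Encode text as UTF-8 and replace each byte with its alphabet character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols: str) -> str:
    """Invert to_byte_level: map characters back to bytes and decode as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


if __name__ == "__main__":
    sample = "def foo(): pass\n"
    encoded = to_byte_level(sample)
    print(encoded)
    print(from_byte_level(encoded) == sample)
```

Under this mapping a space becomes `Ġ` and a newline becomes `Ċ`, which is why those characters show up inside vocabulary entries of byte-level BPE tokenizers. Because every byte has a symbol, any UTF-8 input can be represented without falling back to an unknown token at the byte level.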