Nj-1111's picture
Upload folder using huggingface_hub
6cba08c verified
---
license: apache-2.0
tags:
- tokenizer
- bpe
- nlp
- llm
library_name: transformers
---
# Copernicus Tokenizer
Domain-general BPE tokenizer trained from scratch on 3.96 million documents
spanning natural language, code, mathematics, and scientific text.
| Parameter | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 32,685 |
| Merges | 32,493 |
| Byte encoding | GPT-2 byte-level (256-char alphabet) |
| Min frequency | 3 |
## Quick start
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
ids = tokenizer("Hello, world!")
print(ids)
```
## Use in a training loop
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")
inputs = tokenizer(
["Hello world", "def foo(): pass"],
truncation=True,
max_length=2048,
padding="max_length",
return_tensors="pt",
)
```
## Special tokens
| Token | Role |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown |
| `<\|pad\|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |
## Training data
| Domain | Source |
|---|---|
| Natural language | Wikipedia (multilingual), Common Crawl |
| Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Science | PubMed, S2ORC |
Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)