Copernicus Tokenizer
A domain-general BPE tokenizer trained from scratch on 3.96 million documents spanning natural language, code, mathematics, and scientific text.
| Parameter | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 32,685 |
| Merges | 32,493 |
| Byte encoding | GPT-2 byte-level (256-char alphabet) |
| Min frequency | 3 |
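The "GPT-2 byte-level (256-char alphabet)" entry refers to the scheme in which every UTF-8 byte is mapped to a printable Unicode character before BPE merges are applied, so no input can fall outside the base alphabet. The sketch below reproduces that well-known mapping in plain Python for illustration; the helper names `bytes_to_unicode` and `to_byte_level` are chosen here and are not taken from this repository's training code.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control characters, space, etc. are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text: str) -> str:
    """Render text as the byte-level string that BPE merges operate on."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(len(BYTE_TO_CHAR))             # size of the base alphabet
print(to_byte_level("Hello world"))  # the space becomes a visible marker character
```

Because the mapping is a bijection over all 256 byte values, any string can be encoded and decoded losslessly, which is why a byte-level tokenizer rarely needs its unknown token in practice.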
Quick start
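```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Calling the tokenizer returns a BatchEncoding (input_ids, attention_mask, ...),
# not a bare list of ids.
encoding = tokenizer("Hello, world!")
print(encoding["input_ids"])
```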
Use in a training loop
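```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Tokenize a batch into fixed-length tensors (return_tensors="pt" requires PyTorch).
inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,       # cut sequences longer than max_length
    max_length=2048,
    padding="max_length",  # pad shorter sequences up to max_length
    return_tensors="pt",
)
```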
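To make the effect of `truncation=True` together with `padding="max_length"` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of ids. It assumes right-side padding and ignores any special-token handling the real tokenizer may apply; `pad_or_truncate` and the `pad_id` value are hypothetical names used only for illustration.

```python
def pad_or_truncate(
    ids: list[int], max_length: int, pad_id: int
) -> tuple[list[int], list[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]              # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)     # 1 marks real tokens
    n_pad = max_length - len(ids)
    # Padding on the right (assumed); 0 in the mask marks padding positions.
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# pad_id=0 is a placeholder, not the actual id of <|pad|> in this vocabulary.
input_ids, attention_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
print(input_ids)
print(attention_mask)
```

Padding every example to 2048 tokens is simple but can waste compute on short inputs; padding to the longest sequence in each batch is a common alternative.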
Special tokens
| Token | Role |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown |
| `<\|pad\|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |
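The table lists the chat markers but does not specify a chat template. As an illustration only, the sketch below lays out a conversation using the common ChatML convention with `<|im_start|>` and `<|im_end|>`; the exact layout and the helper name `format_chatml` are assumptions, not something this tokenizer is documented to require.

```python
def format_chatml(
    messages: list[dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render messages in a ChatML-style layout (assumed, not confirmed by the source)."""
    parts = []
    for message in messages:
        # Each turn: start marker, role name, newline, content, end marker.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is BPE?"},
    ]
)
print(prompt)
```

Since the markers are registered as special tokens, each one should encode to a single id rather than being split by BPE merges; this is worth checking against the loaded tokenizer before training on such a format.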
Training data
| Domain | Source |
|---|---|
| Natural language | Wikipedia (multilingual), Common Crawl |
| Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Science | PubMed, S2ORC |
Training code: github.com/Nj-1111/copernicus-tokenizer