File size: 1,669 Bytes
6cba08c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
---
license: apache-2.0
tags:
  - tokenizer
  - bpe
  - nlp
  - llm
library_name: transformers
---

# Copernicus Tokenizer

Domain-general BPE tokenizer trained from scratch on 3.96 million documents
spanning natural language, code, mathematics, and scientific text.

| Parameter | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 32,685 |
| Merges | 32,493 |
| Byte encoding | GPT-2 byte-level (256-char alphabet) |
| Min frequency | 3 |

## Quick start

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

ids = tokenizer("Hello, world!")
print(ids)
```

## Use in a training loop

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt",
)
```

## Special tokens

| Token | Role |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown |
| `<\|pad\|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |

## Training data

| Domain | Source |
|---|---|
| Natural language | Wikipedia (multilingual), Common Crawl |
| Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Science | PubMed, S2ORC |

Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)