Instructions to use Nj-1111/Copernicus-Tokenizer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Nj-1111/Copernicus-Tokenizer with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Nj-1111/Copernicus-Tokenizer", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| tags: | |
| - tokenizer | |
| - bpe | |
| - nlp | |
| - llm | |
| library_name: transformers | |
| # Copernicus Tokenizer | |
| Domain-general BPE tokenizer trained from scratch on 3.96 million documents | |
| spanning natural language, code, mathematics, and scientific text. | |
| | Parameter | Value | | |
| |---|---| | |
| | Algorithm | Byte-Pair Encoding (BPE) | | |
| | Vocabulary size | 32,685 | | |
| | Merges | 32,493 | | |
| | Byte encoding | GPT-2 byte-level (256-char alphabet) | | |
| | Min frequency | 3 | | |
| ## Quick start | |
| ```python | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer") | |
| ids = tokenizer("Hello, world!") | |
| print(ids) | |
| ``` | |
| ## Use in a training loop | |
| ```python | |
| from transformers import PreTrainedTokenizerFast | |
| tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer") | |
| inputs = tokenizer( | |
| ["Hello world", "def foo(): pass"], | |
| truncation=True, | |
| max_length=2048, | |
| padding="max_length", | |
| return_tensors="pt", | |
| ) | |
| ``` | |
| ## Special tokens | |
| | Token | Role | | |
| |---|---| | |
| | `<\|endoftext\|>` | BOS / EOS | | |
| | `<\|unk\|>` | Unknown | | |
| | `<\|pad\|>` | Padding | | |
| | `<think>` / `</think>` | Chain-of-thought delimiters | | |
| | `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles | | |
| | `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers | | |
| | `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use | | |
| ## Training data | |
| | Domain | Source | | |
| |---|---| | |
| | Natural language | Wikipedia (multilingual), Common Crawl | | |
| | Code | The Stack | | |
| | Mathematics | MATH dataset, arXiv | | |
| | Science | PubMed, S2ORC | | |
| Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer) | |