Upload folder using huggingface_hub

Files changed (5)

README.md +72 -0
merges.txt +0 -0
special_tokens_map.json +164 -0
tokenizer_config.json +28 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,72 @@

+---
+license: apache-2.0
+tags:
+  - tokenizer
+  - bpe
+  - nlp
+  - llm
+library_name: transformers
+---
+# Copernicus Tokenizer
+Domain-general BPE tokenizer trained from scratch on 3.96 million documents
+spanning natural language, code, mathematics, and scientific text.
+| Parameter | Value |
+|---|---|
+| Algorithm | Byte-Pair Encoding (BPE) |
+| Vocabulary size | 32,685 |
+| Merges | 32,493 |
+| Byte encoding | GPT-2 byte-level (256-char alphabet) |
+| Min frequency | 3 |
+## Quick start
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
+encoding = tokenizer("Hello, world!")  # a BatchEncoding, not a bare list of IDs
+print(encoding["input_ids"])
+```
+## Use in a training loop
+```python
+from transformers import PreTrainedTokenizerFast
+tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")
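+# Pad and truncate every example to a fixed 2,048 tokens and return PyTorch tensors
+inputs = tokenizer(
+    ["Hello world", "def foo(): pass"],
+    truncation=True,
+    max_length=2048,
+    padding="max_length",
+    return_tensors="pt",
+)
+# inputs["input_ids"] and inputs["attention_mask"] both have shape (2, 2048)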
+```
+## Special tokens
+| Token | Role |
+|---|---|
+| `<\|endoftext\|>` | BOS / EOS |
+| `<\|unk\|>` | Unknown |
+| `<\|pad\|>` | Padding |
+| `<think>` / `</think>` | Chain-of-thought delimiters |
+| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
+| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
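+| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |
+| `<scratchpad>` / `</scratchpad>`, `<verify>` / `</verify>`, `<reflect>` / `</reflect>` | Additional paired delimiters |
+| `<\|sep\|>` | Separator |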
+## Training data
+| Domain | Source |
+|---|---|
+| Natural language | Wikipedia (multilingual), Common Crawl |
+| Code | The Stack |
+| Mathematics | MATH dataset, arXiv |
+| Science | PubMed, S2ORC |
+Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)
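
The README lists the byte encoding as "GPT-2 byte-level (256-char alphabet)". As a rough illustration of what that usually means, the sketch below rebuilds the byte-to-character table in the style of GPT-2's published encoder. It assumes this tokenizer uses the same mapping, which the commit does not confirm. The names `bytes_to_unicode`, `BYTE_TO_CHAR`, and `to_byte_level` are illustrative and not part of this repository.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character.

    Printable Latin-1 bytes map to themselves; the remaining bytes
    (control characters, space, etc.) are shifted to code points >= 256.
    This follows the scheme used by GPT-2's encoder and is an assumption
    about this tokenizer, not something read from its files.
    """
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shifted = 0
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + shifted)
            shifted += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Render text the way it would appear before BPE merges are applied."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(len(BYTE_TO_CHAR))
print(to_byte_level("Hello world"))
```

Under this mapping a space becomes `Ġ` (U+0120), which is why entries in byte-level `vocab.json` files often start with that character.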

merges.txt ADDED

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED

	@@ -0,0 +1,164 @@

+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "single_word": false,
+    "lstrip": false,
+    "rstrip": false,
+    "normalized": false,
+    "special": true
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "single_word": false,
+    "lstrip": false,
+    "rstrip": false,
+    "normalized": false,
+    "special": true
+  },
+  "unk_token": {
+    "content": "<|unk|>",
+    "single_word": false,
+    "lstrip": false,
+    "rstrip": false,
+    "normalized": false,
+    "special": true
+  },
+  "pad_token": {
+    "content": "<|pad|>",
+    "single_word": false,
+    "lstrip": false,
+    "rstrip": false,
+    "normalized": false,
+    "special": true
+  },
+  "additional_special_tokens": [
+    {
+      "content": "<think>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "</think>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<scratchpad>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "</scratchpad>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<verify>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "</verify>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<reflect>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "</reflect>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|user|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|assistant|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|system|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|tool_call|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|tool_result|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|sep|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|im_start|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "content": "<|im_end|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    }
+  ]
+}

tokenizer_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "model_max_length": 4096,
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|unk|>",
+  "pad_token": "<|pad|>",
+  "additional_special_tokens": [
+    "<think>",
+    "</think>",
+    "<scratchpad>",
+    "</scratchpad>",
+    "<verify>",
+    "</verify>",
+    "<reflect>",
+    "</reflect>",
+    "<|user|>",
+    "<|assistant|>",
+    "<|system|>",
+    "<|tool_call|>",
+    "<|tool_result|>",
+    "<|sep|>",
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "clean_up_tokenization_spaces": false,
+  "add_prefix_space": false
+}
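
Both `special_tokens_map.json` and `tokenizer_config.json` declare the same special tokens, once as objects with a `content` field and once as plain strings. A minimal sketch of a consistency check between the two follows. The helper names (`token_text`, `compare_special_tokens`) are hypothetical, and the dictionaries are abbreviated excerpts of the files in this commit rather than the full contents.

```python
def token_text(value):
    """Return the token string from either a plain string or a {"content": ...} object."""
    if isinstance(value, dict):
        return value["content"]
    return value


def compare_special_tokens(tokens_map, config):
    """Return {key: (map_value, config_value)} for every key where the files disagree."""
    problems = {}
    for key in ("bos_token", "eos_token", "unk_token", "pad_token"):
        in_map = token_text(tokens_map[key]) if key in tokens_map else None
        in_cfg = token_text(config[key]) if key in config else None
        if in_map != in_cfg:
            problems[key] = (in_map, in_cfg)
    extra_map = [token_text(t) for t in tokens_map.get("additional_special_tokens", [])]
    extra_cfg = [token_text(t) for t in config.get("additional_special_tokens", [])]
    if extra_map != extra_cfg:
        problems["additional_special_tokens"] = (extra_map, extra_cfg)
    return problems


# Abbreviated excerpts; in practice both objects would come from json.load().
tokens_map_excerpt = {
    "bos_token": {"content": "<|endoftext|>", "special": True},
    "eos_token": {"content": "<|endoftext|>", "special": True},
    "unk_token": {"content": "<|unk|>", "special": True},
    "pad_token": {"content": "<|pad|>", "special": True},
    "additional_special_tokens": [
        {"content": "<think>", "special": True},
        {"content": "</think>", "special": True},
    ],
}
config_excerpt = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "unk_token": "<|unk|>",
    "pad_token": "<|pad|>",
    "additional_special_tokens": ["<think>", "</think>"],
}

problems = compare_special_tokens(tokens_map_excerpt, config_excerpt)
print(problems or "special tokens agree")
```

An empty result means the two excerpts agree; any mismatch is reported with the value found in each file.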

vocab.json ADDED

The diff for this file is too large to render. See raw diff