JonathanMiddleton committed on
Commit f119a93 · verified · 1 Parent(s): d4a5488

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +131 -0
  2. chat_template.jinja +50 -0
  3. special_tokens_map.json +164 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +199 -0
README.md ADDED
@@ -0,0 +1,131 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - tokenizer
+ - bpe
+ - byte-level
+ - chatml
+ - tool-use
+ - code
+ - python
+ pipeline_tag: text-generation
+ datasets:
+ - nvidia/Nemotron-CC-HQ
+ - HuggingFaceTB/smoltalk
+ - sahil2801/CodeAlpaca-20k
+ ---
+
+ # Daisy Tokenizer v2
+
+ A custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
+
+ ## Details
+
+ | Property | Value |
+ |----------|-------|
+ | **Vocabulary size** | 49,152 |
+ | **Algorithm** | Byte-level BPE |
+ | **Pre-tokenizer** | Llama-3 style regex |
+ | **Chat format** | ChatML |
+ | **Max length** | 131,072 tokens |
+ | **Training date** | 2026-01-14 |
+
+ ## Features
+
+ - **Python-optimized**: Trained on Python code for efficient tokenization
+ - **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
+ - **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
+ - **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
+ - **No UNK tokens**: Byte-level fallback handles any Unicode input
+
+ ## Special Tokens
+
+ | Token | ID | Purpose |
+ |----------------------|-------|----------------------------|
+ | `<\|endoftext\|>` | 49131 | End of sequence / BOS |
+ | `<\|pad\|>` | 49132 | Padding token |
+ | `<\|im_start\|>` | 49133 | Start of message (ChatML) |
+ | `<\|im_end\|>` | 49134 | End of message (ChatML) |
+ | `<\|tool_call\|>` | 49135 | Start of tool call |
+ | `<\|/tool_call\|>` | 49136 | End of tool call |
+ | `<\|tool_result\|>` | 49137 | Start of tool result |
+ | `<\|/tool_result\|>` | 49138 | End of tool result |
+ | `<\|python\|>` | 49139 | Start of Python expression |
+ | `<\|/python\|>` | 49140 | End of Python expression |
+ | `<\|output\|>` | 49141 | Start of computed output |
+ | `<\|/output\|>` | 49142 | End of computed output |
+ | `<\|think\|>` | 49143 | Start of reasoning block |
+ | `<\|/think\|>` | 49144 | End of reasoning block |
+ | `<\|system\|>` | 49145 | System role marker |
+ | `<\|user\|>` | 49146 | User role marker |
+ | `<\|assistant\|>` | 49147 | Assistant role marker |
+ | `<\|reserved_0\|>` | 49148 | Reserved |
+ | `<\|reserved_1\|>` | 49149 | Reserved |
+ | `<\|reserved_2\|>` | 49150 | Reserved |
+ | `<\|reserved_3\|>` | 49151 | Reserved |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy-tokenizer-v2")
+
+ # Basic encoding
+ tokens = tokenizer.encode("Hello, world!")
+
+ # Chat formatting
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Hello!"},
+     {"role": "assistant", "content": "Hi there! How can I help you?"},
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False)
+ ```
+
+ ## Chat Template Format
+
+ ```
+ <|im_start|>system
+ {system_message}<|im_end|>
+ <|im_start|>user
+ {user_message}<|im_end|>
+ <|im_start|>assistant
+ {assistant_message}<|im_end|>
+ ```
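As a rough illustration of the layout above, the ChatML string can be assembled by hand without loading the tokenizer. This is a minimal sketch mirroring the documented format; the `to_chatml` helper is hypothetical, and the shipped Jinja template remains authoritative (including its whitespace handling).

```python
# Hypothetical helper: assemble the documented ChatML layout by hand.
# Assumes one "\n" between role marker and content, and one between messages.
def to_chatml(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts) + "\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
rendered = to_chatml(messages)
```

With the real tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False)` should produce an equivalent string.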
+
+ ### Tool Calling Example
+
+ ```
+ <|im_start|>assistant
+ Let me calculate that for you.
+ <|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
+ <|tool_result|>4<|/tool_result|>
+ The answer is 4.<|im_end|>
+ ```
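On the consuming side, the JSON payload between the tool-call delimiters can be extracted with a regex. The marker strings come from this tokenizer's special tokens; the parsing code itself is an illustrative sketch, not part of the released artifacts.

```python
import json
import re

# The delimiters match this tokenizer's <|tool_call|> / <|/tool_call|> tokens;
# the pipes must be escaped in the regex. DOTALL lets payloads span lines.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(text):
    """Return every tool-call payload in `text` as a parsed dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

sample = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
calls = extract_tool_calls(sample)
```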
+
+ ## Compression Ratios
+
+ | Content Type | Chars/Token | vs GPT-2 |
+ |--------------|-------------|----------|
+ | English prose | ~4.0 | baseline |
+ | Python code | ~3.8 | +15% better |
+
+ Run validation to see detailed compression ratios:
+ ```bash
+ python tools/validate_tokenizer.py --tokenizer tokenizer/daisy-v2
+ ```
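The chars-per-token figures above are presumably total characters divided by total tokens over a sample corpus. A minimal sketch of that metric, using a toy whitespace "encoder" as a stand-in so it runs without the tokenizer:

```python
# Sketch of the chars-per-token metric: total characters / total tokens.
def chars_per_token(texts, encode):
    chars = sum(len(t) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return chars / tokens

# Toy stand-in encoder (whitespace split) just to make the sketch runnable;
# with the real tokenizer you would pass tokenizer.encode instead.
ratio = chars_per_token(["Hello, world!"], lambda t: t.split())
```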
+
+ ## Training Data
+
+ - **General text**: nvidia/Nemotron-CC-HQ (~60%)
+ - **Python code**: bigcode/the-stack-dedup, CodeAlpaca (~25%)
+ - **Instructions**: HuggingFaceTB/smoltalk, OpenHermes (~15%)
+
+ ## License
+
+ Apache 2.0
+
chat_template.jinja ADDED
@@ -0,0 +1,50 @@
+ {#- Daisy Chat Template v2 -#}
+ {#- Supports: ChatML format, tool calling, multipart content -#}
+
+ {#- Macro to render content (string or multipart) -#}
+ {%- macro render_content(content) -%}
+ {%- if content is string -%}
+ {{ content }}
+ {%- elif content is iterable -%}
+ {%- for part in content -%}
+ {%- if part.type == 'text' -%}
+ {{ part.text }}
+ {%- elif part.type == 'tool_call' -%}
+ <|tool_call|>{{ part.text }}<|/tool_call|>
+ {%- elif part.type == 'tool_result' -%}
+ <|tool_result|>{{ part.text }}<|/tool_result|>
+ {%- elif part.type == 'python' -%}
+ <|python|>{{ part.text }}<|/python|>
+ {%- elif part.type == 'output' -%}
+ <|output|>{{ part.text }}<|/output|>
+ {%- elif part.type == 'think' -%}
+ <|think|>{{ part.text }}<|/think|>
+ {%- endif -%}
+ {%- endfor -%}
+ {%- else -%}
+ {{ content }}
+ {%- endif -%}
+ {%- endmacro -%}
+
+ {#- Main message loop -#}
+ {%- for message in messages -%}
+ {%- if message.role == 'system' -%}
+ <|im_start|>system
+ {{ message.content }}<|im_end|>
+ {% elif message.role == 'user' -%}
+ <|im_start|>user
+ {{ message.content }}<|im_end|>
+ {% elif message.role == 'assistant' -%}
+ <|im_start|>assistant
+ {% generation %}{{ render_content(message.content) }}{% endgeneration %}<|im_end|>
+ {% elif message.role == 'tool' -%}
+ <|tool_result|>{{ message.content }}<|/tool_result|>
+ {%- endif -%}
+ {%- endfor -%}
+
+ {#- Generation prompt -#}
+ {%- if add_generation_prompt -%}
+ <|im_start|>assistant
+ {% generation %}{% endgeneration %}
+ {%- endif -%}
+
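The `render_content` macro's multipart branch can be mirrored in plain Python, which makes its behavior easy to check. The tag names below match the template's `part.type` values; the function is illustrative only and is not shipped with the tokenizer.

```python
# Pure-Python mirror of the template's render_content macro.
# Types in WRAPPED are emitted between <|type|> ... <|/type|> markers,
# 'text' parts pass through, and plain strings are returned unchanged.
WRAPPED = {"tool_call", "tool_result", "python", "output", "think"}

def render_content(content):
    if isinstance(content, str):
        return content
    out = []
    for part in content:
        if part["type"] == "text":
            out.append(part["text"])
        elif part["type"] in WRAPPED:
            t = part["type"]
            out.append(f"<|{t}|>{part['text']}<|/{t}|>")
    return "".join(out)

parts = [
    {"type": "think", "text": "2 + 2 = 4"},
    {"type": "text", "text": "The answer is 4."},
]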
special_tokens_map.json ADDED
@@ -0,0 +1,164 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "pad_token": {
+     "content": "<|pad|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "additional_special_tokens": [
+     {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_0|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   ]
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,199 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "49131": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49132": {
+       "content": "<|pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49133": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49134": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49135": {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49136": {
+       "content": "<|/tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49137": {
+       "content": "<|tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49138": {
+       "content": "<|/tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49139": {
+       "content": "<|python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49140": {
+       "content": "<|/python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49141": {
+       "content": "<|output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49142": {
+       "content": "<|/output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49143": {
+       "content": "<|think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49144": {
+       "content": "<|/think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49145": {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49146": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49147": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49148": {
+       "content": "<|reserved_0|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49149": {
+       "content": "<|reserved_1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49150": {
+       "content": "<|reserved_2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49151": {
+       "content": "<|reserved_3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|tool_call|>",
+     "<|/tool_call|>",
+     "<|tool_result|>",
+     "<|/tool_result|>",
+     "<|python|>",
+     "<|/python|>",
+     "<|output|>",
+     "<|/output|>",
+     "<|think|>",
+     "<|/think|>",
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>",
+     "<|reserved_0|>",
+     "<|reserved_1|>",
+     "<|reserved_2|>",
+     "<|reserved_3|>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "<|pad|>",
+   "unk_token": null,
+   "clean_up_tokenization_spaces": false,
+   "model_max_length": 131072,
+   "tokenizer_class": "PreTrainedTokenizerFast"
+ }