# Nexus Universal Tokenizer

This is the foundational byte-level BPE (Byte-Pair Encoding) tokenizer built for the **Nexus** series of large language models (starting with Nexus2 1B).

It is designed for efficiency and supports multilingual text, code, mathematics, medical literature, vision, tool calling, and reasoning through specialized control tokens.

## Architecture
- **Base Style:** Qwen2 / ByteLevel BPE
- **Vocabulary Size:** 151,936
- **Chat Template:** ChatML (`<|im_start|>` / `<|im_end|>`)

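To illustrate the ChatML layout named above, here is a minimal sketch in plain Python. The helper `format_chatml` is hypothetical and only mirrors the commonly used ChatML convention; the template actually shipped with this tokenizer may differ in details such as default system prompts or whitespace.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render {"role", "content"} messages in the common ChatML layout (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>role\ncontent<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` from Transformers renders whatever template is stored with the tokenizer, which is the safer choice than hand-built strings.
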
## Special / Control Tokens
The vocabulary includes dedicated structural tokens for reasoning and agentic behavior:
* `<think>` / `</think>` (Chain-of-Thought reasoning emitted before the final output)
* `<call>` / `</call>` (Function/API tool calling)
* `<|im_start|>` / `<|im_end|>` (ChatML role structure)
* `<|image_pad|>` (Placeholder for image features in vision-language model projection)
* `<|endoftext|>` (Standard EOS)
* `<|pad|>` (Standard PAD)

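As a rough sketch of how downstream code might consume these tokens, the example below separates `<think>` and `<call>` spans from the visible answer in decoded text. The helper `split_generation` is hypothetical, and the payload format inside `<call>` is not specified here, so it is kept as a raw string. It assumes the text was decoded with the control tokens retained.

```python
import re
from typing import Dict, List

# Non-greedy matches so multiple spans in one generation are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CALL_RE = re.compile(r"<call>(.*?)</call>", re.DOTALL)


def split_generation(text: str) -> Dict[str, object]:
    """Separate reasoning spans, tool-call payloads, and the remaining answer."""
    reasoning: List[str] = [m.strip() for m in THINK_RE.findall(text)]
    calls: List[str] = [m.strip() for m in CALL_RE.findall(text)]
    # Remove both kinds of span; what remains is the user-facing answer.
    answer = CALL_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"reasoning": reasoning, "calls": calls, "answer": answer}


sample = "<think>2 + 2 is 4.</think>The answer is 4."
print(split_generation(sample))
```

Regex parsing like this is only suitable for well-formed output; an unclosed `<think>` span would be left in the answer text, so production code would likely need stricter handling.
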
## Training Data ("The Soup")
The tokenizer was trained from scratch on a curated stream of 1 million samples, mixed to give balanced sub-word tokenization across multiple domains:

1. **General Web (30%):** `HuggingFaceFW/fineweb-edu` (300,000 samples)
2. **Code (25%):** `bigcode/the-stack-smol-xl` (250,000 samples)
3. **Math & STEM (20%):** `open-web-math/open-web-math` (200,000 samples)
4. **Medical (15%):** `medalpaca/medical_meadow_wikidoc` (150,000 samples)
5. **Multilingual (10%):** `CohereForAI/aya_dataset` (100,000 samples)

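The per-source sample counts follow directly from the mixture percentages and the 1,000,000-sample budget. The sketch below only reproduces that arithmetic; the function name `samples_per_source` is hypothetical, and the actual sampling script is not shown in this README.

```python
from typing import Dict

# Mixture percentages as listed above (integers avoid floating-point drift).
MIXTURE_PERCENT: Dict[str, int] = {
    "HuggingFaceFW/fineweb-edu": 30,
    "bigcode/the-stack-smol-xl": 25,
    "open-web-math/open-web-math": 20,
    "medalpaca/medical_meadow_wikidoc": 15,
    "CohereForAI/aya_dataset": 10,
}


def samples_per_source(total: int, mixture_percent: Dict[str, int]) -> Dict[str, int]:
    """Convert percentage weights into absolute sample counts."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(1_000_000, MIXTURE_PERCENT)
for name, count in counts.items():
    print(f"{name}: {count:,}")
```
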
## Usage in Transformers
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mnnobi/Nexus-Universal-Tokenizer")

# Encode text that contains control tokens and print the resulting token IDs.
print(tokenizer.encode("Hello world! <think> This is a test. </think>"))
```