Nexus Universal Tokenizer
This is the foundational Byte-Level BPE (Byte-Pair Encoding) Tokenizer built for the Nexus series of Large Language Models (starting with Nexus2 1B).
It is designed for efficiency across multilingual text, code, mathematics, and medical literature, and it supports vision, tool calling, and reasoning through specialized control tokens.
Architecture
- Base Style: Qwen2 / ByteLevel BPE
- Vocabulary Size: 151,936
- Chat Template: ChatML (<|im_start|>/<|im_end|>)
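To illustrate what the ChatML layout looks like, here is a minimal sketch that renders a conversation by hand. It assumes the common ChatML convention (role name, newline, content, end marker); the exact template shipped with this tokenizer is not shown here, and the `to_chatml` helper is hypothetical. In practice, `tokenizer.apply_chat_template` from Transformers is the usual way to apply whatever template the repository defines.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Assumption: the common ChatML layout; the tokenizer's real template may differ.
    """
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Each turn is wrapped in a start/end pair, and the trailing open assistant turn marks where generation begins.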
Special / Control Tokens
This tokenizer natively includes modern structural tokens to allow reasoning and agentic behaviors:
- <think>/</think> (for chain-of-thought reasoning prior to output)
- <call>/</call> (for function/API tool calling)
- <|im_start|>/<|im_end|> (ChatML role structures)
- <|image_pad|> (for Vision-Language-Model projection mapping)
- <|endoftext|> (standard EOS)
- <|pad|> (standard PAD)
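As a rough sketch of how downstream code might consume these control tokens, the snippet below separates reasoning spans and tool-call spans from the rest of a decoded model output. The `split_output` helper, the sample string, and the payload inside `<call>` are all hypothetical; the format of a tool-call payload is not specified here, so the payload is returned as raw text.

```python
import re

# Non-greedy matches so multiple spans in one output are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CALL_RE = re.compile(r"<call>(.*?)</call>", re.DOTALL)


def split_output(text):
    """Split decoded text into reasoning spans, tool-call spans, and the remainder."""
    reasoning = [span.strip() for span in THINK_RE.findall(text)]
    calls = [span.strip() for span in CALL_RE.findall(text)]
    # Remove both kinds of span to leave only the user-facing answer.
    answer = CALL_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"reasoning": reasoning, "calls": calls, "answer": answer}


# Hypothetical model output; the call payload format is an assumption.
sample = (
    "<think>User wants the weather.</think>"
    '<call>get_weather(city="Paris")</call>'
    "Checking now."
)
result = split_output(sample)
print(result)
```

Keeping the call payload as an opaque string avoids committing to a schema that the tokenizer itself does not define.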
Training Data ("The Soup")
The tokenizer was trained from scratch on a curated stream of 1 million samples to ensure balanced sub-word tokenization across multiple domains:
- General Web (30%): HuggingFaceFW/fineweb-edu (300,000 samples)
- Code (25%): bigcode/the-stack-smol-xl (250,000 samples)
- Math & STEM (20%): open-web-math/open-web-math (200,000 samples)
- Medical (15%): medalpaca/medical_meadow_wikidoc (150,000 samples)
- Multilingual (10%): CohereForAI/aya_dataset (100,000 samples)
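The per-source sample counts follow directly from the percentages and the 1,000,000-sample total. The short sketch below recomputes them as a sanity check. The `samples_per_source` helper is hypothetical, and how the sources were actually interleaved during training is not described here.

```python
TOTAL_SAMPLES = 1_000_000

# Share of the training stream per source, in whole percent (from the list above).
MIX_PERCENT = {
    "HuggingFaceFW/fineweb-edu": 30,
    "bigcode/the-stack-smol-xl": 25,
    "open-web-math/open-web-math": 20,
    "medalpaca/medical_meadow_wikidoc": 15,
    "CohereForAI/aya_dataset": 10,
}


def samples_per_source(mix_percent, total):
    """Convert whole-percent shares into absolute sample counts."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


counts = samples_per_source(MIX_PERCENT, TOTAL_SAMPLES)
for name, count in counts.items():
    print(f"{name}: {count:,}")
```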
Usage in Transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mnnobi/Nexus-Universal-Tokenizer")
print(tokenizer.encode("Hello world! <think> This is a test. </think>"))