Nexus Universal Tokenizer

This is the foundational byte-level BPE (Byte-Pair Encoding) tokenizer built for the Nexus series of large language models, starting with Nexus2 1B.

It is designed to cover multilingual text, code, mathematics, and medical literature in a single vocabulary, and it supports vision inputs, tool calling, and reasoning through dedicated control tokens.

Architecture

Base Style: Qwen2 / ByteLevel BPE
Vocabulary Size: 151,936
Chat Template: ChatML (<|im_start|> / <|im_end|>)
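
To make the ChatML layout concrete, here is a minimal sketch of how a conversation is typically rendered with <|im_start|> and <|im_end|>. It assumes the common ChatML convention (role name, newline, content, end marker); the template actually shipped with this tokenizer may differ in details such as default system prompts, and the helper name render_chatml is hypothetical.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in the conventional ChatML layout.

    Assumption: this mirrors the usual ChatML convention, not necessarily
    the exact template bundled with the tokenizer.
    """
    parts = []
    for message in messages:
        # Each turn is: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

In practice, tokenizer.apply_chat_template from Transformers should be preferred when the repository ships a chat template, since it uses the template stored alongside the tokenizer.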

Special / Control Tokens

The vocabulary includes the following structural tokens to support reasoning and agentic behavior:

<think> / </think> (For Chain-of-Thought reasoning prior to output)
<call> / </call> (For function/API tool calling)
<|im_start|> / <|im_end|> (ChatML role structures)
<|image_pad|> (For Vision-Language-Model projection mapping)
<|endoftext|> (Standard EOS)
<|pad|> (Standard PAD)
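
As an illustration of how the reasoning and tool-calling tokens listed above might be consumed downstream, the following sketch splits decoded model output into reasoning spans, tool calls, and the remaining answer. The helper split_output and the call payload format in the sample string are hypothetical; this card does not specify how the content between <call> and </call> is structured.

```python
import re
from typing import Dict, List, Union

# Non-greedy matches so that multiple spans in one output are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CALL_RE = re.compile(r"<call>(.*?)</call>", re.DOTALL)


def split_output(text: str) -> Dict[str, Union[List[str], str]]:
    """Separate reasoning spans, tool calls, and the visible answer."""
    thoughts = [span.strip() for span in THINK_RE.findall(text)]
    calls = [span.strip() for span in CALL_RE.findall(text)]
    # Whatever is left after removing both kinds of span is the answer text.
    answer = CALL_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"thoughts": thoughts, "calls": calls, "answer": answer}


# Hypothetical decoded output; the call syntax is an assumption for illustration.
sample = '<think>User wants the weather.</think><call>get_weather(city="Paris")</call>Checking now.'
parsed = split_output(sample)
print(parsed)
```

This operates on decoded text. It assumes the control tokens are kept during decoding, for example by leaving skip_special_tokens at False if they are registered as special tokens.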

Training Data ("The Soup")

The tokenizer was trained from scratch on a curated stream of 1 million samples, mixed in the following proportions to balance subword tokenization across domains:

General Web (30%): HuggingFaceFW/fineweb-edu (300,000 samples)
Code (25%): bigcode/the-stack-smol-xl (250,000 samples)
Math & STEM (20%): open-web-math/open-web-math (200,000 samples)
Medical (15%): medalpaca/medical_meadow_wikidoc (150,000 samples)
Multilingual (10%): CohereForAI/aya_dataset (100,000 samples)
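
The per-source sample counts above follow directly from the percentages applied to the 1,000,000-sample total. A small sketch of that arithmetic is shown below; the function name samples_per_source is hypothetical and this is not the author's actual data-loading code.

```python
from typing import Dict

TOTAL_SAMPLES = 1_000_000

# Mixture percentages as listed in the training data section.
MIX_PERCENT: Dict[str, int] = {
    "HuggingFaceFW/fineweb-edu": 30,
    "bigcode/the-stack-smol-xl": 25,
    "open-web-math/open-web-math": 20,
    "medalpaca/medical_meadow_wikidoc": 15,
    "CohereForAI/aya_dataset": 10,
}


def samples_per_source(mix: Dict[str, int], total: int) -> Dict[str, int]:
    """Convert integer percentages into per-source sample counts."""
    if sum(mix.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total * percent // 100 for name, percent in mix.items()}


counts = samples_per_source(MIX_PERCENT, TOTAL_SAMPLES)
for name, count in counts.items():
    print(f"{name}: {count:,}")
```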

Usage in Transformers

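from transformers import AutoTokenizer

# Load the tokenizer files from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mnnobi/Nexus-Universal-Tokenizer")

# Encode a string containing the reasoning control tokens and print the token IDs.
print(tokenizer.encode("Hello world! <think> This is a test. </think>"))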
