mnnobi
/

Nexus-Universal-Tokenizer

Model card Files Files and versions

Nexus-Universal-Tokenizer / README.md

mnnobi's picture

Update README.md

b7e4ac8 verified 3 months ago

|

history blame contribute delete

1.72 kB

	# Nexus Universal Tokenizer

	This is the foundational Byte-Level BPE (Byte-Pair Encoding) Tokenizer built for the Nexus series of Large Language Models (starting with Nexus2 1B).

	It was designed to be highly efficient, natively supporting Multilingual text, Code, Mathematics, Medical literature, Vision, Tool Calling, and Reasoning via specialized control tokens.

	## Architecture
	- Base Style: Qwen2 / ByteLevel BPE
	- Vocabulary Size: 151,936
	- Chat Template: ChatML (`<\|im_start\|>` / `<\|im_end\|>`)

	## Special / Control Tokens
	This tokenizer natively includes modern structural tokens to allow reasoning and agentic behaviors:
	* `<think>` / `</think>` (For Chain-of-Thought reasoning prior to output)
	* `<call>` / `</call>` (For function/API tool calling)
	* `<\|im_start\|>` / `<\|im_end\|>` (ChatML role structures)
	* `<\|image_pad\|>` (For Vision-Language-Model projection mapping)
	* `<\|endoftext\|>` (Standard EOS)
	* `<\|pad\|>` (Standard PAD)

	## Training Data ("The Soup")
	The tokenizer was trained from scratch on a curated 1-Million sample stream to ensure balanced sub-word tokenization across multiple domains:

	1. General Web (30%): `HuggingFaceFW/fineweb-edu` (300,000 samples)
	2. Code (25%): `bigcode/the-stack-smol-xl` (250,000 samples)
	3. Math & STEM (20%): `open-web-math/open-web-math` (200,000 samples)
	4. Medical (15%): `medalpaca/medical_meadow_wikidoc` (150,000 samples)
	5. Multilingual (10%): `CohereForAI/aya_dataset` (100,000 samples)

	## Usage in Transformers
	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("mnnobi/Nexus-Universal-Tokenizer")
	print(tokenizer.encode("Hello world! <think> This is a test. </think>"))