Proposal: use reserved tokens for agent/tool-use and multimodal control tokens
Proposal: Reserve Special Tokens for Agent, Tool, and Multimodal Workflows
Thank you for publishing openeurollm/tokenizer-256k. The tokenizer already looks strong for multilingual European coverage, and the existing <unused_0> to <unused_99> range seems like a useful opportunity to future-proof it before downstream training formats become fixed.
The current tokenizer already includes several helpful foundations:
<start_of_turn>
<end_of_turn>
<start_of_image>
<end_of_image>
<image_soft_token>
<fim_prefix>
<fim_middle>
<fim_suffix>
<tool_call>
</tool_call>
<unused_0> ... <unused_99>
I would suggest formalizing part of the unused range for modern agent, tool-use, structured-output, and multimodal workflows, without changing the total vocabulary size.
Motivation
Recent model families such as DeepSeek-V4, Gemma 4, and Qwen3.5 appear to be moving toward richer tokenizer-level conventions for:
- role and channel separation
- tool calls and tool responses
- structured JSON output
- browser, search, and RAG workflows
- multimodal placeholders
- grounding and object references
- reasoning or control boundaries
Special tokens do not make a model agentic by themselves. They only become useful if pretraining, SFT data, chat templates, tool traces, and evaluation harnesses use them consistently. But defining them early avoids format churn later and gives downstream training pipelines a stable convention to target.
Related Model Conventions
Several recent open model families already expose comparable conventions:
- DeepSeek-V4 documents an encoding format for multi-turn chat, tool calling, extended thinking, and quick-instruction tasks. Its reference encoder uses role markers, <think>/</think>, a DSML marker, tool-result formatting, and quick-instruction tokens such as query, authority, domain, title, and read-url.
- Gemma 4 advertises native support for agentic workflows and function calling. Its tokenizer configuration includes dedicated tokens for channels, turns, tool calls, tools, tool responses, thinking, images, audio, and video.
- Qwen3.5 exposes multimodal and agent-oriented tokenizer conventions, including <|im_start|>, <|im_end|>, object-reference tokens, box/quad tokens, vision start/end tokens, image/video pads, audio start/end/pad tokens, and XML-style tool-call and tool-response formatting in its chat template.
- OpenEuroLLM tokenizer-256k already reserves 100 unused tokens and includes chat-turn, image, FIM, and tool-call markers. The proposal here is to formalize part of that existing reserved space rather than expand the vocabulary.
Suggested Core Agent Tokens
<|system|>
<|developer|>
<|user|>
<|assistant|>
<|tool|>
<|tool_response|>
<|tool_error|>
<|channel|>
<|thought|>
<|answer|>
<|tools|>
</|tools|>
<|tool_schema|>
<|tool_call|>
</|tool_call|>
<|json|>
</|json|>
These would cover role separation, tool invocation, tool results, structured output, and simple channel/reasoning boundaries.
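To make the intent concrete, here is a minimal sketch of how a chat template might combine the existing turn and tool-call markers with the proposed role tokens. The layout (role token right after <start_of_turn>, JSON payload inside <tool_call>) and the helper names `render_turn`, `render_tool_call`, and `render_conversation` are assumptions for illustration only, not an existing OpenEuroLLM format.

```python
import json

# Roles taken from the proposed core token list; none of these tokens
# exist in the published tokenizer yet (assumption for illustration).
PROPOSED_ROLES = {"system", "developer", "user", "assistant", "tool"}


def render_turn(role, content):
    """Wrap one message in the existing turn markers plus a proposed role token."""
    if role not in PROPOSED_ROLES:
        raise ValueError(f"unknown role: {role}")
    return f"<start_of_turn><|{role}|>\n{content}<end_of_turn>\n"


def render_tool_call(name, arguments):
    """Serialize a tool call as JSON between the existing <tool_call> markers."""
    payload = json.dumps({"name": name, "arguments": arguments}, sort_keys=True)
    return f"<tool_call>{payload}</tool_call>"


def render_conversation(messages):
    """Concatenate all turns into a single prompt string."""
    return "".join(render_turn(m["role"], m["content"]) for m in messages)


messages = [
    {"role": "user", "content": "What is the weather in Oslo?"},
    {"role": "assistant", "content": render_tool_call("get_weather", {"city": "Oslo"})},
    {"role": "tool", "content": '{"temp_c": 4}'},
]
prompt = render_conversation(messages)
print(prompt)
```

Whether the role should be its own token or plain text after <start_of_turn> is exactly the kind of decision that is cheaper to settle before training data is produced.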
Optional Search / RAG Tokens
<|search_query|>
<|read_url|>
<|title|>
<|domain|>
<|authority|>
<|citation|>
These may be useful for browser agents, retrieval-augmented generation, citation-aware generation, source selection, and query rewriting.
Optional Multimodal / Grounding Tokens
<|vision_start|>
<|vision_end|>
<|image_pad|>
<|video_pad|>
<|audio_start|>
<|audio_end|>
<|audio_pad|>
<|box_start|>
<|box_end|>
<|object_ref_start|>
<|object_ref_end|>
These would align the tokenizer with the direction of current multimodal-agent models, where image, video, audio, bounding boxes, and object references often need explicit boundaries.
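As a rough illustration of what explicit boundaries buy, the sketch below serializes one grounded object using the proposed reference and box tokens. The coordinate layout `(x1,y1),(x2,y2)` and the helper name `format_grounded_object` are assumptions loosely modeled on conventions seen in other vision-language models; the proposal itself does not fix a format.

```python
def format_grounded_object(label, box):
    """Render a label and its bounding box with the proposed grounding tokens.

    `box` is (x1, y1, x2, y2) in whatever coordinate system the training
    data uses; the textual layout here is a hypothetical choice.
    """
    x1, y1, x2, y2 = box
    if not (x1 < x2 and y1 < y2):
        raise ValueError("box must satisfy x1 < x2 and y1 < y2")
    return (
        f"<|object_ref_start|>{label}<|object_ref_end|>"
        f"<|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>"
    )


grounded = format_grounded_object("dog", (10, 20, 110, 220))
print(grounded)
```

With dedicated tokens, a parser can recover the label and box by splitting on single token IDs instead of relying on fragile string patterns.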
Possible Conservative Mapping
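One possible approach would be to assign the first 20 reserved names to the core agent tokens and the search/RAG tokens, leaving <unused_20> through <unused_99> available for the remaining tokens (such as <|citation|> and the multimodal set) or for future use: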
<unused_0> -> <|system|>
<unused_1> -> <|developer|>
<unused_2> -> <|user|>
<unused_3> -> <|assistant|>
<unused_4> -> <|tool|>
<unused_5> -> <|tool_response|>
<unused_6> -> <|tool_error|>
<unused_7> -> <|channel|>
<unused_8> -> <|thought|>
<unused_9> -> <|answer|>
<unused_10> -> <|tools|>
<unused_11> -> </|tools|>
<unused_12> -> <|tool_schema|>
<unused_13> -> <|json|>
<unused_14> -> </|json|>
<unused_15> -> <|search_query|>
<unused_16> -> <|read_url|>
<unused_17> -> <|title|>
<unused_18> -> <|domain|>
<unused_19> -> <|authority|>
The mapping above omits <|tool_call|> and </|tool_call|> because the existing <tool_call> and </tool_call> tokens should probably remain for compatibility. If a consistent namespace is preferred, <|tool_call|> and </|tool_call|> could either be added using reserved IDs or handled as aliases in downstream templates.
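The renaming itself is mechanical. The sketch below applies the mapping to a toy token-to-ID dictionary and checks that IDs and vocabulary size are untouched. The toy IDs are made up, and `build_mapping` and `rename_reserved` are hypothetical helpers; a real change would edit the tokenizer files themselves, whose layout is not assumed here.

```python
# Proposed names in the same order as the mapping above.
PROPOSED = [
    "<|system|>", "<|developer|>", "<|user|>", "<|assistant|>",
    "<|tool|>", "<|tool_response|>", "<|tool_error|>", "<|channel|>",
    "<|thought|>", "<|answer|>", "<|tools|>", "</|tools|>",
    "<|tool_schema|>", "<|json|>", "</|json|>", "<|search_query|>",
    "<|read_url|>", "<|title|>", "<|domain|>", "<|authority|>",
]


def build_mapping(names):
    """Pair <unused_i> with the i-th proposed name."""
    return {f"<unused_{i}>": name for i, name in enumerate(names)}


def rename_reserved(vocab, mapping):
    """Return a new vocab with reserved names replaced and IDs unchanged."""
    missing = [old for old in mapping if old not in vocab]
    if missing:
        raise KeyError(f"reserved tokens not found: {missing}")
    clashes = [new for new in mapping.values() if new in vocab]
    if clashes:
        raise ValueError(f"proposed names already in vocab: {clashes}")
    return {mapping.get(token, token): idx for token, idx in vocab.items()}


# Toy vocabulary with invented IDs (assumption): two existing markers
# followed by the 100 reserved tokens.
toy_vocab = {"<tool_call>": 0, "</tool_call>": 1}
toy_vocab.update({f"<unused_{i}>": 2 + i for i in range(100)})

mapping = build_mapping(PROPOSED)
new_vocab = rename_reserved(toy_vocab, mapping)

print(len(toy_vocab), len(new_vocab))
print(new_vocab["<|system|>"], new_vocab["<|authority|>"])
```

Because only the surface strings change, embedding rows and IDs stay aligned, which is what keeps the change safe for any checkpoint that has not yet trained on the reserved tokens.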
Recommendation
Keep the multilingual vocabulary and the total vocabulary size unchanged, but assign a small, stable subset of the reserved tokens to agent, tool, structured-output, search/RAG, and multimodal formats before large-scale training data depends on the current unused-token names.
This would preserve the tokenizer's current strengths while making it easier for OpenEuroLLM models to support modern agentic workflows in a standardized way.
References
- OpenEuroLLM tokenizer-256k model card and special-token table: https://huggingface.co/openeurollm/tokenizer-256k
- OpenEuroLLM tokenizer configuration: https://huggingface.co/openeurollm/tokenizer-256k/blob/main/tokenizer_config.json
- DeepSeek-V4 technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- DeepSeek-V4 encoding documentation: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/README.md
- DeepSeek-V4 reference encoder: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py
- Gemma 4 product page, including agentic workflow and function-calling positioning: https://deepmind.google/models/gemma/gemma-4/
- Gemma 4 tokenizer configuration: https://huggingface.co/google/gemma-4-31B-it/blob/main/tokenizer_config.json
- Qwen3.5 announcement: https://qwen.ai/blog?id=qwen3.5
- Qwen3.5 tokenizer configuration example: https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8/blob/0cbad850b99e46399f2b600e120e69ebe3dcb499/tokenizer_config.json