Proposal: use reserved tokens for agent/tool-use and multimodal control tokens

by birgermoell - opened Apr 27

Proposal: Reserve Special Tokens for Agent, Tool, and Multimodal Workflows

Thank you for publishing openeurollm/tokenizer-256k. The tokenizer already looks strong for multilingual European coverage, and the existing <unused_0> to <unused_99> range offers a way to future-proof it before downstream training formats become fixed.

The current tokenizer already includes several helpful foundations:

<start_of_turn>
<end_of_turn>
<start_of_image>
<end_of_image>
<image_soft_token>
<fim_prefix>
<fim_middle>
<fim_suffix>
<tool_call>
</tool_call>
<unused_0> ... <unused_99>

I would suggest formalizing part of the unused range for modern agent, tool-use, structured-output, and multimodal workflows, without changing the total vocabulary size.

Motivation

Recent model families such as DeepSeek-V4, Gemma 4, and Qwen3.5 appear to be moving toward richer tokenizer-level conventions for:

role and channel separation
tool calls and tool responses
structured JSON output
browser, search, and RAG workflows
multimodal placeholders
grounding and object references
reasoning or control boundaries

Special tokens do not make a model agentic by themselves. They only become useful if pretraining, SFT data, chat templates, tool traces, and evaluation harnesses use them consistently. But defining them early avoids format churn later and gives downstream training pipelines a stable convention to target.

Related Model Conventions

Several recent open model families already expose comparable conventions:

DeepSeek-V4 documents an encoding format for multi-turn chat, tool calling, extended thinking, and quick-instruction tasks. Its reference encoder uses role markers, <think> / </think>, a DSML marker, tool-result formatting, and quick-instruction tokens such as query, authority, domain, title, and read-url.
Gemma 4 advertises native support for agentic workflows and function calling. Its tokenizer configuration includes dedicated tokens for channels, turns, tool calls, tools, tool responses, thinking, images, audio, and video.
Qwen3.5 exposes multimodal and agent-oriented tokenizer conventions, including <|im_start|>, <|im_end|>, object-reference tokens, box/quad tokens, vision start/end tokens, image/video pads, audio start/end/pad tokens, and XML-style tool-call / tool-response formatting in its chat template.
OpenEuroLLM tokenizer-256k already reserves 100 unused tokens and includes chat-turn, image, FIM, and tool-call markers. The proposal here is to formalize part of that existing reserved space rather than expand the vocabulary.

Suggested Core Agent Tokens

<|system|>
<|developer|>
<|user|>
<|assistant|>
<|tool|>
<|tool_response|>
<|tool_error|>

<|channel|>
<|thought|>
<|answer|>

<|tools|>
</|tools|>
<|tool_schema|>
<|tool_call|>
</|tool_call|>

<|json|>
</|json|>

These would cover role separation, tool invocation, tool results, structured output, and simple channel/reasoning boundaries.
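
As a rough illustration of how these tokens might frame a conversation, the sketch below renders a message list into a single string. The layout (role token, newline, content, then the existing <end_of_turn>), the message dict shape, and the helper name render_conversation are all assumptions made for this example; none of them is a format defined by OpenEuroLLM or by the models cited above.

```python
import json

# Hypothetical role markers taken from the proposed token list.
ROLE_TOKENS = {
    "system": "<|system|>",
    "developer": "<|developer|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
    "tool": "<|tool_response|>",
}
END_OF_TURN = "<end_of_turn>"  # already present in tokenizer-256k


def render_conversation(messages):
    """Render a list of message dicts into one string (assumed layout)."""
    parts = []
    for msg in messages:
        role_token = ROLE_TOKENS[msg["role"]]
        if "tool_call" in msg:
            # Serialize the call as JSON between the proposed tool-call markers.
            payload = json.dumps(msg["tool_call"], sort_keys=True)
            body = "<|tool_call|>" + payload + "</|tool_call|>"
        else:
            body = msg["content"]
        parts.append(f"{role_token}\n{body}{END_OF_TURN}\n")
    return "".join(parts)


example = render_conversation([
    {"role": "user", "content": "What is the weather in Oslo?"},
    {"role": "assistant",
     "tool_call": {"name": "get_weather", "arguments": {"city": "Oslo"}}},
    {"role": "tool", "content": '{"temp_c": 4}'},
])
print(example)
```

Because every boundary is a single reserved token, a parser on the training or evaluation side can split turns without guessing at plain-text delimiters.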

Optional Search / RAG Tokens

<|search_query|>
<|read_url|>
<|title|>
<|domain|>
<|authority|>
<|citation|>

These may be useful for browser agents, retrieval-augmented generation, citation-aware generation, source selection, and query rewriting.

Optional Multimodal / Grounding Tokens

<|vision_start|>
<|vision_end|>
<|image_pad|>
<|video_pad|>
<|audio_start|>
<|audio_end|>
<|audio_pad|>
<|box_start|>
<|box_end|>
<|object_ref_start|>
<|object_ref_end|>

These would align the tokenizer with the direction of current multimodal-agent models, where image, video, audio, bounding boxes, and object references often need explicit boundaries.
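
One pattern seen in multimodal pipelines is to write a single placeholder per image in the prompt and expand it into as many pad tokens as the vision encoder produces embeddings. The sketch below shows that expansion for images only. The function name expand_image_placeholders, the wrapping in <|vision_start|> / <|vision_end|>, and the per-image counts are assumptions for illustration, not behaviour of any existing processor.

```python
def expand_image_placeholders(text, tokens_per_image):
    """Replace each <|image_pad|> with a bounded run of pad tokens.

    tokens_per_image holds one count per image, in prompt order. In a real
    pipeline the counts would come from the vision encoder; the values used
    here are illustrative.
    """
    pad = "<|image_pad|>"
    pieces = text.split(pad)
    if len(pieces) - 1 != len(tokens_per_image):
        raise ValueError("number of placeholders and images differ")
    out = [pieces[0]]
    for count, tail in zip(tokens_per_image, pieces[1:]):
        # Explicit start/end markers keep the pad run unambiguous.
        out.append("<|vision_start|>" + pad * count + "<|vision_end|>")
        out.append(tail)
    return "".join(out)


expanded = expand_image_placeholders("Describe <|image_pad|> briefly.", [3])
print(expanded)
```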

Possible Conservative Mapping

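One possible approach would be to assign the first 20 reserved IDs to the highest-value agent and search tokens, leaving <unused_20> to <unused_99> free for the remaining tokens (such as <|citation|> and the multimodal set) or for later additions: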

<unused_0>  -> <|system|>
<unused_1>  -> <|developer|>
<unused_2>  -> <|user|>
<unused_3>  -> <|assistant|>
<unused_4>  -> <|tool|>
<unused_5>  -> <|tool_response|>
<unused_6>  -> <|tool_error|>
<unused_7>  -> <|channel|>
<unused_8>  -> <|thought|>
<unused_9>  -> <|answer|>
<unused_10> -> <|tools|>
<unused_11> -> </|tools|>
<unused_12> -> <|tool_schema|>
<unused_13> -> <|json|>
<unused_14> -> </|json|>
<unused_15> -> <|search_query|>
<unused_16> -> <|read_url|>
<unused_17> -> <|title|>
<unused_18> -> <|domain|>
<unused_19> -> <|authority|>
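
To make the "no vocabulary growth" property concrete, here is a minimal sketch of the renaming applied to a toy vocabulary. The helper rename_reserved and the toy IDs are hypothetical; the real IDs live in the tokenizer files, and an actual change would also need to update the added-tokens metadata, which this sketch does not attempt.

```python
# Proposed names in reserved-slot order: index i replaces <unused_i>.
PROPOSED = [
    "<|system|>", "<|developer|>", "<|user|>", "<|assistant|>",
    "<|tool|>", "<|tool_response|>", "<|tool_error|>",
    "<|channel|>", "<|thought|>", "<|answer|>",
    "<|tools|>", "</|tools|>", "<|tool_schema|>",
    "<|json|>", "</|json|>",
    "<|search_query|>", "<|read_url|>", "<|title|>", "<|domain|>",
    "<|authority|>",
]


def rename_reserved(vocab, proposed=PROPOSED):
    """Return a copy of vocab with <unused_i> renamed; IDs are unchanged."""
    renames = {f"<unused_{i}>": name for i, name in enumerate(proposed)}
    missing = [old for old in renames if old not in vocab]
    if missing:
        raise KeyError(f"reserved tokens not found: {missing}")
    clashes = [new for new in renames.values() if new in vocab]
    if clashes:
        raise ValueError(f"names already in vocabulary: {clashes}")
    return {renames.get(token, token): idx for token, idx in vocab.items()}


# Toy vocabulary with made-up IDs, standing in for the real tokenizer.
toy_vocab = {f"<unused_{i}>": 100 + i for i in range(100)}
toy_vocab["<tool_call>"] = 50
new_vocab = rename_reserved(toy_vocab)

print(len(toy_vocab), len(new_vocab))
```

Since only the surface strings change, embedding rows and vocabulary size stay as they are; the two checks guard against renaming a slot that does not exist or colliding with a token that already does.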

The existing <tool_call> and </tool_call> tokens should probably remain for compatibility. If a consistent namespace is preferred, <|tool_call|> and </|tool_call|> could either be added using reserved IDs or handled as aliases in downstream templates.
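
For the alias option, a downstream template could normalize the namespaced spelling to the existing tokens before tokenization. A minimal sketch, assuming plain string replacement is acceptable at that stage (the ALIASES table and apply_aliases helper are hypothetical names):

```python
# Namespaced spellings mapped to tokens the tokenizer already defines.
ALIASES = {
    "<|tool_call|>": "<tool_call>",
    "</|tool_call|>": "</tool_call>",
}


def apply_aliases(text, aliases=ALIASES):
    """Rewrite alias spellings to their canonical tokens."""
    for alias, canonical in aliases.items():
        text = text.replace(alias, canonical)
    return text


normalized = apply_aliases('<|tool_call|>{"name": "f"}</|tool_call|>')
print(normalized)
```

This keeps existing <tool_call> data valid and spends no reserved IDs, at the cost of one extra preprocessing step that every pipeline has to apply consistently.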

Recommendation

Keep the multilingual tokenizer unchanged, but assign a small, stable subset of the reserved tokens to agent, tool, structured-output, search/RAG, and multimodal formats before large-scale training data depends on the current unused-token names.

This would preserve the tokenizer's current strengths while making it easier for OpenEuroLLM models to support modern agentic workflows in a standardized way.

References

OpenEuroLLM tokenizer-256k model card and special-token table: https://huggingface.co/openeurollm/tokenizer-256k
OpenEuroLLM tokenizer configuration: https://huggingface.co/openeurollm/tokenizer-256k/blob/main/tokenizer_config.json
DeepSeek-V4 technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
DeepSeek-V4 encoding documentation: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/README.md
DeepSeek-V4 reference encoder: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py
Gemma 4 product page, including agentic workflow and function-calling positioning: https://deepmind.google/models/gemma/gemma-4/
Gemma 4 tokenizer configuration: https://huggingface.co/google/gemma-4-31B-it/blob/main/tokenizer_config.json
Qwen3.5 announcement: https://qwen.ai/blog?id=qwen3.5
Qwen3.5 tokenizer configuration example: https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8/blob/0cbad850b99e46399f2b600e120e69ebe3dcb499/tokenizer_config.json
