---
library_name: transformers
tags: []
---

# Dolma 2 tokenizer, Instruct v5, Non-reasoner version

Slightly modified version of `cl100k_base` that supports Dolma 1.x and Dolma 2.x special tokens.

## Special tokens

This tokenizer supports the following special tokens:

- `<|extra_id_0|>`: Not used.
- `<|endoftext|>`: Used to mark both beginning and end of text.
- `<|fim_prefix|>`: Marks the prefix of a fill-in-the-middle request.
- `<|fim_middle|>`: Marks the middle of a fill-in-the-middle request.
- `<|fim_suffix|>`: Marks the suffix of a fill-in-the-middle request.
- `|||PHONE_NUMBER|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||EMAIL_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||IP_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `<|im_start|>`: Indicates the beginning of a message (turn in a conversation).
- `<|im_end|>`: Indicates the end of a message (turn in a conversation).
- `<functions>`: Indicates the start of function definitions in the system prompt for tool use.
- `</functions>`: Indicates the end of function definitions in the system prompt.
- `<function_calls>`: Indicates the start of function calls made by the model.
- `</function_calls>`: Indicates the end of function calls made by the model.
- `<|extra_id_1|>`: Not used.
- `<|extra_id_2|>`: Not used.
- `<|extra_id_3|>`: Not used.
- `<|extra_id_4|>`: Not used.
- `<|extra_id_5|>`: Not used.
- `<|extra_id_6|>`: Not used.
- `<|extra_id_7|>`: Not used.
- `<|extra_id_8|>`: Not used.
- `<|extra_id_9|>`: Not used.
- `<|extra_id_10|>`: Not used.
- `<|endofprompt|>`: Not used.
- `<|pad|>`: Symbol to pad input sequences.

## Chat template

The chat template is as follows (**for reference only**; the actual template is in `tokenizer_config.json`):

```jinja
{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
    {{ '<|im_start|>system\nYou are Olmo, a helpful function-calling AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai. You do not currently have access to any functions. <functions></functions><|im_end|>\n' }}
{% endif %}
{% for message in messages %}
    {% if message['role'] == 'system' %}
        {{ '<|im_start|>system\n' + message['content'] }}
        {% if message.get('functions', none) is not none %}
            {{ ' <functions>' + message['functions'] + '</functions><|im_end|>\n' }}
        {% else %}
            {{ ' You do not currently have access to any functions. <functions></functions><|im_end|>\n' }}
        {% endif %}
    {% elif message['role'] == 'user' %}
        {% if message.get('functions', none) is not none %}
            {{ '<|im_start|>user\n' + message['content'] + '\n' + '<functions>' + message['functions'] + '</functions><|im_end|>\n' }}
        {% else %}
            {{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}
        {% endif %}
    {% elif message['role'] == 'assistant' %}
        {{ '<|im_start|>assistant\n' }}
        {% if message.get('content', none) is not none %}
            {{ message['content'] }}
        {% endif %}
        {% if message.get('function_calls', none) is not none %}
            {{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
        {% endif %}
        {% if not loop.last %}
            {{ '<|im_end|>' + '\n' }}
        {% else %}
            {{ eos_token }}
        {% endif %}
    {% elif message['role'] == 'environment' %}
        {{ '<|im_start|>environment\n' + message['content'] + '<|im_end|>\n' }}
    {% endif %}
    {% if loop.last and add_generation_prompt %}
        {{ '<|im_start|>assistant\n' }}
    {% endif %}
{% endfor %}
```
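
To make the control flow of the template easier to follow, here is a rough plain-Python sketch of the same logic. It is not the tokenizer's implementation: `render_chat` is a hypothetical helper written for this README, the `<functions>` / `<function_calls>` delimiter strings and the newline placement after role names and `<|im_end|>` are assumptions, and it ignores any literal whitespace between Jinja blocks. The `eos_token` value is passed in by the caller because the real value lives in the tokenizer configuration.

```python
# Default system prompt, emitted only when the conversation has no system message.
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful function-calling AI assistant built by Ai2. "
    "Your date cutoff is December 2024, and your model weights are available at "
    "https://huggingface.co/allenai."
)
# Suffix used whenever a system turn carries no function definitions.
# The <functions> tag names are an assumption of this sketch.
NO_FUNCTIONS = " You do not currently have access to any functions. <functions></functions>"


def render_chat(messages, eos_token, add_generation_prompt=False):
    """Hypothetical re-implementation of the reference chat template."""
    out = []
    if not any(m["role"] == "system" for m in messages):
        out.append("<|im_start|>system\n" + DEFAULT_SYSTEM + NO_FUNCTIONS + "<|im_end|>\n")

    last = len(messages) - 1
    for i, m in enumerate(messages):
        role = m["role"]
        functions = m.get("functions")
        if role == "system":
            out.append("<|im_start|>system\n" + m["content"])
            if functions is not None:
                out.append(" <functions>" + functions + "</functions><|im_end|>\n")
            else:
                out.append(NO_FUNCTIONS + "<|im_end|>\n")
        elif role == "user":
            out.append("<|im_start|>user\n" + m["content"])
            if functions is not None:
                out.append("\n<functions>" + functions + "</functions>")
            out.append("<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                out.append(m["content"])
            if m.get("function_calls") is not None:
                out.append("<function_calls>" + m["function_calls"] + "</function_calls>")
            # A final assistant turn is closed with the EOS token, not <|im_end|>.
            out.append(eos_token if i == last else "<|im_end|>\n")
        elif role == "environment":
            out.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        # Open a new assistant turn for generation after the last message.
        if i == last and add_generation_prompt:
            out.append("<|im_start|>assistant\n")
    return "".join(out)


if __name__ == "__main__":
    # "<|endoftext|>" as the EOS value is an assumption for this demo.
    demo = [{"role": "user", "content": "Hi"}]
    print(render_chat(demo, eos_token="<|endoftext|>", add_generation_prompt=True))
```

For real use, prefer the template shipped in `tokenizer_config.json`; the sketch is only meant to show which branch handles each role and where the turn delimiters go.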