---
library_name: transformers
tags: []
---

# Dolma 2 tokenizer, Instruct v2, reasoner version

Slightly modified version of `cl100k_base` that supports Dolma 1.x and Dolma 2.x special tokens.

## Differences from v4

Adds the following special tokens (defined in the next section):

- `<functions>`
- `</functions>`
- `<function_calls>`
- `</function_calls>`

## Special tokens

This tokenizer supports the following special tokens:

- `<|extra_id_0|>`: Not used.
- `<|endoftext|>`: Used to mark both the beginning and the end of text.
- `<|fim_prefix|>`: Used to mark the prefix of a fill-in-the-middle request.
- `<|fim_middle|>`: Used to mark the middle of a fill-in-the-middle request.
- `<|fim_suffix|>`: Used to mark the suffix of a fill-in-the-middle request.
- `|||PHONE_NUMBER|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||EMAIL_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||IP_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `<|im_start|>`: Indicates the beginning of a message (turn in a conversation).
- `<|im_end|>`: Indicates the end of a message (turn in a conversation).
- `<functions>`: Indicates the beginning of function definitions in the system prompt.
- `</functions>`: Indicates the end of function definitions in the system prompt.
- `<function_calls>`: Indicates the beginning of function calls made by the assistant.
- `</function_calls>`: Indicates the end of function calls made by the assistant.
- `<|extra_id_1|>`: Not used.
- `<|extra_id_2|>`: Not used.
- `<|extra_id_3|>`: Not used.
- `<|extra_id_4|>`: Not used.
- `<|extra_id_5|>`: Not used.
- `<|extra_id_6|>`: Not used.
- `<|endofprompt|>`: Not used.
- `<|pad|>`: Used to pad input sequences.

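To illustrate why these strings matter, here is a minimal sketch of how special tokens are typically isolated from ordinary text before the remaining text is tokenized. It assumes the common approach of matching each special token as a whole string; the helper name `split_on_special_tokens` is hypothetical, and this is not the tokenizer's actual implementation.

```python
import re

# Special tokens as listed above.
SPECIAL_TOKENS = [
    "<|extra_id_0|>", "<|endoftext|>", "<|fim_prefix|>", "<|fim_middle|>",
    "<|fim_suffix|>", "|||PHONE_NUMBER|||", "|||EMAIL_ADDRESS|||",
    "|||IP_ADDRESS|||", "<|im_start|>", "<|im_end|>", "<functions>",
    "</functions>", "<function_calls>", "</function_calls>",
    "<|extra_id_1|>", "<|extra_id_2|>", "<|extra_id_3|>", "<|extra_id_4|>",
    "<|extra_id_5|>", "<|extra_id_6|>", "<|endofprompt|>", "<|pad|>",
]

# Longest tokens first, so a longer token always wins over a shorter one.
_PATTERN = re.compile(
    "("
    + "|".join(re.escape(t) for t in sorted(SPECIAL_TOKENS, key=len, reverse=True))
    + ")"
)


def split_on_special_tokens(text):
    """Hypothetical helper: split text into (piece, is_special) pairs."""
    pieces = []
    for part in _PATTERN.split(text):
        if part:  # re.split yields empty strings around adjacent matches
            pieces.append((part, part in SPECIAL_TOKENS))
    return pieces
```

In a real tokenizer, each piece flagged as special would map to a single reserved ID, while the other pieces would go through ordinary BPE.
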
## Chat template

The chat template is as follows (**for reference only**; the actual template is in `tokenizer_config.json`):

```jinja
{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
{{ '<|im_start|>system
You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai.<|im_end|>
' }}
{% endif %}
{% for message in messages %}
{% if message['role'] == 'system' %}
{{ '<|im_start|>system
' + message['content'] }}
{% if message.get('functions', none) is not none %}
{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'user' %}
{% if message.get('functions', none) is not none %}
{{ '<|im_start|>user
' + message['content'] + '
' + '<functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ '<|im_start|>user
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'assistant' %}
{{ '<|im_start|>assistant
' }}
{% if message.get('content', none) is not none %}
{{ message['content'] }}
{% endif %}
{% if message.get('function_calls', none) is not none %}
{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
{% endif %}
{% if not loop.last %}
{{ '<|im_end|>' + '
' }}
{% else %}
{{ eos_token }}
{% endif %}
{% elif message['role'] == 'environment' %}
{{ '<|im_start|>environment
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|im_start|>assistant
<think>' }}
{% endif %}
{% endfor %}
```

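The branching in the template can be easier to follow as ordinary code. The following is a plain-Python sketch that mirrors the template's logic, assuming the line breaks between template tags are layout only and do not appear in the output. The function name `render_chat` and the constant `DEFAULT_SYSTEM` are hypothetical, and `eos_token` must be supplied by the caller because its value is not stated here.

```python
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is "
    "December 2024, and your model weights are available at "
    "https://huggingface.co/allenai."
)


def render_chat(messages, eos_token, add_generation_prompt=False):
    """Hypothetical re-implementation of the template logic, for illustration."""
    out = []
    # A default system turn is emitted only when no system message is present.
    if not any(m["role"] == "system" for m in messages):
        out.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        role = m["role"]
        functions = m.get("functions")
        if role == "system":
            out.append("<|im_start|>system\n" + m["content"])
            if functions is not None:
                out.append(" <functions>" + functions + "</functions><|im_end|>\n")
            else:
                out.append(
                    " You do not currently have access to any functions."
                    " <functions></functions><|im_end|>\n"
                )
        elif role == "user":
            out.append("<|im_start|>user\n" + m["content"])
            if functions is not None:
                out.append("\n<functions>" + functions + "</functions>")
            out.append("<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                out.append(m["content"])
            if m.get("function_calls") is not None:
                out.append(
                    "<function_calls>" + m["function_calls"] + "</function_calls>"
                )
            # A final assistant turn ends with eos_token instead of <|im_end|>.
            out.append(eos_token if last else "<|im_end|>\n")
        elif role == "environment":
            out.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        # The generation prompt opens an assistant turn that starts with <think>.
        if last and add_generation_prompt:
            out.append("<|im_start|>assistant\n<think>")
    return "".join(out)
```

For actual use, the rendering produced by the loaded tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is authoritative; this sketch is only meant to make the control flow readable.
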