---
library_name: transformers
tags: []
---
# Dolma 2 tokenizer, Instruct v5, Non-reasoner version
A slightly modified version of `cl100k_base` that adds support for the Dolma 1.x and Dolma 2.x special tokens.
## Special tokens
This tokenizer supports the following special tokens:
- `<|extra_id_0|>`: Not used.
- `<|endoftext|>`: Used to mark both beginning and end of text.
- `<|fim_prefix|>`: Used to mark the prefix in a fill-in-the-middle request.
- `<|fim_middle|>`: Used to mark the middle in a fill-in-the-middle request.
- `<|fim_suffix|>`: Used to mark the suffix in a fill-in-the-middle request.
- `|||PHONE_NUMBER|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||EMAIL_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||IP_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `<|im_start|>`: Indicates the beginning of a message (turn in a conversation).
- `<|im_end|>`: Indicates the end of a message (turn in a conversation).
- `<functions>`: Indicates the start of function definitions in the system prompt for tool use.
- `</functions>`: Indicates the end of function definitions in the system prompt.
- `<function_calls>`: Indicates the start of function calls made by the model.
- `</function_calls>`: Indicates the end of function calls made by the model.
- `<|extra_id_1|>`: Not used.
- `<|extra_id_2|>`: Not used.
- `<|extra_id_3|>`: Not used.
- `<|extra_id_4|>`: Not used.
- `<|extra_id_5|>`: Not used.
- `<|extra_id_6|>`: Not used.
- `<|extra_id_7|>`: Not used.
- `<|extra_id_8|>`: Not used.
- `<|extra_id_9|>`: Not used.
- `<|extra_id_10|>`: Not used.
- `<|endofprompt|>`: Not used.
- `<|pad|>`: Symbol to pad input sequences.
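Special tokens are meant to be treated as indivisible units rather than being split into sub-word pieces. The sketch below illustrates that idea in plain Python by separating the special tokens listed above from the surrounding text. It is not the tokenizer's actual implementation; `SPECIAL_TOKENS` and `split_on_special` are hypothetical names introduced for this example.

```python
import re

# Special tokens copied from the list above.
SPECIAL_TOKENS = [
    "<|endoftext|>", "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "|||PHONE_NUMBER|||", "|||EMAIL_ADDRESS|||", "|||IP_ADDRESS|||",
    "<|im_start|>", "<|im_end|>",
    "<functions>", "</functions>", "<function_calls>", "</function_calls>",
    "<|endofprompt|>", "<|pad|>",
] + [f"<|extra_id_{i}|>" for i in range(11)]

# Longest tokens first, so a longer token is never shadowed by a shorter one.
_SPECIAL_RE = re.compile(
    "(" + "|".join(re.escape(t) for t in sorted(SPECIAL_TOKENS, key=len, reverse=True)) + ")"
)


def split_on_special(text):
    """Split text into ordinary spans and special tokens (hypothetical helper)."""
    # re.split keeps the captured special tokens; drop empty strings between matches.
    return [piece for piece in _SPECIAL_RE.split(text) if piece]


pieces = split_on_special("<|im_start|>user\nHi<|im_end|>\n")
print(pieces)
```

In a real pipeline, the ordinary spans would then go through the byte-pair encoder, while each special token maps directly to its reserved ID.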
## Chat template
The chat template is as follows (**for reference only**; the actual template is in `tokenizer_config.json`):
```jinja
{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
{{ '<|im_start|>system
You are Olmo, a helpful function-calling AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai. You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% for message in messages %}
{% if message['role'] == 'system' %}
{{ '<|im_start|>system
' + message['content'] }}
{% if message.get('functions', none) is not none %}
{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'user' %}
{% if message.get('functions', none) is not none %}
{{ '<|im_start|>user
' + message['content'] + '
' + '<functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ '<|im_start|>user
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'assistant' %}
{{ '<|im_start|>assistant
' }}
{% if message.get('content', none) is not none %}
{{ message['content'] }}
{% endif %}
{% if message.get('function_calls', none) is not none %}
{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
{% endif %}
{% if not loop.last %}
{{ '<|im_end|>' + '
' }}
{% else %}
{{ eos_token }}
{% endif %}
{% elif message['role'] == 'environment' %}
{{ '<|im_start|>environment
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|im_start|>assistant
' }}
{% endif %}
{% endfor %}
```
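To make the template's branching easier to follow, here is a minimal pure-Python sketch of the same logic. It rests on several assumptions: the line breaks between Jinja tags in the reference listing are taken to be formatting only and are not emitted, `<|endoftext|>` is assumed as the default `eos_token`, and `render_chat` is a hypothetical helper rather than part of any library. For real use, rely on the template shipped in `tokenizer_config.json`.

```python
# Default system prompt, copied from the reference template above.
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful function-calling AI assistant built by Ai2. "
    "Your date cutoff is December 2024, and your model weights are available at "
    "https://huggingface.co/allenai. You do not currently have access to any functions. "
    "<functions></functions>"
)
NO_FUNCTIONS = " You do not currently have access to any functions. <functions></functions>"


def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Render messages following the reference template (hypothetical helper)."""
    parts = []
    # Insert the default system prompt when the conversation has no system message.
    if not any(m["role"] == "system" for m in messages):
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        role = m["role"]
        if role == "system":
            parts.append("<|im_start|>system\n" + m["content"])
            if m.get("functions") is not None:
                parts.append(" <functions>" + m["functions"] + "</functions>")
            else:
                parts.append(NO_FUNCTIONS)
            parts.append("<|im_end|>\n")
        elif role == "user":
            parts.append("<|im_start|>user\n" + m["content"])
            if m.get("functions") is not None:
                parts.append("\n<functions>" + m["functions"] + "</functions>")
            parts.append("<|im_end|>\n")
        elif role == "assistant":
            parts.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                parts.append(m["content"])
            if m.get("function_calls") is not None:
                parts.append("<function_calls>" + m["function_calls"] + "</function_calls>")
            # The final assistant turn ends with the EOS token instead of <|im_end|>.
            parts.append(eos_token if last else "<|im_end|>\n")
        elif role == "environment":
            parts.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        if last and add_generation_prompt:
            parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "Hi"}], add_generation_prompt=True)
print(prompt)
```

With a single user message and `add_generation_prompt=True`, the result starts with the default system turn and ends with an open `<|im_start|>assistant` turn for the model to complete.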