---
library_name: transformers
tags: []
---

# Dolma 2 tokenizer, Instruct v5, Non-reasoner version

Slightly modified version of `cl100k_base` that supports Dolma 1.x and Dolma 2.x special tokens.

## Special tokens

This tokenizer supports the following special tokens:

- `<|extra_id_0|>`: Not used.
- `<|endoftext|>`: Used to mark both beginning and end of text.
- `<|fim_prefix|>`: Marks the prefix of a fill-in-the-middle request.
- `<|fim_middle|>`: Marks the middle of a fill-in-the-middle request.
- `<|fim_suffix|>`: Marks the suffix of a fill-in-the-middle request.
- `|||PHONE_NUMBER|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||EMAIL_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||IP_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `<|im_start|>`: Indicates the beginning of a message (turn in a conversation).
- `<|im_end|>`: Indicates the end of a message (turn in a conversation).
- `<functions>`: Indicates the start of function definitions in the system prompt for tool use.
- `</functions>`: Indicates the end of function definitions in the system prompt.
- `<function_calls>`: Indicates the start of function calls made by the model.
- `</function_calls>`: Indicates the end of function calls made by the model.
- `<|extra_id_1|>`: Not used.
- `<|extra_id_2|>`: Not used.
- `<|extra_id_3|>`: Not used.
- `<|extra_id_4|>`: Not used.
- `<|extra_id_5|>`: Not used.
- `<|extra_id_6|>`: Not used.
- `<|extra_id_7|>`: Not used.
- `<|extra_id_8|>`: Not used.
- `<|extra_id_9|>`: Not used.
- `<|extra_id_10|>`: Not used.
- `<|endofprompt|>`: Not used.
- `<|pad|>`: Symbol to pad input sequences.

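As an illustration of what it means for these strings to be special tokens, the following sketch splits a piece of text into special-token segments and ordinary text using only the standard library. It is not how the tokenizer itself is implemented; the helper name `split_on_special_tokens` and the `SPECIAL_TOKENS` constant are hypothetical, and only the token strings are taken from the list above.

```python
import re

# Token strings copied from the list above; the constant name is illustrative.
SPECIAL_TOKENS = [
    "<|endoftext|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "|||PHONE_NUMBER|||", "|||EMAIL_ADDRESS|||", "|||IP_ADDRESS|||",
    "<|im_start|>", "<|im_end|>",
    "<functions>", "</functions>",
    "<function_calls>", "</function_calls>",
    "<|endofprompt|>", "<|pad|>",
] + [f"<|extra_id_{i}|>" for i in range(11)]

# Longest tokens first so a shorter token can never shadow a longer one.
_SPECIAL_RE = re.compile(
    "("
    + "|".join(re.escape(t) for t in sorted(SPECIAL_TOKENS, key=len, reverse=True))
    + ")"
)


def split_on_special_tokens(text):
    """Hypothetical helper: return (segment, is_special) pairs in order.

    Empty segments produced by adjacent special tokens are dropped.
    """
    parts = _SPECIAL_RE.split(text)
    return [(part, part in SPECIAL_TOKENS) for part in parts if part]


if __name__ == "__main__":
    for segment, is_special in split_on_special_tokens("<|im_start|>user\nHi<|im_end|>"):
        print(repr(segment), is_special)
```

A real tokenizer maps each special segment to a single reserved ID and applies byte-pair encoding only to the ordinary segments; this sketch stops at the segmentation step.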
## Chat template

The chat template is as follows (**for reference only**; the actual template is in `tokenizer_config.json`):

```jinja
{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
{{ '<|im_start|>system
You are Olmo, a helpful function-calling AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai. You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% for message in messages %}
{% if message['role'] == 'system' %}
{{ '<|im_start|>system
' + message['content'] }}
{% if message.get('functions', none) is not none %}
{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'user' %}
{% if message.get('functions', none) is not none %}
{{ '<|im_start|>user
' + message['content'] + '
' + '<functions>' + message['functions'] + '</functions><|im_end|>
' }}
{% else %}
{{ '<|im_start|>user
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% elif message['role'] == 'assistant' %}
{{ '<|im_start|>assistant
' }}
{% if message.get('content', none) is not none %}
{{ message['content'] }}
{% endif %}
{% if message.get('function_calls', none) is not none %}
{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
{% endif %}
{% if not loop.last %}
{{ '<|im_end|>' + '
' }}
{% else %}
{{ eos_token }}
{% endif %}
{% elif message['role'] == 'environment' %}
{{ '<|im_start|>environment
' + message['content'] + '<|im_end|>
' }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|im_start|>assistant
' }}
{% endif %}
{% endfor %}
```
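To make the control flow of the reference template easier to follow, here is a minimal pure-Python sketch that mirrors its branches. It is not the code `transformers` runs, and the function name `render_chat` is hypothetical. It assumes that only the string literals in the template are emitted (the line breaks between Jinja tags in the listing are treated as formatting), that `functions` and `function_calls` are already strings, as the template's concatenation implies, and that the caller supplies `eos_token`, since its value is not stated here.

```python
# Default system turn, copied from the reference template above.
DEFAULT_SYSTEM = (
    "<|im_start|>system\n"
    "You are Olmo, a helpful function-calling AI assistant built by Ai2. "
    "Your date cutoff is December 2024, and your model weights are available at "
    "https://huggingface.co/allenai. You do not currently have access to any "
    "functions. <functions></functions><|im_end|>\n"
)

NO_FUNCTIONS = (
    " You do not currently have access to any functions. "
    "<functions></functions><|im_end|>\n"
)


def render_chat(messages, eos_token, add_generation_prompt=False):
    """Hypothetical mirror of the reference template; returns the prompt string."""
    out = []
    # A default system turn is emitted only when no system message is present.
    if not any(m["role"] == "system" for m in messages):
        out.append(DEFAULT_SYSTEM)

    for index, m in enumerate(messages):
        is_last = index == len(messages) - 1
        role = m["role"]
        if role == "system":
            out.append("<|im_start|>system\n" + m["content"])
            if m.get("functions") is not None:
                out.append(" <functions>" + m["functions"] + "</functions><|im_end|>\n")
            else:
                out.append(NO_FUNCTIONS)
        elif role == "user":
            out.append("<|im_start|>user\n" + m["content"])
            if m.get("functions") is not None:
                out.append("\n<functions>" + m["functions"] + "</functions>")
            out.append("<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                out.append(m["content"])
            if m.get("function_calls") is not None:
                out.append("<function_calls>" + m["function_calls"] + "</function_calls>")
            # The final assistant turn ends with the EOS token, not <|im_end|>.
            out.append(eos_token if is_last else "<|im_end|>\n")
        elif role == "environment":
            out.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        if is_last and add_generation_prompt:
            out.append("<|im_start|>assistant\n")
    return "".join(out)


if __name__ == "__main__":
    demo = [{"role": "user", "content": "Hi"}]
    print(render_chat(demo, eos_token="<EOS>", add_generation_prompt=True))
```

For actual inference, prefer `tokenizer.apply_chat_template(...)`, which uses the template stored in `tokenizer_config.json`; the sketch is only meant to show which special tokens surround each role.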