---
library_name: transformers
tags: []
---

# Dolma 2 tokenizer, Instruct v5, Non-reasoner version

Slightly modified version of `cl100k_base` that supports Dolma 1.x and Dolma 2.x special tokens.

## Special tokens

This tokenizer supports the following special tokens:

- `<|extra_id_0|>`: Not used.
- `<|endoftext|>`: Used to mark both beginning and end of text.
- `<|fim_prefix|>`: Marks the prefix of a fill-in-the-middle request.
- `<|fim_middle|>`: Marks the middle of a fill-in-the-middle request.
- `<|fim_suffix|>`: Marks the suffix of a fill-in-the-middle request.
- `|||PHONE_NUMBER|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||EMAIL_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `|||IP_ADDRESS|||`: Not used. Kept for compatibility with Dolma 1.x.
- `<|im_start|>`: Indicates the beginning of a message (turn in a conversation).
- `<|im_end|>`: Indicates the end of a message (turn in a conversation).
- `<functions>`: Indicates start of function definitions in the system prompt for tool use.
- `</functions>`: Indicates end of function definitions in the system prompt.
- `<function_calls>`: Indicates start of function calls made by the model.
- `</function_calls>`: Indicates end of function calls made by the model.
- `<|extra_id_1|>`: Not used.
- `<|extra_id_2|>`: Not used.
- `<|extra_id_3|>`: Not used.
- `<|extra_id_4|>`: Not used.
- `<|extra_id_5|>`: Not used.
- `<|extra_id_6|>`: Not used.
- `<|extra_id_7|>`: Not used.
- `<|extra_id_8|>`: Not used.
- `<|extra_id_9|>`: Not used.
- `<|extra_id_10|>`: Not used.
- `<|endofprompt|>`: Not used.
- `<|pad|>`: Symbol to pad input sequences.


## Chat template

The chat template is as follows (**for reference only**; the actual template is in `tokenizer_config.json`):

```jinja
{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
{{ '<|im_start|>system
You are Olmo, a helpful function-calling AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai. You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
{% endif %}
{% for message in messages %}
    {% if message['role'] == 'system' %}
{{ '<|im_start|>system
' + message['content'] }}
        {% if message.get('functions', none) is not none %}
{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
' }}
        {% else %}
{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
        {% endif %}
    {% elif message['role'] == 'user' %}
        {% if message.get('functions', none) is not none %}
{{ '<|im_start|>user
' + message['content'] + '
' + '<functions>' + message['functions'] + '</functions><|im_end|>
' }}
        {% else %}
{{ '<|im_start|>user
' + message['content'] + '<|im_end|>
' }}
        {% endif %}
    {% elif message['role'] == 'assistant' %}
{{ '<|im_start|>assistant
' }}
        {% if message.get('content', none) is not none %}
{{ message['content'] }}
        {% endif %}
        {% if message.get('function_calls', none) is not none %}
{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
        {% endif %}
        {% if not loop.last %}
{{ '<|im_end|>' + '
' }}
        {% else %}
{{ eos_token }}
        {% endif %}
    {% elif message['role'] == 'environment' %}
{{ '<|im_start|>environment
' + message['content'] + '<|im_end|>
' }}
    {% endif %}
    {% if loop.last and add_generation_prompt %}
{{ '<|im_start|>assistant
' }}
    {% endif %}
{% endfor %}
```
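
To make the control flow of the template easier to follow, here is a minimal pure-Python sketch that mirrors its branches. It is an illustration only, not the code `transformers` runs: the function name `render_chat` and the constant `DEFAULT_SYSTEM` are hypothetical, the default `eos_token` value of `<|endoftext|>` is an assumption (check `tokenizer_config.json` for the real one), and the sketch ignores any extra whitespace the pretty-printed template above would emit between blocks.

```python
# Hypothetical helper that mirrors the reference chat template above.
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful function-calling AI assistant built by Ai2. "
    "Your date cutoff is December 2024, and your model weights are available "
    "at https://huggingface.co/allenai. You do not currently have access to "
    "any functions. <functions></functions>"
)


def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Render a list of message dicts into a single prompt string.

    eos_token defaults to an assumed value; pass the tokenizer's real one.
    """
    out = []

    # Insert the default system prompt when the conversation has none.
    if not any(m["role"] == "system" for m in messages):
        out.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")

    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        role = m["role"]
        functions = m.get("functions")

        if role == "system":
            out.append("<|im_start|>system\n" + m["content"])
            if functions is not None:
                out.append(" <functions>" + functions + "</functions><|im_end|>\n")
            else:
                out.append(
                    " You do not currently have access to any functions. "
                    "<functions></functions><|im_end|>\n"
                )
        elif role == "user":
            if functions is not None:
                out.append(
                    "<|im_start|>user\n" + m["content"] + "\n"
                    + "<functions>" + functions + "</functions><|im_end|>\n"
                )
            else:
                out.append("<|im_start|>user\n" + m["content"] + "<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                out.append(m["content"])
            if m.get("function_calls") is not None:
                out.append(
                    "<function_calls>" + m["function_calls"] + "</function_calls>"
                )
            # A final assistant turn is closed with EOS, earlier ones with <|im_end|>.
            out.append(eos_token if last else "<|im_end|>\n")
        elif role == "environment":
            out.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")

        # Open a new assistant turn after the last message when requested.
        if last and add_generation_prompt:
            out.append("<|im_start|>assistant\n")

    return "".join(out)


prompt = render_chat([{"role": "user", "content": "Hi"}], add_generation_prompt=True)
```

With a single user message and `add_generation_prompt=True`, `prompt` begins with the default system turn and ends with an open `<|im_start|>assistant` turn. For real use, prefer the tokenizer's `apply_chat_template` method, which reads the authoritative template from `tokenizer_config.json`.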