damfle committed on
Commit a2385fe · unverified · 1 parent: ca00efb

mod: improve compression and remove garbage

Files changed (4)
  1. README.md +8 -26
  2. chat_template.jinja +2 -2
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +0 -11
README.md CHANGED
@@ -1,17 +1,3 @@
- ---
- license: isc
- datasets:
- - HuggingFaceFW/fineweb
- - HuggingFaceFW/fineweb-2
- - nick007x/github-code-2025
- language:
- - fr
- - en
- - zh
- pipeline_tag: token-classification
- tags:
- - code
- ---
  # Multistral Tokenizer
 
  Training completed successfully!
@@ -19,28 +5,24 @@ Training completed successfully!
  ## Configuration
  - Vocabulary size: 127,989
  - Special tokens: 13
- - Min frequency: 2
- - Training samples: up to 500,000
 
- ## Datasets
- - nick007x/github-code-2025 (35%)
- - HuggingFaceFW/fineweb-2 - Lojban (10%)
- - HuggingFaceFW/fineweb-2 - French (15%)
- - HuggingFaceFW/fineweb-2 - Chinese (15%)
- - HuggingFaceFW/fineweb - English (25%)
 
  ## Special Tokens
- ```<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>```
 
  ## Enforced Vocabulary
- ```analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml```
 
  ## Usage
 
  ```python
  from multistral.multistraltokenizer import MultistralTokenizer
 
- tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
  tokens = tokenizer.encode("Your text here")
  text = tokenizer.decode(tokens)
- ```
 
  # Multistral Tokenizer
 
  Training completed successfully!
 
  ## Configuration
  - Vocabulary size: 127,989
  - Special tokens: 13
+ - Min frequency: 32
+ - Training samples: up to 1,000,000
 
+ ## Dataset
+ - Source: dataset/
 
  ## Special Tokens
+ <|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
 
  ## Enforced Vocabulary
+ analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
 
  ## Usage
 
  ```python
  from multistral.multistraltokenizer import MultistralTokenizer
 
+ tokenizer = MultistralTokenizer.from_pretrained("models/multistral-tokenizer")
  tokens = tokenizer.encode("Your text here")
  text = tokenizer.decode(tokens)
+ ```
chat_template.jinja CHANGED
@@ -93,10 +93,10 @@ type {{ tool.name }} = {% if tool.parameters and tool.parameters.properties %}(_
 
  {#- System Message -#}
  <|start|>system<|message|>{{ model_identity or "You are Aizia, a helpful assistant trained by damfle." }}
- Knowledge cutoff: 2025-12
  Current date: {{ strftime_now("%Y-%m-%d") }}
 
- Reasoning: {{ reasoning_effort or "none" }}
 
  # Valid channels: analysis, commentary, final. Channel must be included for every message.
  {%- if tools %}
 
 
  {#- System Message -#}
  <|start|>system<|message|>{{ model_identity or "You are Aizia, a helpful assistant trained by damfle." }}
+ Knowledge cutoff: {{ knowledge_cutoff or "2025-12" }}
  Current date: {{ strftime_now("%Y-%m-%d") }}
 
+ Reasoning: {{ reasoning_effort or "medium" }}
 
  # Valid channels: analysis, commentary, final. Channel must be included for every message.
  {%- if tools %}
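
Review note: the two changed lines in this hunk rely on the `or` fallback, so a value that is missing or empty falls back to the default ("2025-12" for the knowledge cutoff, "medium" for the reasoning effort). The sketch below is a plain-Python emulation of only the system-message lines shown in the hunk, not the Jinja template itself. The function name `render_system_header` and the `now` parameter are hypothetical, added for illustration; `strftime_now` is approximated with `datetime.strftime`.

```python
from datetime import datetime

# Default identity string copied from the template line in the hunk.
DEFAULT_IDENTITY = "You are Aizia, a helpful assistant trained by damfle."


def render_system_header(model_identity=None, knowledge_cutoff=None,
                         reasoning_effort=None, now=None):
    """Emulate the system-message lines of the hunk (hypothetical helper).

    Python's `or` mirrors the template's `or`: any falsy value
    (None, empty string) is replaced by the default.
    """
    now = now or datetime.now()
    return "\n".join([
        "<|start|>system<|message|>" + (model_identity or DEFAULT_IDENTITY),
        "Knowledge cutoff: " + (knowledge_cutoff or "2025-12"),
        "Current date: " + now.strftime("%Y-%m-%d"),
        "",
        "Reasoning: " + (reasoning_effort or "medium"),
    ])


if __name__ == "__main__":
    # No overrides: the defaults introduced by this commit are used.
    print(render_system_header())
    # Caller-supplied values take precedence over the defaults.
    print(render_system_header(knowledge_cutoff="2024-06",
                               reasoning_effort="high"))
```

One consequence of this pattern is that an explicitly empty `reasoning_effort` is indistinguishable from an unset one, so callers that previously got "none" by default now need to pass it explicitly.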
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,15 +1,4 @@
  {
- "additional_special_tokens": [
- "<|start|>",
- "<|channel|>",
- "<|end|>",
- "<|message|>",
- "<|image|>",
- "<|video|>",
- "<|audio|>",
- "<|call|>",
- "<|constrain|>"
- ],
  "backend": "tokenizers",
  "bos_token": "<|begin|>",
  "eos_token": "<|return|>",
 
  {
  "backend": "tokenizers",
  "bos_token": "<|begin|>",
  "eos_token": "<|return|>",