mod: improve compression and remove garbage

Files changed (4)

README.md CHANGED

@@ -1,17 +1,3 @@
----
-license: isc
-datasets:
-- HuggingFaceFW/fineweb
-- HuggingFaceFW/fineweb-2
-- nick007x/github-code-2025
-language:
-- fr
-- en
-- zh
-pipeline_tag: token-classification
-tags:
-- code
----
 # Multistral Tokenizer
 Training completed successfully!
@@ -19,28 +5,24 @@ Training completed successfully!
 ## Configuration
 - Vocabulary size: 127,989
 - Special tokens: 13
-- Min frequency: 2
-- Training samples: up to 500,000
-## Datasets
-- nick007x/github-code-2025 (35%)
-- HuggingFaceFW/fineweb-2 - Lojban (10%)
-- HuggingFaceFW/fineweb-2 - French (15%)
-- HuggingFaceFW/fineweb-2 - Chinese (15%)
-- HuggingFaceFW/fineweb - English (25%)
 ## Special Tokens
-```<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>```
 ## Enforced Vocabulary
-```analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml```
 ## Usage
 ```python
 from multistral.multistraltokenizer import MultistralTokenizer
-tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
 tokens = tokenizer.encode("Your text here")
 text = tokenizer.decode(tokens)
-```

 # Multistral Tokenizer
 Training completed successfully!
 ## Configuration
 - Vocabulary size: 127,989
 - Special tokens: 13
+- Min frequency: 32
+- Training samples: up to 1,000,000
+## Dataset
+- Source: dataset/
 ## Special Tokens
+<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
 ## Enforced Vocabulary
+analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
 ## Usage
 ```python
 from multistral.multistraltokenizer import MultistralTokenizer
+tokenizer = MultistralTokenizer.from_pretrained("models/multistral-tokenizer")
 tokens = tokenizer.encode("Your text here")
 text = tokenizer.decode(tokens)
+```
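
The README lists 13 special tokens, and the `tokenizer_config.json` change in this commit removes a 9-entry `additional_special_tokens` array. A minimal sketch of that accounting, using only the token strings visible in this diff; the helper name `not_in_removed_list` is hypothetical, and whether the 9 tokens are still registered as special depends on `tokenizer.json`, whose diff is not rendered here.

```python
# Tokens listed under "Special Tokens" in the README (13 in total).
README_SPECIAL = [
    "<|begin|>", "<|return|>", "<|pad|>", "<|start|>", "<|channel|>",
    "<|end|>", "<|message|>", "<|image|>", "<|video|>", "<|audio|>",
    "<|call|>", "<|constrain|>", "<|unknown|>",
]

# Entries of the additional_special_tokens array removed from tokenizer_config.json.
REMOVED_ADDITIONAL = [
    "<|start|>", "<|channel|>", "<|end|>", "<|message|>", "<|image|>",
    "<|video|>", "<|audio|>", "<|call|>", "<|constrain|>",
]


def not_in_removed_list(readme_tokens, removed_tokens):
    """Return README tokens that were never in the removed array, in README order."""
    removed = set(removed_tokens)
    return [tok for tok in readme_tokens if tok not in removed]


remaining = not_in_removed_list(README_SPECIAL, REMOVED_ADDITIONAL)
print(len(README_SPECIAL), len(REMOVED_ADDITIONAL), remaining)
```

Of the four tokens left over, `<|begin|>` and `<|return|>` appear in the config as `bos_token` and `eos_token`; the config keys for `<|pad|>` and `<|unknown|>` are not visible in the rendered excerpt.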

chat_template.jinja CHANGED

@@ -93,10 +93,10 @@ type {{ tool.name }} = {% if tool.parameters and tool.parameters.properties %}(_
 {#- System Message -#}
 <|start|>system<|message|>{{ model_identity or "You are Aizia, a helpful assistant trained by damfle." }}
-Knowledge cutoff: 2025-12
 Current date: {{ strftime_now("%Y-%m-%d") }}
-Reasoning: {{ reasoning_effort or "none" }}
 # Valid channels: analysis, commentary, final. Channel must be included for every message.
 {%- if tools %}

 {#- System Message -#}
 <|start|>system<|message|>{{ model_identity or "You are Aizia, a helpful assistant trained by damfle." }}
+Knowledge cutoff: {{ knowledge_cutoff or "2025-12" }}
 Current date: {{ strftime_now("%Y-%m-%d") }}
+Reasoning: {{ reasoning_effort or "medium" }}
 # Valid channels: analysis, commentary, final. Channel must be included for every message.
 {%- if tools %}
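
The template change replaces the hard-coded knowledge cutoff with an overridable `knowledge_cutoff` variable and moves the default reasoning effort from "none" to "medium". A rough plain-Python sketch of the same `value or default` fallbacks follows; the function `render_system_header` and its `today` parameter are hypothetical stand-ins for the Jinja template and `strftime_now`, not part of the repository.

```python
from datetime import date


def render_system_header(model_identity=None, knowledge_cutoff=None,
                         reasoning_effort=None, today=None):
    """Mirror the `x or default` fallbacks of the updated system message."""
    # Assumption: `today` stands in for strftime_now("%Y-%m-%d") in the template.
    today = today or date.today()
    lines = [
        "<|start|>system<|message|>"
        + (model_identity or "You are Aizia, a helpful assistant trained by damfle."),
        f"Knowledge cutoff: {knowledge_cutoff or '2025-12'}",
        f"Current date: {today.strftime('%Y-%m-%d')}",
        f"Reasoning: {reasoning_effort or 'medium'}",
    ]
    return "\n".join(lines)


# With no overrides, the new defaults apply.
print(render_system_header(today=date(2026, 1, 2)))
# Explicit values take precedence over the defaults.
print(render_system_header(knowledge_cutoff="2026-03", reasoning_effort="high"))
```

As with Jinja's `or`, an empty string counts as falsy here, so passing `reasoning_effort=""` still yields "medium".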

tokenizer.json CHANGED

The diff for this file is too large to render.

tokenizer_config.json CHANGED

@@ -1,15 +1,4 @@
 {
-  "additional_special_tokens": [
-    "<|start|>",
-    "<|channel|>",
-    "<|end|>",
-    "<|message|>",
-    "<|image|>",
-    "<|video|>",
-    "<|audio|>",
-    "<|call|>",
-    "<|constrain|>"
-  ],
   "backend": "tokenizers",
   "bos_token": "<|begin|>",
   "eos_token": "<|return|>",

 {
   "backend": "tokenizers",
   "bos_token": "<|begin|>",
   "eos_token": "<|return|>",