damfle committed
init: initial commit

- README.md +32 -3
- special_tokens_map.json +30 -0
- tokenizer.json +0 -0
- tokenizer_config.json +31 -0
README.md
CHANGED

@@ -1,3 +1,32 @@
-
-
-
+# Multistral Tokenizer
+
+Training completed successfully!
+
+## Configuration
+- Vocabulary size: 127,989
+- Special tokens: 13
+- Min frequency: 2
+- Training samples: up to 500,000
+
+## Datasets
+- nick007x/github-code-2025 (35%)
+- HuggingFaceFW/fineweb-2 (10%)
+- HuggingFaceFW/fineweb-2 (15%)
+- HuggingFaceFW/fineweb-2 (15%)
+- HuggingFaceFW/fineweb (25%)
+
+## Special Tokens
+<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
+
+## Enforced Vocabulary
+analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
+
+## Usage
+
+```python
+from multistral.multistraltokenizer import MultistralTokenizer
+
+tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
+tokens = tokenizer.encode("Your text here")
+text = tokenizer.decode(tokens)
+```
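Note: the README's Usage snippet depends on the custom multistral package. Because the tokenizer.json added in this commit uses the standard Hugging Face `tokenizers` serialization format, the artifact should also load without that package. A minimal sketch, assuming a local checkout of this repo (the path is illustrative):

```python
from tokenizers import Tokenizer

# Load the serialized tokenizer added in this commit.
tokenizer = Tokenizer.from_file("tokenizer.json")

encoding = tokenizer.encode("Your text here")
print(encoding.ids)                    # token ids
print(tokenizer.decode(encoding.ids))  # round-trip back to text
```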
special_tokens_map.json
ADDED

@@ -0,0 +1,30 @@
+{
+  "bos_token": {
+    "content": "<|begin|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|return|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|pad|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|unknown|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
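Note: the four flags on each entry above mirror the attributes of `tokenizers.AddedToken`, which is how the Hugging Face stack represents special tokens internally. A sketch of the equivalent construction for the BOS entry (illustrative, not part of the commit):

```python
from tokenizers import AddedToken

# Equivalent of the "bos_token" entry in special_tokens_map.json.
bos = AddedToken(
    "<|begin|>",
    lstrip=False,       # keep whitespace to the left when matching
    rstrip=False,       # keep whitespace to the right
    normalized=False,   # match the raw string, bypassing the normalizer
    single_word=False,  # may also match inside a longer word
)
```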
tokenizer.json
ADDED

The diff for this file is too large to render.
tokenizer_config.json
ADDED

@@ -0,0 +1,31 @@
+{
+  "additional_special_tokens": [
+    "<|start|>",
+    "<|channel|>",
+    "<|end|>",
+    "<|message|>",
+    "<|image|>",
+    "<|video|>",
+    "<|audio|>",
+    "<|call|>",
+    "<|constrain|>"
+  ],
+  "backend": "tokenizers",
+  "bos_token": "<|begin|>",
+  "eos_token": "<|return|>",
+  "extra_special_tokens": [
+    "<|start|>",
+    "<|channel|>",
+    "<|end|>",
+    "<|message|>",
+    "<|image|>",
+    "<|video|>",
+    "<|audio|>",
+    "<|call|>",
+    "<|constrain|>"
+  ],
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<|pad|>",
+  "tokenizer_class": "MultistralTokenizer",
+  "unk_token": "<|unknown|>"
+}
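Note: the model_max_length value is int(1e30), the sentinel transformers writes when no real context limit is configured; it is not a meaningful limit. Since tokenizer_class points at the custom MultistralTokenizer, AutoTokenizer may fail to resolve that class, but the generic fast-tokenizer base class can wrap tokenizer.json directly. A minimal sketch, assuming a local checkout of this repo and a hypothetical 32,768-token limit:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer with the generic fast-tokenizer class,
# passing the special tokens declared in tokenizer_config.json.
tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="<|begin|>",
    eos_token="<|return|>",
    pad_token="<|pad|>",
    unk_token="<|unknown|>",
)

# int(1e30) == 1000000000000000019884624838656, i.e. "no limit set".
tok.model_max_length = 32768  # hypothetical real limit for a downstream model
```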