---
license: isc
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- nick007x/github-code-2025
language:
- fr
- en
- zh
pipeline_tag: token-classification
tags:
- code
---

# Multistral Tokenizer

Training completed successfully!

## Configuration
- Vocabulary size: 127,989
- Special tokens: 13
- Min frequency: 2
- Training samples: up to 500,000

## Datasets
- nick007x/github-code-2025 (35%)
- HuggingFaceFW/fineweb-2 - Lojban (10%)
- HuggingFaceFW/fineweb-2 - French (15%)
- HuggingFaceFW/fineweb-2 - Chinese (15%)
- HuggingFaceFW/fineweb - English (25%)

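To make the mix concrete, the percentages can be turned into per-source sample budgets. This is a minimal sketch that assumes the percentages apply to sample counts (not tokens or bytes) and that the 500,000 figure is an upper bound. The names `MIX`, `MAX_SAMPLES`, and `sample_budget` are hypothetical and not part of the training code.

```python
# Hypothetical illustration: dataset shares taken from the list above.
MIX = {
    "nick007x/github-code-2025": 35,
    "HuggingFaceFW/fineweb-2 (Lojban)": 10,
    "HuggingFaceFW/fineweb-2 (French)": 15,
    "HuggingFaceFW/fineweb-2 (Chinese)": 15,
    "HuggingFaceFW/fineweb (English)": 25,
}

# Upper bound on training samples, per the configuration ("up to 500,000").
MAX_SAMPLES = 500_000


def sample_budget(mix, total):
    """Return the maximum number of samples per source for a percentage mix."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {name: total * pct // 100 for name, pct in mix.items()}


budget = sample_budget(MIX, MAX_SAMPLES)
for name, count in budget.items():
    print(f"{name}: up to {count:,} samples")
```

Under these assumptions, code receives the largest share of the budget and Lojban the smallest.
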
## Special Tokens
```<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>```

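Special tokens are usually treated as atomic units rather than being broken into sub-word pieces. The sketch below shows that idea at the string level only: it isolates the 13 tokens listed above from the surrounding text with a regular expression. It does not reflect how this tokenizer handles them internally, and `split_on_special_tokens` is a hypothetical helper.

```python
import re

# The 13 special tokens listed above.
SPECIAL_TOKENS = [
    "<|begin|>", "<|return|>", "<|pad|>", "<|start|>", "<|channel|>",
    "<|end|>", "<|message|>", "<|image|>", "<|video|>", "<|audio|>",
    "<|call|>", "<|constrain|>", "<|unknown|>",
]

# One alternation with a capturing group, so re.split keeps the matched tokens.
_SPECIAL_RE = re.compile(
    "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
)


def split_on_special_tokens(text):
    """Split text into (piece, is_special) pairs, keeping special tokens whole."""
    parts = []
    for piece in _SPECIAL_RE.split(text):
        if piece:  # re.split yields empty strings at the edges; skip them
            parts.append((piece, piece in SPECIAL_TOKENS))
    return parts


example = "<|start|>user<|message|>hi<|end|>"
for piece, is_special in split_on_special_tokens(example):
    print(repr(piece), "special" if is_special else "text")
```
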
## Enforced Vocabulary
```analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml```

## Usage

```python
from multistral.multistraltokenizer import MultistralTokenizer

# Load the trained tokenizer from the given path.
tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
# Encode text into tokens, then decode the tokens back into text.
tokens = tokenizer.encode("Your text here")
text = tokenizer.decode(tokens)
```
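
A quick sanity check after loading is an encode/decode round trip over a few representative strings. The sketch below assumes only what the snippet above shows: an object with `encode(text)` returning a sequence of tokens and `decode(tokens)` returning a string. `roundtrip_report` and `CharStub` are hypothetical; the stub stands in for the real tokenizer so the example runs on its own. Tokenizers that normalize text may not reproduce the input exactly, so a mismatch is not necessarily a bug.

```python
def roundtrip_report(tokenizer, texts):
    """Return (text, token_count, roundtrip_ok) for each input string."""
    report = []
    for text in texts:
        tokens = tokenizer.encode(text)
        restored = tokenizer.decode(tokens)
        report.append((text, len(tokens), restored == text))
    return report


class CharStub:
    """Hypothetical stand-in: one token per character (Unicode code points)."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


samples = ["def f(): pass", "你好"]
for text, count, ok in roundtrip_report(CharStub(), samples):
    print(f"{text!r}: {count} tokens, round trip {'ok' if ok else 'differs'}")
```

To check the real tokenizer, pass the loaded `tokenizer` object in place of `CharStub()`.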