---
license: isc
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- nick007x/github-code-2025
language:
- fr
- en
- zh
pipeline_tag: token-classification
tags:
- code
---

# Multistral Tokenizer

Training completed successfully!

## Configuration

- Vocabulary size: 127,989
- Special tokens: 13
- Min frequency: 2
- Training samples: up to 500,000

## Datasets

- nick007x/github-code-2025 (35%)
- HuggingFaceFW/fineweb-2 - Lojban (10%)
- HuggingFaceFW/fineweb-2 - French (15%)
- HuggingFaceFW/fineweb-2 - Chinese (15%)
- HuggingFaceFW/fineweb - English (25%)

## Special Tokens

```
<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
```

## Enforced Vocabulary

```
analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
```

## Usage

```python
from multistral.multistraltokenizer import MultistralTokenizer

tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
tokens = tokenizer.encode("Your text here")
text = tokenizer.decode(tokens)
```
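
To make the dataset mix concrete, the percentages listed under Datasets can be converted into per-source sample budgets. The sketch below is illustrative only: `sample_budgets` is a hypothetical helper, not part of the `multistral` package, and it assumes the full 500,000-sample cap was reached, whereas the configuration only states "up to 500,000".

```python
# Hypothetical helper: per-source sample budgets implied by the dataset mix.
# Assumes the full cap is used; the real run may have sampled fewer documents.
MAX_SAMPLES = 500_000

# Percentages copied from the Datasets section.
MIX_PERCENT = {
    "nick007x/github-code-2025": 35,
    "HuggingFaceFW/fineweb-2 - Lojban": 10,
    "HuggingFaceFW/fineweb-2 - French": 15,
    "HuggingFaceFW/fineweb-2 - Chinese": 15,
    "HuggingFaceFW/fineweb - English": 25,
}


def sample_budgets(mix_percent, max_samples):
    """Return the number of samples per source for a percentage mix."""
    total = sum(mix_percent.values())
    if total != 100:
        raise ValueError(f"mix must sum to 100%, got {total}%")
    # Integer arithmetic avoids floating-point rounding in the budgets.
    return {name: max_samples * pct // 100 for name, pct in mix_percent.items()}


budgets = sample_budgets(MIX_PERCENT, MAX_SAMPLES)
for name, count in budgets.items():
    print(f"{name}: {count:,}")
```

Under that assumption, code accounts for 175,000 samples and English web text for 125,000, with the remaining 200,000 split across Lojban, French, and Chinese.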