damfle
/

multistral-tokenizer

Token Classification

Model card Files Files and versions

multistral-tokenizer / README.md

damfle's picture

doc(readme): add missing dataset

5e1ff54 verified 24 days ago

|

history blame contribute delete

1.1 kB

	---
	license: isc
	datasets:
	- HuggingFaceFW/fineweb
	- HuggingFaceFW/fineweb-2
	- nick007x/github-code-2025
	language:
	- fr
	- en
	- zh
	pipeline_tag: token-classification
	tags:
	- code
	---
	# Multistral Tokenizer

	Training completed successfully!

	## Configuration
	- Vocabulary size: 127,989
	- Special tokens: 13
	- Min frequency: 2
	- Training samples: up to 500,000

	## Datasets
	- nick007x/github-code-2025 (35%)
	- HuggingFaceFW/fineweb-2 - Lojban (10%)
	- HuggingFaceFW/fineweb-2 - French (15%)
	- HuggingFaceFW/fineweb-2 - Chinese (15%)
	- HuggingFaceFW/fineweb - English (25%)

	## Special Tokens
	```<\|begin\|>, <\|return\|>, <\|pad\|>, <\|start\|>, <\|channel\|>, <\|end\|>, <\|message\|>, <\|image\|>, <\|video\|>, <\|audio\|>, <\|call\|>, <\|constrain\|>, <\|unknown\|>```

	## Enforced Vocabulary
	```analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml```

	## Usage

	```python
	from multistral.multistraltokenizer import MultistralTokenizer

	tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
	tokens = tokenizer.encode("Your text here")
	text = tokenizer.decode(tokens)
	```