damfle committed on
Commit 13a5ba1 · unverified · 1 Parent(s): b54dd21

doc(readme): add languages detail

  1. README.md +7 -7
README.md CHANGED
@@ -23,16 +23,16 @@ Training completed successfully!
 
  ## Datasets
  - nick007x/github-code-2025 (35%)
- - HuggingFaceFW/fineweb-2 (10%)
- - HuggingFaceFW/fineweb-2 (15%)
- - HuggingFaceFW/fineweb-2 (15%)
- - HuggingFaceFW/fineweb (25%)
+ - HuggingFaceFW/fineweb-2 - Lojban (10%)
+ - HuggingFaceFW/fineweb-2 - French (15%)
+ - HuggingFaceFW/fineweb-2 - Chinese (15%)
+ - HuggingFaceFW/fineweb - English (25%)
 
  ## Special Tokens
- <|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
+ ```<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>```
 
  ## Enforced Vocabulary
- analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
+ ```analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml```
 
  ## Usage
 
@@ -42,4 +42,4 @@ from multistral.multistraltokenizer import MultistralTokenizer
  tokenizer = MultistralTokenizer.from_pretrained("models/aizia_tokenizer")
  tokens = tokenizer.encode("Your text here")
  text = tokenizer.decode(tokens)
- ```
+ ```
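
As a quick sanity check on the updated dataset list, the percentages can be summed to confirm the mix still covers 100%. This is a minimal sketch, assuming the bullets keep the `name - Language (NN%)` shape shown in the diff; the `parse_mix` helper and `LINE_RE` pattern are hypothetical and not part of the repository.

```python
import re

# Dataset bullets as they appear on the "+" side of the diff above.
README_LINES = [
    "- nick007x/github-code-2025 (35%)",
    "- HuggingFaceFW/fineweb-2 - Lojban (10%)",
    "- HuggingFaceFW/fineweb-2 - French (15%)",
    "- HuggingFaceFW/fineweb-2 - Chinese (15%)",
    "- HuggingFaceFW/fineweb - English (25%)",
]

# Hypothetical pattern: "- <label> (<integer>%)".
LINE_RE = re.compile(r"^- (?P<label>.+) \((?P<pct>\d+)%\)$")


def parse_mix(lines):
    """Split each bullet into a (label, percent) pair; fail loudly on bad lines."""
    mix = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognized dataset line: {line!r}")
        mix.append((match.group("label"), int(match.group("pct"))))
    return mix


mix = parse_mix(README_LINES)
total = sum(pct for _, pct in mix)
print(f"{len(mix)} sources, total weight {total}%")
```

Because the language suffix is kept as part of the label, the three `fineweb-2` entries stay distinguishable after parsing, which is the point of this commit.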