gsaltintas committed on
Commit 0d82a7b · verified · 1 Parent(s): f497908

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +52 -0
  2. merges.txt +0 -0
  3. special_tokens_map.json +5 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +0 -0
  6. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ license: mit
+ language:
+ - fra
+ - ita
+ - por
+ - spa #['fra_Latn', 'ita_Latn', 'por_Latn', 'spa_Latn'] # ISO 639-3 code or "und" if not identifiable
+ tags:
+ - tokenizer
+ - bpe
+ - flexitok
+ - fineweb2
+ ---
+
+ # Byte-Level BPE Tokenizer: fra_Latn, ita_Latn, por_Latn, spa_Latn (32K)
+
+ A **Byte-Level BPE** tokenizer trained on **fra_Latn, ita_Latn, por_Latn, and spa_Latn** (French, Italian, Portuguese, and Spanish) data from Fineweb-2-HQ.
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | Byte-Level BPE |
+ | Language | `['fra_Latn', 'ita_Latn', 'por_Latn', 'spa_Latn']` |
+ | Target Vocab Size | 32,000 |
+ | Final Vocab Size | 32,871 |
+ | Pre-tokenizer | custom:fra_Latn |
+ | Number handling | ltr_3digit |
+ | Contraction handling | True |
+ | Normalizer | NFC |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 8 |
+
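The `Normalizer` and `Number handling` rows can be illustrated with a small sketch. It assumes that `ltr_3digit` means runs of digits are split left to right into groups of at most three, which is consistent with the `123, 45` split in the sample encoding but is not confirmed by this README. The helper names `normalize_nfc` and `split_digits_ltr` are hypothetical and are not part of the tokenizer's own code.

```python
import re
import unicodedata


def normalize_nfc(text: str) -> str:
    """Apply Unicode NFC normalization (composes base letter + combining mark)."""
    return unicodedata.normalize("NFC", text)


def split_digits_ltr(text: str, group: int = 3) -> list[str]:
    """Split text so that digit runs are cut left to right into chunks of
    at most `group` digits; non-digit runs are kept whole.

    Assumption: this is what `ltr_3digit` refers to.
    """
    pattern = rf"\d{{1,{group}}}|\D+"
    return re.findall(pattern, text)


# "e" followed by a combining acute accent becomes the single code point "é".
composed = normalize_nfc("e\u0301")
print(len(composed))

# A five-digit number is cut into a three-digit chunk and a two-digit remainder.
print(split_digits_ltr("12345"))
```

Under this reading, grouping starts at the leftmost digit, so any short remainder ends up at the right-hand end of the number.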
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_script_Roma_32000")
+ tokens = tokenizer.encode("Hello, world!")
+ ```
+
+ ## Files
+
+ - `tokenizer.json`: Full Hugging Face tokenizer
+ - `vocab.json`: Vocabulary mapping
+ - `merges.txt`: BPE merge rules
+
+ ## Sample Encoding
+ | Text | Tokens | Token IDs |
+ |------|--------|-----------|
+ | `Hello, world! 12345 This is a test. こんにちは` | `H, ello, ,, Ġworld, !, Ġ, 123, 45, ĠThis, Ġis, Ġa, Ġtest, ., Ġ, ãģ, ĵ, ãĤ, ĵ, ãģ, «` | `42, 2110, 14, 25291, 3, 223, 22415, 4328, 17636, 1008, 267, 3037, 16, 223, 8090, 244, 14187, 244, 8090, 107` |
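The `Ġ` prefixes and the `ãģ, ĵ` pieces in the sample row are how byte-level BPE tokens are usually displayed: each UTF-8 byte is shown as one printable character. The sketch below reproduces the byte-to-character table popularized by GPT-2, on the assumption that this tokenizer uses the same table; the sample output is consistent with that, but the README does not state it. The names `bytes_to_unicode`, `BYTE_TO_CHAR`, and `to_byte_level` are illustrative only.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map every byte value (0-255) to a printable Unicode character.

    Bytes that are already printable keep their own code point; the rest
    are shifted to code points starting at 256, in ascending byte order.
    """
    printable = (
        list(range(0x21, 0x7F))     # "!" .. "~"
        + list(range(0xA1, 0xAD))   # "¡" .. "¬"
        + list(range(0xAE, 0x100))  # "®" .. "ÿ"
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text: str) -> str:
    """Render text the way byte-level BPE vocabularies display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" world"))  # Ġworld
print(to_byte_level("こ"))      # ãģĵ
```

Under this mapping, the space byte becomes `Ġ`, and the three UTF-8 bytes of `こ` become `ãģĵ`, which the sample row shows split into the two tokens `ãģ` and `ĵ`. A script outside the four training languages therefore falls back to byte-sized pieces rather than to `<unk>`.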
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "bos_token": "<s>",
+ "eos_token": "</s>",
+ "pad_token": "<pad>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff