gsaltintas committed (verified)
Commit fdfb6ff · 1 Parent(s): fae322c

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +49 -0
  2. merges.txt +0 -0
  3. special_tokens_map.json +11 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +0 -0
  6. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ ---
+ license: mit
+ language:
+ - und # undetermined (ISO 639-3); the training data is numeric, not a natural language
+ tags:
+ - tokenizer
+ - bpe
+ - flexitok
+ - fineweb2
+ ---
+
+ # Byte-Level BPE Tokenizer: numeric (10K)
+
+ A **Byte-Level BPE** tokenizer trained on **numeric** data from Fineweb-2-HQ.
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | Byte-Level BPE |
+ | Language | `numeric` |
+ | Target Vocab Size | 10,007 |
+ | Final Vocab Size | 10,007 |
+ | Pre-tokenizer | byte_level |
+ | Number handling | rtl_4digit (digit runs split into 4-digit groups from the right) |
+ | Contraction handling | False |
+ | Normalizer | NONE |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 1 |
+
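+ The `rtl_4digit` setting splits each run of digits into 4-digit groups counting from the right, so only the leftmost group can be shorter than four digits. A minimal sketch of that grouping rule, using a hypothetical helper rather than code shipped with this repo:
+
+ ```python
+ import re
+
+ def rtl_4digit_chunks(text: str) -> list[str]:
+     """Split each digit run into 4-digit groups, counting from the right."""
+     out = []
+     for piece in re.split(r"(\d+)", text):
+         if piece.isdigit():
+             head = len(piece) % 4  # leftmost group may hold 1-3 digits
+             if head:
+                 out.append(piece[:head])
+             out.extend(piece[i:i + 4] for i in range(head, len(piece), 4))
+         elif piece:
+             out.append(piece)
+     return out
+
+ print(rtl_4digit_chunks("12345009 mod 67"))  # ['1234', '5009', ' mod ', '67']
+ ```
+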
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Replace the placeholder below with this repository's id on the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("<hub-repo-id>")
+ tokens = tokenizer.encode("Hello, world!")
+ ```
+
+ ## Files
+
+ - `tokenizer.json` — Full HuggingFace tokenizer
+ - `vocab.json` — Vocabulary mapping
+ - `merges.txt` — BPE merge rules
+
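+ The `tokenizer.json` file can also be loaded without `transformers`, via the standalone `tokenizers` library. A usage sketch, assuming the file has been downloaded locally:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load the serialized tokenizer directly from the local file.
+ tok = Tokenizer.from_file("tokenizer.json")
+ enc = tok.encode("12345009 mod 67")
+ print(enc.tokens)  # token strings
+ print(enc.ids)     # corresponding vocabulary ids
+ ```
+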
+ ## Sample Encoding
+
+ | Text | Tokens | Token IDs |
+ |------|--------|-----------|
+ | `12345009 mod 67` | `1234`, `5009`, `␣` (space), `mod`, `␣` (space), `67` | `11599`, `15374`, `6`, `4`, `6`, `64` |
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "additional_special_tokens": [
+     "mod",
+     "=",
+     " "
+   ],
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>",
+   "unk_token": "<unk>"
+ }
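
The map above registers `mod`, `=`, and a single space as additional special tokens, so the tokenizer keeps them as atomic tokens when encoding. A quick check, with a placeholder local path standing in for a clone of this repo:

```python
from transformers import AutoTokenizer

# Placeholder path: point this at a local clone of the repository.
tokenizer = AutoTokenizer.from_pretrained("./")
print(tokenizer.additional_special_tokens)  # ['mod', '=', ' ']
```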
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff