gsaltintas committed b3e0464 (verified), 1 parent: b47e774

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ license: mit
+ language:
+ - fw
+ tags:
+ - tokenizer
+ - bpe
+ - flexitok
+ - fineweb2
+ ---
+
+ # Byte-Level BPE Tokenizer: fw_edu (32K)
+
+ A **Byte-Level BPE** tokenizer with a 32,000-token target vocabulary, trained on the **fw_edu** data from Fineweb-2-HQ.
+
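As a rough illustration of what "byte-level BPE" means, the sketch below runs a single merge step on UTF-8 bytes: it counts adjacent pairs and replaces the most frequent one with a new ID. The helper names `most_frequent_pair` and `merge_pair` are hypothetical. They are not taken from the training code behind this tokenizer, which is not part of this commit.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent ID pair that occurs most often across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Start from raw UTF-8 bytes, so the base alphabet is the 256 byte values.
corpus = [list("aaab".encode("utf-8")), list("aab".encode("utf-8"))]
pair = most_frequent_pair(corpus)  # (97, 97), the bytes of "aa"
merged = [merge_pair(seq, pair, 256) for seq in corpus]
print(pair, merged)  # (97, 97) [[256, 97, 98], [256, 98]]
```

A full trainer would repeat this step, assigning a fresh ID each time, until the vocabulary reaches its target size.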
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | Byte-Level BPE |
+ | Language | `fw_edu` |
+ | Target Vocab Size | 32,000 |
+ | Final Vocab Size | 0 |
+ | Pre-tokenizer | ByteLevel |
+ | Normalizer | NFC |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 2 |
+ | Data Source | `/scratch/gsa/data/flexitok//fw_edu/` |
+
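The Normalizer and Pre-tokenizer rows can be illustrated with the standard library alone. The sketch below applies NFC normalization and then takes the UTF-8 bytes of the result. The function name `nfc_bytes` is hypothetical. This is only an approximation of the pipeline: the ByteLevel pre-tokenizer in the `tokenizers` library also splits the text and remaps each byte to a printable character, which is omitted here.

```python
import unicodedata


def nfc_bytes(text: str) -> list[int]:
    """NFC-normalize `text`, then return its UTF-8 byte values."""
    normalized = unicodedata.normalize("NFC", text)
    return list(normalized.encode("utf-8"))


composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# After NFC, both spellings yield the same byte sequence.
print(nfc_bytes(composed))    # [99, 97, 102, 195, 169]
print(nfc_bytes(decomposed))  # [99, 97, 102, 195, 169]
```

Because every byte value lies in the range 0 to 255, any input string can be represented without falling back to an unknown token.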
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Replace "<repo_id>" with this repository's ID on the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("<repo_id>")
+ # encode() returns a list of integer token IDs.
+ token_ids = tokenizer.encode("Hello, world!")
+ ```
+
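One way to sanity-check a loaded tokenizer is an encode/decode round trip. The helper below is a sketch, and `roundtrip_ok` is a hypothetical name. It assumes only that the object exposes Hugging Face-style `encode` and `decode` methods, and it compares against the NFC form of the input because the table above lists an NFC normalizer.

```python
import unicodedata


def roundtrip_ok(tokenizer, text: str) -> bool:
    """Return True if decode(encode(text)) reproduces the NFC form of `text`."""
    # Leave out special tokens so that only the text itself is compared.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return decoded == unicodedata.normalize("NFC", text)
```

A `False` result does not necessarily indicate a broken tokenizer: settings such as a prefix space or tokenization-space cleanup during decoding can also change the output.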
+ ## Files
+
+ - `tokenizer.json`: full Hugging Face tokenizer definition
+ - `vocab.json`: vocabulary mapping
+ - `merges.txt`: BPE merge rules
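
The last two files can also be read without any tokenizer library. The sketch below assumes the common GPT-2-style layout: `vocab.json` as a JSON object mapping token strings to integer IDs, and `merges.txt` with one space-separated pair per line plus an optional `#version` header. The commit does not show the file contents, so treat both as assumptions. The function name `load_vocab_and_merges` is hypothetical.

```python
import json
from pathlib import Path


def load_vocab_and_merges(directory):
    """Read vocab.json and merges.txt from `directory` (assumed GPT-2-style layout)."""
    directory = Path(directory)
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))

    merges = []
    merges_text = (directory / "merges.txt").read_text(encoding="utf-8")
    for line in merges_text.splitlines():
        # Skip blank lines and the optional "#version" header.
        if not line.strip() or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        merges.append((left, right))
    return vocab, merges
```

The order of `merges` matters: earlier pairs were learned first and are applied first when encoding.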