gsaltintas committed on
Commit
d9322ce
·
verified ·
1 Parent(s): fd09ac9

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ license: mit
+ language:
+ - fas
+ tags:
+ - tokenizer
+ - unigram
+ - flexitok
+ - fineweb2
+ ---
+
+ # UnigramLM Tokenizer: fas_Arab (64K)
+
+ A **UnigramLM** tokenizer trained on **fas_Arab** data from Fineweb-2-HQ.
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | UnigramLM |
+ | Language | `fas_Arab` |
+ | Target Vocab Size | 64,000 |
+ | Final Vocab Size | 0 |
+ | Pre-tokenizer | ByteLevel |
+ | Normalizer | NFC |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 2 |
+ | Data Source | `/scratch/gsa/data/flexitok//fas_Arab/` |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("<repo_id>")
+ tokens = tokenizer.encode("Hello, world!")
+ ```
+
+ ## Files
+
+ - `tokenizer.json` — Full HuggingFace tokenizer
+ - `vocab.json` — Vocabulary mapping
+ - `tokenizer.model` — SentencePiece protobuf (if available)
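
The README names UnigramLM as the algorithm and NFC as the normalizer, but does not show what that means at encode time. The block below is a minimal sketch, not this tokenizer's implementation: it applies NFC and then picks the split with the highest total log-probability using Viterbi search. The vocabulary, its probabilities, `UNK_LOGPROB`, and the `segment` helper are all hypothetical, and the sketch works on plain characters, whereas the real tokenizer uses a ByteLevel pre-tokenizer.

```python
import math
import unicodedata

# Hypothetical pieces and probabilities, invented for illustration only.
KETAB = "کتاب"  # "book"
HA = "ها"       # plural suffix
TOY_VOCAB = {
    KETAB: math.log(0.10),
    HA: math.log(0.08),
    KETAB[:2]: math.log(0.02),
    KETAB[2:]: math.log(0.02),
}
# Assumed score for a single character that is not in the toy vocabulary.
UNK_LOGPROB = math.log(1e-6)


def segment(text, vocab=TOY_VOCAB):
    """Return the highest-scoring split of `text` into pieces (Viterbi search)."""
    text = unicodedata.normalize("NFC", text)  # same normalizer the table lists
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                score = UNK_LOGPROB  # fall back to an unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


print(segment(KETAB + HA))
```

With these toy scores the two-piece split (the whole word plus the plural suffix) beats the three-piece split through the shorter fragments, because every extra piece multiplies in another probability below one. The real vocabulary and scores live in `tokenizer.json`.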