Rogarcia18 committed on
Commit 7cd0c06 · verified · 1 Parent(s): c9e63eb

Upload BPE tokenizer trained on WikiText-2

Files changed (1): README.md +31 -0
README.md ADDED
@@ -0,0 +1,31 @@
---
license: apache-2.0
tags:
- tokenizer
- bpe
- wikitext2
- nlp
---

# WikiText-2 BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on the WikiText-2 dataset.

## Model Details
- **Vocabulary Size**: 30,000 tokens
- **Training Data**: WikiText-2 (Salesforce/wikitext)
- **Special Tokens**: [PAD], [UNK], [CLS], [SEP], [MASK]
- **Compression Ratio**: ~6.4 characters per token

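The compression ratio quoted above is simply the average number of characters per token over a corpus. A minimal sketch of how such a figure can be computed (the helper name is illustrative, and the whitespace split stands in for the real BPE encode):

```python
def compression_ratio(texts, encode):
    """Average characters per token across a corpus.

    `encode` is any callable mapping a string to a list of tokens.
    """
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Illustrative only: whitespace split stands in for the BPE encoder
texts = ["the quick brown fox", "jumps over the lazy dog"]
ratio = compression_ratio(texts, lambda t: t.split())
```

With the actual tokenizer, a ratio near 6.4 means each token covers roughly six and a half characters of WikiText-2 text on average.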
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Rogarcia18/wikitext2-bpe-tokenizer")
```

## Training Details
- Dataset: WikiText-2 (wikitext-2-v1)
- Preprocessing: deduplication, `<unk>` removal, whitespace normalization, and removal of samples shorter than 10 characters
- Architecture: BPE trained with the Hugging Face `tokenizers` library
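For readers unfamiliar with BPE, the merge loop at the heart of the algorithm can be sketched in plain Python. This is a simplified illustration of the technique, not the actual training code (the tokenizer itself was trained with the `tokenizers` library, and the toy corpus below is invented):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace occurrences of the pair with its merged symbol.

    Simplified: a plain string replace, without the word-boundary
    handling a production implementation would need.
    """
    bigram = " ".join(pair)
    replacement = "".join(pair)
    return {word.replace(bigram, replacement): freq for word, freq in vocab.items()}

# Words stored as space-separated symbols with corpus frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
```

Each iteration greedily merges the most frequent adjacent pair, growing the vocabulary one symbol at a time until the target size (here, 30,000) is reached.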