ByteMeHarder-404 committed · Commit e68e4d0 · verified · 1 parent: 76a0b02 · Create README.md

---
language: en
tags:
- tokenizers
- wordpiece
- bytepairencoding
- xlnet
- nlp
license: mit
---

# Basic Tokenizers Collection

This repository contains **three different tokenizers**, trained and wrapped for experimentation and educational purposes:

## 📦 Contents
- **WordPiece Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/wordpiece`
  Classic subword tokenizer (used in BERT). Splits words into subword units based on frequency, ensuring full coverage with a compact vocab.

- **Byte-Pair Encoding (BPE) Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/bpe`
  Uses byte-level BPE, similar to GPT-2 and RoBERTa. Handles any UTF-8 character without unknown tokens by working directly on bytes.

- **XLNet-Style Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/xlnet`
  Follows the XLNet tokenization approach, using SentencePiece-style unigram segmentation.

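To illustrate the byte-level property mentioned above, here is a minimal sketch using the `tokenizers` library (the corpus and vocabulary size are illustrative, not this repo's actual training setup): a tiny byte-level BPE trained on a few sentences still encodes unseen non-ASCII text without unknown tokens and round-trips it exactly.

```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative training corpus (not the repo's actual data).
corpus = [
    "byte pair encoding merges the most frequent byte pairs",
    "byte-level models never need an unknown token",
]

tokenizer = ByteLevelBPETokenizer()
# vocab_size must cover the 256-symbol byte alphabet plus learned merges.
tokenizer.train_from_iterator(corpus, vocab_size=300)

# Any UTF-8 text encodes, even characters absent from the corpus,
# because every byte is already in the base alphabet.
encoding = tokenizer.encode("héllo 👍")
decoded = tokenizer.decode(encoding.ids)
```

Because the base alphabet covers every possible byte, coverage is guaranteed by construction rather than by vocabulary size.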
## 🚀 Usage
You can load each tokenizer with `transformers`. Since all three live in one repository, pass the subfolder name to `from_pretrained`:

```python
from transformers import PreTrainedTokenizerFast

# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="wordpiece"
)

# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="bpe"
)

# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="xlnet"
)
```

## 📚 Notes

- These tokenizers are minimal examples and **not pretrained with embeddings or models**.
- Intended for experimentation, educational purposes, and as a foundation for building custom models.
- You can extend them by training a new vocabulary on your dataset.
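As a sketch of that last point, here is one way to train a fresh WordPiece vocabulary on your own data with the `tokenizers` library; the corpus, vocabulary size, and special tokens below are illustrative placeholders, not this repo's actual settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Illustrative corpus; replace with an iterator over your dataset.
corpus = [
    "tokenizers split words into subword units",
    "wordpiece learns frequent subwords from the corpus",
]

# A WordPiece model needs an explicit unknown token.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("subword units")
```

The trained tokenizer can then be saved with `tokenizer.save("tokenizer.json")` and reloaded through `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")`.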