ByteMeHarder-404 committed on Sep 25, 2025

Commit e68e4d0 · verified · 1 Parent(s): 76a0b02

Create README.md

Files changed (1)

README.md +49 -0

README.md ADDED

	@@ -0,0 +1,49 @@

+---
+language: en
+tags:
+- tokenizers
+- wordpiece
+- bytepairencoding
+- xlnet
+- nlp
+license: mit
+---
+# Basic Tokenizers Collection
+This repository contains **three tokenizers**, trained and wrapped for experimentation and education.
+## 📦 Contents
+- **WordPiece Tokenizer**
+  Path: `ByteMeHarder-404/tokenizers/wordpiece`
+  Classic subword tokenizer (used in BERT). Splits words into subword units by greedy longest-match against a learned, compact vocabulary; text the vocabulary cannot cover falls back to an unknown token.
+- **Byte-Pair Encoding (BPE) Tokenizer**
+  Path: `ByteMeHarder-404/tokenizers/bpe`
+  Uses byte-level BPE, similar to GPT-2 and RoBERTa. Handles any UTF-8 character without unknown tokens by working directly on bytes.
+- **XLNet-Style Tokenizer**
+  Path: `ByteMeHarder-404/tokenizers/xlnet`
+  Follows the XLNet tokenization approach, using SentencePiece-style segmentation.
+## 🚀 Usage
+You can load each tokenizer with `transformers`:
+```python
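+from transformers import PreTrainedTokenizerFast
+# Hub repo IDs have the form "namespace/name", so each tokenizer directory is passed via `subfolder`.
+# Assumes the files live in the subdirectories named in the paths listed under Contents.
+REPO_ID = "ByteMeHarder-404/tokenizers"
+# WordPiece
+tok_wordpiece = PreTrainedTokenizerFast.from_pretrained(REPO_ID, subfolder="wordpiece")
+# BPE
+tok_bpe = PreTrainedTokenizerFast.from_pretrained(REPO_ID, subfolder="bpe")
+# XLNet-style
+tok_xlnet = PreTrainedTokenizerFast.from_pretrained(REPO_ID, subfolder="xlnet")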
+```
+## 📚 Notes
+- These tokenizers are minimal examples: the repository provides tokenizers only, **with no pretrained model weights or embeddings**.
+- Intended for experimentation, educational purposes, and as a foundation for building custom models.
+- You can extend them by training a new vocabulary on your dataset.
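
To make "training a new vocabulary" concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is not the code used to build the tokenizers in this repository: the function names (`train_byte_bpe`, `apply_merge`, `encode`) and the toy corpus are hypothetical, and it skips the pre-tokenization step that real byte-level BPE implementations apply, so merges here may cross word boundaries. In practice you would more likely use a library trainer, such as `train_new_from_iterator` on a fast tokenizer in `transformers`.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_byte_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from an iterable of strings (illustrative only)."""
    # Start from single bytes, so any UTF-8 text is representable without an unknown token.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Pick the most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the learned merges in the order they were learned."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


merges = train_byte_bpe(["low lower lowest", "low low"], num_merges=3)
print(encode("lowé", merges))  # → [b'low', b'\xc3', b'\xa9']
```

The character `é` never appears in the toy corpus, yet it still encodes as its two UTF-8 bytes rather than as an unknown token, which is the property the BPE entry above describes.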