---
language: en
tags:
- tokenizers
- wordpiece
- bytepairencoding
- xlnet
- nlp
license: mit
---

# Basic Tokenizers Collection

This repository contains **three different tokenizers**, trained and wrapped for experimentation and educational purposes.

## Contents

- **WordPiece Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/wordpiece`
  Classic subword tokenizer of the kind used in BERT. Splits words into subword units based on frequency, ensuring full coverage with a compact vocabulary.

- **Byte-Pair Encoding (BPE) Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/bpe`
  Uses byte-level BPE, similar to GPT-2 and RoBERTa. Handles any UTF-8 input without unknown tokens by working directly on bytes.

- **XLNet-Style Tokenizer**
  Path: `ByteMeHarder-404/tokenizers/xlnet`
  Follows the XLNet tokenization approach, using SentencePiece-style segmentation.

## Usage

You can load each tokenizer with `transformers`. Hub repository ids have two path segments, so the per-tokenizer directories listed above are passed via `subfolder`:

```python
from transformers import PreTrainedTokenizerFast

# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="wordpiece"
)

# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="bpe"
)

# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained(
    "ByteMeHarder-404/tokenizers", subfolder="xlnet"
)
```
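
Once loaded, the three tokenizers can be compared on the same input. This is a quick sanity check rather than part of the repository; the exact splits depend on each tokenizer's trained vocabulary:

```python
text = "Tokenization strategies differ."

print(tok_wordpiece.tokenize(text))  # WordPiece typically marks continuations with "##"
print(tok_bpe.tokenize(text))        # byte-level BPE typically marks spaces with "Ġ"
print(tok_xlnet.tokenize(text))      # SentencePiece-style typically marks word starts with "▁"
```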

## Notes

- These tokenizers are minimal examples and are **not pretrained with embeddings or models**.
- They are intended for experimentation, for educational purposes, and as a foundation for building custom models.
- You can extend them by training a new vocabulary on your own dataset, as sketched below.
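
As a starting point, here is a minimal sketch of training a fresh WordPiece vocabulary with the `tokenizers` library. The corpus file, vocabulary size, and special-token list are placeholders to adapt to your dataset:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Bare WordPiece model with an unknown-token fallback.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a new vocabulary on your own corpus ("corpus.txt" is a placeholder).
trainer = WordPieceTrainer(
    vocab_size=8000,  # placeholder; choose a size that fits your data
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Save in a format that transformers can load back.
tokenizer.save("tokenizer.json")
```

The saved `tokenizer.json` can then be wrapped as `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")` and used exactly like the tokenizers loaded above.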