|
|
--- |
|
|
tags: |
|
|
- tokenizer |
|
|
- bpe |
|
|
- twitch |
|
|
- chat |
|
|
- nlp |
|
|
license: mit |
|
|
library_name: none |
|
|
language: |
|
|
- en |
|
|
pretty_name: Twitch BPE Tokenizer |
|
|
--- |
|
|
|
|
|
# Twitch BPE Tokenizer |
|
|
|
|
|
A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on: |
|
|
|
|
|
- Efficient compression at word granularity while preserving roundtrip fidelity |
|
|
- Stable fast-training on very large corpora (sampling + batch merges) |
|
|
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace |
|
|
|
|
|
This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications. |
|
|
|
|
|
## Highlights |
|
|
- Granularity: word (with byte fallback for out-of-vocab code points) |
|
|
- Typical vocab size: ~53k (configurable at train time) |
|
|
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
|
|
- Space-attached merges (GPT-2 style) for better compression of common prefixes |
|
|
|
|
|
> Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab. |
|
|
|
|
|
## Intended Use & Limitations |
|
|
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics. |
|
|
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts. |
|
|
- Not a language model. It provides tokenization only (vocab, merges, and config). |
|
|
|
|
|
## Training Data & Cleaning |
|
|
- Source: Twitch-chat-style messages (user-provided corpus).
|
|
- Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe. |
|
|
- Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected (see the sketch after this list).
|
|
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters. |
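
The exact cleaning pipeline lives in this repo's training code; the snippet below is only an illustrative sketch of the steps listed above (NFC normalization, control stripping, repeat capping, and URL/mention macros). The function and pattern names are placeholders, not the repo's actual API.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{3,}")  # runs of 4+ identical characters

def clean_line(text: str, max_repeat: int = 3) -> str:
    # NFC normalization so visually identical strings share one code-point form
    text = unicodedata.normalize("NFC", text)
    # Strip control characters (category Cc), keeping ordinary whitespace
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\t\n")
    # Protect URLs and @mentions behind macros before further processing
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    # Cap repeated characters, e.g. "yooooo" -> "yooo"
    text = REPEAT_RE.sub(lambda m: m.group(1) * max_repeat, text)
    # Collapse runs of spaces where safe
    text = re.sub(r"[ ]{2,}", " ", text).strip()
    return text

print(clean_line("yooooo @viewer check https://t.tv/AbC   now"))
# -> "yooo <USER> check <URL> now"
```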
|
|
|
|
|
## Tokenization Details |
|
|
- Pretokenization splits into words and whitespace tokens, preserving structure required for good merges. |
|
|
- Byte fallback ensures lossless roundtrip for any Unicode input (a minimal sketch follows this list).
|
|
- Special tokens reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`. |
|
|
- Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early-stop on stable ratio improvements. |
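
For illustration only, here is a minimal sketch of the pretokenization, space-attached tokens, and byte-fallback behaviour described above. It is not the repo's implementation: the special-token IDs, byte-token spelling, and helper names are assumptions.

```python
import re

# Reserved special tokens with fixed, low IDs (assumed layout)
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}

# GPT-2 style pretokenization: keep the leading space attached to the word,
# so merges can learn space-prefixed tokens like " chat".
PRETOK_RE = re.compile(r" ?\S+|\s+")

def pretokenize(text: str) -> list[str]:
    return PRETOK_RE.findall(text)

def encode_word(word: str, vocab: dict[str, int]) -> list[int]:
    """Look the word up in the vocab; fall back to raw UTF-8 bytes if absent."""
    if word in vocab:
        return [vocab[word]]
    ids = []
    for b in word.encode("utf-8"):
        byte_token = f"<0x{b:02X}>"  # assumed byte-token spelling
        ids.append(vocab.get(byte_token, SPECIALS["<UNK>"]))
    return ids

# Toy vocab for demonstration; real use loads bpe_vocab.json
vocab = {**SPECIALS, " chat": 10, "hey": 11, "<0xF0>": 20, "<0x9F>": 21}
print(pretokenize("hey chat 👋"))   # ['hey', ' chat', ' 👋']
print(encode_word(" chat", vocab))  # [10]
print(encode_word("👋", vocab))     # byte-fallback ids (partly <UNK> in this toy vocab)
```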
|
|
|
|
|
## Files |
|
|
- `bpe_vocab.json` — token → id mapping
|
|
- `bpe_merges.txt` — merge rules (ordered)
|
|
- `tokenizer_config.json` — basic tokenizer configuration
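
As a quick sanity check, the artifacts can be inspected directly. The snippet below assumes the three files sit in a local `tokenizer/` directory, that the vocab is a plain token→id JSON object, and that the merges file holds one rule per line (the usual GPT-2-style layout); check the actual files for the exact contents.

```python
import json
from pathlib import Path

tok_dir = Path("tokenizer")  # adjust to wherever the artifacts live

vocab = json.loads((tok_dir / "bpe_vocab.json").read_text(encoding="utf-8"))
# Skip blank lines and any header/comment line if present
merges = [l for l in (tok_dir / "bpe_merges.txt").read_text(encoding="utf-8").splitlines()
          if l and not l.startswith("#")]
config = json.loads((tok_dir / "tokenizer_config.json").read_text(encoding="utf-8"))

print(f"vocab size:  {len(vocab)}")   # ~53k for the published artifacts
print(f"merge rules: {len(merges)}")  # ordered, highest priority first
print("first merges:", merges[:5])
print("config keys:", sorted(config))
```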
|
|
|
|
|
## Quickstart |
|
|
```python |
|
|
# Imports below assume this repo's source (the `src` package) is on your path
from huggingface_hub import snapshot_download
|
|
from src.bpe_tokenizer import BPETokenizer |
|
|
from src import config as CFG |
|
|
|
|
|
repo_id = "Soldier-Boy/twitch-tokenizer" |
|
|
|
|
|
# Download all artifacts into a local directory |
|
|
local_dir = snapshot_download(repo_id) |
|
|
|
|
|
# Optional: align granularity with training (recommended) |
|
|
CFG.TOKEN_GRANULARITY = 'word' |
|
|
|
|
|
# Initialize and use |
|
|
tok = BPETokenizer(tokenizer_dir=local_dir) |
|
|
text = "hey @user check this: https://t.tv/AbC" |
|
|
ids = tok.encode(text) |
|
|
print(ids) |
|
|
print(tok.decode(ids)) |
|
|
``` |
|
|
|
|
|
### Using without importing this repo |
|
|
If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above. |
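
For orientation, here is a heavily simplified, standalone sketch of what such a class does with the artifacts: load the vocab and ordered merges, greedily apply merges to each pretokenized word, and map the resulting pieces to ids. It omits byte fallback, macro protection, and the exact pretokenizer, so its outputs can differ from the full `BPETokenizer`; treat it as a starting point only.

```python
import json
import re
from pathlib import Path

def load_bpe(tok_dir: str):
    tok_dir = Path(tok_dir)
    vocab = json.loads((tok_dir / "bpe_vocab.json").read_text(encoding="utf-8"))
    merges = []
    for line in (tok_dir / "bpe_merges.txt").read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank/comment lines if present
        a, b = line.split()
        merges.append((a, b))
    ranks = {pair: i for i, pair in enumerate(merges)}  # earlier rules rank higher
    return vocab, ranks

def bpe_word(word: str, ranks) -> list[str]:
    """Greedy merge loop: repeatedly join the best-ranked adjacent pair."""
    pieces = list(word)
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        i = pairs.index(best)
        pieces[i:i + 2] = ["".join(best)]
    return pieces

def encode(text: str, vocab, ranks, unk="<UNK>") -> list[int]:
    ids = []
    # Keep the leading space attached, roughly matching space-attached merges
    for word in re.findall(r" ?\S+", text):
        for piece in bpe_word(word, ranks):
            ids.append(vocab.get(piece, vocab.get(unk, 0)))
    return ids

vocab, ranks = load_bpe("tokenizer")  # adjust to the snapshot directory
print(encode("hey chat", vocab, ranks))
```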
|
|
|
|
|
## Metrics & Evaluation (indicative) |
|
|
- Compression ratio (entropy): ~3.03 on ~200k validation lines |
|
|
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines) |
|
|
|
|
|
Evaluation is performed with word granularity, the entropy ratio metric, and a ~53k vocabulary. With a fixed-length-code baseline instead, ratios will differ depending on the token distribution.
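
The exact metric implementation lives in the pipeline. One common reading of an entropy-based compression ratio, shown below purely for illustration and possibly differing from the pipeline's definition, compares the bits needed under a character-level entropy code with the bits needed under a token-level entropy code for the same text.

```python
import math
from collections import Counter

def entropy_bits(symbols) -> float:
    """Total Shannon bits for a sequence under its own empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def entropy_compression_ratio(text: str, tokens: list[str]) -> float:
    # Bits for character-level coding divided by bits for token-level coding
    return entropy_bits(list(text)) / entropy_bits(tokens)

# Toy example; real evaluation uses held-out chat lines and the trained tokenizer
text = "hey chat hey chat pog"
tokens = ["hey", " chat", " hey", " chat", " pog"]
print(round(entropy_compression_ratio(text, tokens), 2))
```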
|
|
|
|
|
## Reproducibility |
|
|
Train/evaluate end-to-end with the included pipeline (from this repo): |
|
|
```bash |
|
|
# Train (example; adjust paths and targets) |
|
|
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train |
|
|
|
|
|
# Artifacts will be written to the tokenizer/ directory |
|
|
``` |
|
|
|
|
|
## Changelog |
|
|
- v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab. |
|
|
|
|
|
## License |
|
|
MIT License |
|
|
|
|
|
## Acknowledgments |
|
|
Thanks to the open-source tokenizer literature and community work around BPE and text normalization. |
|
|
|
|
|
|