---
license: mit
---

# EZ-Tokenizer: 3.47 Chars/Token with 100% Reconstruction
|
| | > **"Go ahead, try to break it. I dare you."** - A tokenizer so efficient, it feels like cheating. |
| |
|
## 🚀 Performance Highlights
- **3.47** characters per token (beats industry standards)
- **100%** perfect reconstruction on all test cases
- **50K vocab size** (smaller, smarter, faster)
- **264K tokens/second** processing speed
|
## 🔥 Benchmark This!
```python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("johnnyman1100/EZ-Tokenizer_The_Tokenizer")

# Test it yourself
text = "Your text here"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

assert text == decoded  # Try to make this fail, I'll wait...
print(f"Compression: {len(text)/len(encoded.ids):.2f} chars/token")
```
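Want to check the throughput claim too? Here's a minimal timing sketch. The original benchmark setup isn't published, so the corpus and timing method below are assumptions; your numbers will vary with hardware.

```python
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("johnnyman1100/EZ-Tokenizer_The_Tokenizer")

# Any large-ish text works; a repeated sample gives a stable timing.
sample = "The quick brown fox jumps over the lazy dog. " * 20_000

start = time.perf_counter()
encoded = tokenizer.encode(sample)
elapsed = time.perf_counter() - start

print(f"{len(encoded.ids) / elapsed:,.0f} tokens/second")
```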
|
## 🏆 Challenge
Find any text where this tokenizer:
1. Fails to reconstruct perfectly, or
2. Gets worse compression than DeepSeek/others
|
First to report a verified case gets a shoutout! A starting harness for check #1 is sketched below.
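The inputs here are just illustrative edge cases (mixed scripts, emoji, control characters), not a verified failure:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("johnnyman1100/EZ-Tokenizer_The_Tokenizer")

tricky = [
    "naïve café résumé",              # accented Latin
    "日本語テキストと中文混排",          # CJK
    "emoji soup 🤖🔥🚀 and ZWJ 👩‍💻",   # emoji, incl. zero-width joiner
    "tabs\tand\r\nnewlines",          # whitespace / control characters
    "```code fences``` & <html>",     # markup-ish text
]

for text in tricky:
    decoded = tokenizer.decode(tokenizer.encode(text).ids)
    print("OK  " if decoded == text else "FAIL", repr(text))
```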
|
## 📊 Technical Details
- **Vocabulary**: 50,000 tokens
- **Tested on**: 1.7M+ characters of mixed content
- **Perfect reconstruction** on all test cases
- **1.23× faster** than DeepSeek's tokenizer (comparison sketch below)
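The speed and compression comparisons are easy to rerun yourself. Here's one way to do it; the DeepSeek repo id below is an assumption (swap in any baseline you like), and the script needs `transformers` installed:

```python
import time
from tokenizers import Tokenizer
from transformers import AutoTokenizer

ez = Tokenizer.from_pretrained("johnnyman1100/EZ-Tokenizer_The_Tokenizer")
baseline = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

# Supply your own corpus; "your_corpus.txt" is a placeholder.
text = open("your_corpus.txt", encoding="utf-8").read()

def bench(name, encode):
    start = time.perf_counter()
    n_tokens = len(encode(text))
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(text) / n_tokens:.2f} chars/token, "
          f"{n_tokens / elapsed:,.0f} tokens/s")

bench("EZ-Tokenizer", lambda t: ez.encode(t).ids)
bench("Baseline", lambda t: baseline(t, add_special_tokens=False)["input_ids"])
```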
|
## 🤔 Why This Matters
Because in a world of bloated models, efficiency still wins. This tokenizer shows you don't need a 100K+ token vocabulary to achieve perfect reconstruction and better compression.
|
## ⚖️ License
MIT
|
---
|
| | *"I didn't believe it either until I saw the benchmarks." - You, probably* |
| |
|