---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
results:
- task:
type: translation
name: Leetspeak Decoding
metrics:
- type: accuracy
name: Mixed-Number Accuracy
value: 100.0
- type: accuracy
name: Basic Leet Accuracy
value: 100.0
---
# ByT5 Leetspeak Decoder V3 (Production)
**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**
Built on `google/byt5-base`, **V3** represents a major shift in training methodology from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the contextual ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").
## Key Improvements in V3
| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `|<1||`, `|-|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |
### The "Number Problem" Solved
V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
def decode_leet(text):
    # ByT5 is byte-level, so no special preprocessing is needed.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test Cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.
print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest)
print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (Expands abbreviations)
```
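If you prefer the high-level API, the same checkpoint can be driven through a `text2text-generation` pipeline, since it loads as a standard seq2seq model. This is a minimal sketch, not part of the official examples:

```python
from transformers import pipeline

# Minimal sketch: the text2text-generation pipeline runs the same
# tokenize -> generate -> decode loop shown above.
decoder = pipeline("text2text-generation", model="ilyyeees/byt5-leetspeak-decoder")

result = decoder("1t5 2 l8 4 2 people", max_length=256, num_beams=4)
print(result[0]["generated_text"])
# Expected (per the example above): It is too late for 2 people.
```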
## Training Methodology
V3 was trained on 2x NVIDIA RTX 5090s using a custom **Reverse-Corruption Pipeline** (a toy sketch follows this list):
1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives": specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
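To make the reverse-corruption idea concrete, here is a toy sketch of turning clean text into (leetspeak, clean) training pairs, with difficulty phases standing in for the curriculum. The substitution tables, corruption probability, and phase names are illustrative placeholders, not the actual pipeline:

```python
import random

# Illustrative substitution tables per curriculum phase (not the real pipeline).
PHASES = {
    "easy":   {"a": "4", "e": "3", "o": "0"},
    "medium": {"a": "4", "e": "3", "o": "0", "i": "1", "s": "5", "t": "7"},
    "hard":   {"a": "4", "e": "3", "o": "0", "i": "1", "s": "5", "t": "7",
               "k": "|<", "h": "|-|"},
}

def corrupt(clean: str, phase: str = "easy", p: float = 0.5) -> str:
    """Reverse-corruption: corrupt clean text into leetspeak, one char at a time."""
    table = PHASES[phase]
    return "".join(
        table[c.lower()] if c.lower() in table and random.random() < p else c
        for c in clean
    )

# A (source, target) pair for seq2seq training: the model learns corrupt -> clean.
clean = "it is too late for 2 people"
pair = (corrupt(clean, phase="hard", p=0.6), clean)
print(pair)
```

Training on pairs generated this way is what lets the decoder map noisy bytes back to standard English without ever needing hand-labeled leetspeak data.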
## Limitations & Bias
- **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to "not gonna lie" or `idk` to "I don't know"). It generally avoids outputting slang words like "gonna" or "wanna" unless strongly prompted.
- **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold. A simple pass-through guard is sketched below.
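If your data contains many very short strings, one option is to pass them through unchanged rather than invoking the model. `MIN_LEET_LENGTH` below is an arbitrary illustrative threshold, not something the model exposes:

```python
MIN_LEET_LENGTH = 3  # arbitrary cutoff; tune for your data

def decode_leet_safe(text: str) -> str:
    # Very short inputs are ambiguous (see the limitation above), so leave them as-is.
    if len(text.strip()) < MIN_LEET_LENGTH:
        return text
    return decode_leet(text)  # decode_leet() is defined in the Usage section
```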
## Links
- **GitHub Repository:** [ilyyeees/leet-speak-decoder](https://github.com/ilyyeees/leet-speak-decoder)
- **V2 Model (Legacy):** `byt5-leetspeak-decoder-v2`