---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
results:
- task:
type: translation
name: Leetspeak Decoding
metrics:
- type: accuracy
name: Mixed-Number Accuracy
value: 100.0
- type: accuracy
name: Basic Leet Accuracy
value: 100.0
---
# ByT5 Leetspeak Decoder V3 (Production)
**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**
Built on `google/byt5-base`, **V3** represents a major shift in training methodology from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the contextual ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").
## Key Improvements in V3
| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `|<1||`, `|-|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |
### The "Number Problem" Solved
V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
def decode_leet(text):
    # ByT5 is byte-level, so no special preprocessing is needed.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test Cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.
print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest)
print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (Expands abbreviations)
```
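If you prefer the high-level API, the same checkpoint can be driven through a `text2text-generation` pipeline, since it loads as a standard seq2seq model. This is a minimal sketch, not part of the official examples:

```python
from transformers import pipeline

# Minimal sketch: the text2text-generation pipeline runs the same
# tokenize -> generate -> decode loop shown above.
decoder = pipeline("text2text-generation", model="ilyyeees/byt5-leetspeak-decoder")

result = decoder("1t5 2 l8 4 2 people", max_length=256, num_beams=4)
print(result[0]["generated_text"])
# Expected (per the example above): It is too late for 2 people.
```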
## Training Methodology
V3 was trained on 2x NVIDIA RTX 5090s using a custom **Reverse-Corruption Pipeline** (a toy sketch follows this list):
1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives": specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
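To make the reverse-corruption idea concrete, here is a toy sketch of turning clean text into (leetspeak, clean) training pairs, with difficulty phases standing in for the curriculum. The substitution tables, corruption probability, and phase names are illustrative placeholders, not the actual pipeline:

```python
import random

# Illustrative substitution tables per curriculum phase (not the real pipeline).
PHASES = {
    "easy":   {"a": "4", "e": "3", "o": "0"},
    "medium": {"a": "4", "e": "3", "o": "0", "i": "1", "s": "5", "t": "7"},
    "hard":   {"a": "4", "e": "3", "o": "0", "i": "1", "s": "5", "t": "7",
               "k": "|<", "h": "|-|"},
}

def corrupt(clean: str, phase: str = "easy", p: float = 0.5) -> str:
    """Reverse-corruption: corrupt clean text into leetspeak, one char at a time."""
    table = PHASES[phase]
    return "".join(
        table[c.lower()] if c.lower() in table and random.random() < p else c
        for c in clean
    )

# A (source, target) pair for seq2seq training: the model learns corrupt -> clean.
clean = "it is too late for 2 people"
pair = (corrupt(clean, phase="hard", p=0.6), clean)
print(pair)
```

Training on pairs generated this way is what lets the decoder map noisy bytes back to standard English without ever needing hand-labeled leetspeak data.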
## Limitations & Bias
- **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to "not gonna lie" or `idk` to "I don't know"). It generally avoids outputting slang words like "gonna" or "wanna" unless strongly prompted.
- **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold. A simple pass-through guard is sketched below.
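If your data contains many very short strings, one option is to pass them through unchanged rather than invoking the model. `MIN_LEET_LENGTH` below is an arbitrary illustrative threshold, not something the model exposes:

```python
MIN_LEET_LENGTH = 3  # arbitrary cutoff; tune for your data

def decode_leet_safe(text: str) -> str:
    # Very short inputs are ambiguous (see the limitation above), so leave them as-is.
    if len(text.strip()) < MIN_LEET_LENGTH:
        return text
    return decode_leet(text)  # decode_leet() is defined in the Usage section
```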
## Links
- **GitHub Repository:** [ilyyeees/leet-speak-decoder](https://github.com/ilyyeees/leet-speak-decoder)
- **V2 Model (Legacy):** `byt5-leetspeak-decoder-v2`