---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
  results:
  - task:
      type: translation
      name: Leetspeak Decoding
    metrics:
    - type: accuracy
      name: Mixed-Number Accuracy
      value: 100.0
    - type: accuracy
      name: Basic Leet Accuracy
      value: 100.0
---

# ByT5 Leetspeak Decoder V3 (Production)

**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**

Built on `google/byt5-base`, **V3** represents a major architectural shift from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the contextual ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").

## Key Improvements in V3

| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `\|<1\|\|`, `\|-\|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |

### The "Number Problem" Solved

V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest.)

print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (Expands abbreviations.)
```

## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom **Reverse-Corruption Pipeline**:

1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives": specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.

## Limitations & Bias

* **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to *not gonna lie* or `idk` to *I don't know*). It generally avoids outputting slang words like *gonna* or *wanna* unless strongly prompted.
* **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.
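To make the Reverse-Corruption idea concrete, here is a minimal sketch of how clean text can be corrupted into leetspeak to produce (corrupted, clean) training pairs. The substitution table, probabilities, and function names here are illustrative assumptions, not the exact pipeline used to train V3 (which additionally relies on LLM-generated hard negatives and a difficulty curriculum).

```python
import random

# Assumed, simplified substitution table; the real pipeline covers far
# more patterns (visual obfuscation, multi-character glyphs, etc.).
LEET_MAP = {
    "a": ["4", "@"],
    "e": ["3"],
    "i": ["1", "!"],
    "o": ["0"],
    "s": ["5", "$"],
    "t": ["7"],
}

def corrupt(text, p=0.6, seed=None):
    """Replace characters with leet equivalents with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        subs = LEET_MAP.get(ch.lower())
        if subs and rng.random() < p:
            out.append(rng.choice(subs))
        else:
            out.append(ch)
    return "".join(out)

# A training pair is then (corrupted input, clean target):
clean = "it is too late for that"
pair = (corrupt(clean, seed=0), clean)
```

The model learns the inverse mapping, so only the clean side of each pair needs to be high quality, which is why grounding the base corpus in WikiText/ELI5 matters.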
## Links

* **GitHub Repository:** ilyyeees/leet-speak-decoder
* **V2 Model (Legacy):** byt5-leetspeak-decoder-v2