---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
  results:
  - task:
      type: translation
      name: Leetspeak Decoding
    metrics:
    - type: accuracy
      name: Mixed-Number Accuracy
      value: 100.0
    - type: accuracy
      name: Basic Leet Accuracy
      value: 100.0
---

# ByT5 Leetspeak Decoder V3 (Production)

**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**

Built on `google/byt5-base`, **V3** marks a major shift in training methodology from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the context ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").

## Key Improvements in V3

| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `|<1||`, `|-|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |

### The "Number Problem" Solved

V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.

* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test Cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest)

print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (Expands abbreviations)
```
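The CER metric listed in the metadata can be checked locally against reference transcriptions. Below is a minimal, dependency-free sketch of character error rate as character-level edit distance normalized by reference length; the function names are illustrative and not part of the model card's own evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("it is too late", "it is too late"))  # 0.0 for an exact match
print(cer("it is too late", "it's to late"))
```

For batch evaluation you would average `cer` over all (reference, model output) pairs in your test set.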

## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom Reverse-Corruption Pipeline:

1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives", specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
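The "simple character swaps" stage of a reverse-corruption pipeline can be sketched as follows: start from clean text and corrupt it, yielding (corrupt, clean) pairs for seq2seq training. The mapping table and swap probability here are illustrative assumptions; the actual pipeline also involves LLM-generated hard negatives and curriculum phases not reproduced in this sketch.

```python
import random

# Illustrative character-to-leet mapping (an assumption, not the
# pipeline's real table).
LEET_MAP = {
    "a": "4", "e": "3", "i": "1", "o": "0",
    "s": "5", "t": "7",
}

def corrupt(text: str, p: float = 0.7, seed: int = 0) -> str:
    """Swap each mappable character into its leet form with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        out.append(sub if sub is not None and rng.random() < p else ch)
    return "".join(out)

def make_pairs(sentences, seed=0):
    """Build (corrupted, clean) pairs for seq2seq training."""
    return [(corrupt(s, seed=seed + i), s) for i, s in enumerate(sentences)]

for corrupted, clean in make_pairs(["it is too late for that", "see you later"]):
    print(corrupted, "->", clean)
```

Higher curriculum phases would increase `p` and extend the mapping with multi-character visual substitutions (e.g., `|<` for "k") before mixing in number-quantity ambiguity.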

## Limitations & Bias

* **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to "not gonna lie" or `idk` to "I don't know"). It generally avoids outputting slang words like "gonna" or "wanna" unless strongly prompted.
* **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.

## Links

* **GitHub Repository:** `ilyyeees/leet-speak-decoder`
* **V2 Model (Legacy):** `byt5-leetspeak-decoder-v2`